Datasets

Collection of common benchmark datasets from fairness research.

Each dataset object holds the actual data in a pandas.DataFrame, available as the df attribute. The dataset object takes care of loading, preprocessing, and validating the data. Preprocessing follows the standard practice associated with each dataset, taken either from its manual (e.g., README) or from how others have handled it in the literature.

See responsibly.dataset.Dataset for additional attributes and complete documentation.

Currently these are the available datasets:

ProPublica recidivism/COMPAS dataset, see: COMPASDataset

Adult dataset, see: AdultDataset

German credit dataset, see: GermanDataset

FICO credit score dataset, see: build_FICO_dataset()

Usage

>>> from responsibly.dataset import COMPASDataset
>>> compas_ds = COMPASDataset()
>>> print(compas_ds)
<ProPublica Recidivism/COMPAS Dataset. 6172 rows, 56 columns in
which {race, sex} are sensitive attributes>
>>> type(compas_ds.df)
<class 'pandas.core.frame.DataFrame'>
>>> compas_ds.df['race'].value_counts()
African-American    3175
Caucasian           2103
Hispanic             509
Other                343
Asian                 31
Native American       11
Name: race, dtype: int64

General Dataset

class responsibly.dataset.Dataset(target, sensitive_attributes, prediction=None)

Base class for datasets.

Attributes

df - pandas.DataFrame that holds the actual data.
target - Column name of the variable to predict (ground truth).
sensitive_attributes - Column names of the sensitive attributes.
prediction - Column name(s) of the prediction (optional).
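
The attributes above all refer to columns of df, so one thing a dataset object can validate is that every named column actually exists. The following is a minimal sketch of such a check, not the library's implementation: the helper check_dataset_columns and all column names other than race and sex are hypothetical and used only for illustration.

```python
def check_dataset_columns(columns, target, sensitive_attributes, prediction=None):
    """Return the role columns that are missing from `columns`.

    Hypothetical helper for illustration; it is not part of responsibly.
    """
    available = set(columns)
    required = [target, *sensitive_attributes]
    if prediction is not None:
        # Accept either a single column name or a list of names.
        if isinstance(prediction, str):
            required.append(prediction)
        else:
            required.extend(prediction)
    return [name for name in required if name not in available]


# Toy column names (assumed for this example, not read from a real dataset).
columns = ["race", "sex", "age", "two_year_recid"]

missing = check_dataset_columns(
    columns,
    target="two_year_recid",
    sensitive_attributes=["race", "sex"],
    prediction="y_pred",  # deliberately absent from `columns`
)
print(missing)  # ['y_pred']
```

With a real dataset object, the first argument would be the column labels of its df attribute.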

Available Datasets

class responsibly.dataset.COMPASDataset

ProPublica Recidivism/COMPAS Dataset.

See Dataset for a description of the arguments and attributes.

References: https://github.com/propublica/compas-analysis

class responsibly.dataset.AdultDataset

Adult Dataset.

See Dataset for a description of the arguments and attributes.

References: https://archive.ics.uci.edu/ml/datasets/adult

class responsibly.dataset.GermanDataset

German Credit Dataset.

See Dataset for a description of the arguments and attributes.

References:

https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
Kamiran, F., & Calders, T. (2009, February). Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication (pp. 1-6). IEEE. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.182.6067&rep=rep1&type=pdf

Extra

This dataset requires use of a cost matrix (see below)

   1 2
   ----
1 | 0 1
  |----
2 | 5 0

(1 = Good, 2 = Bad)

The rows represent the actual classification and the columns the predicted classification. Classifying a customer as good when they are bad (cost 5) is worse than classifying a customer as bad when they are good (cost 1).
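
To make the asymmetry concrete, the sketch below applies the cost matrix to two sets of toy predictions that make the same number of mistakes. The function total_cost and the example labels are illustrative assumptions, not part of the library or the dataset.

```python
# Cost matrix from the text, indexed as COST[actual][predicted].
GOOD, BAD = 1, 2
COST = {
    GOOD: {GOOD: 0, BAD: 1},
    BAD: {GOOD: 5, BAD: 0},
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    return sum(COST[a][p] for a, p in zip(actual, predicted))


# Toy labels (made up for this example): three good and two bad customers.
actual = [GOOD, GOOD, GOOD, BAD, BAD]

# Both predictions contain exactly two errors, but of different kinds.
cautious = [BAD, BAD, GOOD, BAD, BAD]      # two good customers rejected
lenient = [GOOD, GOOD, GOOD, GOOD, GOOD]   # two bad customers accepted

print(total_cost(actual, cautious))  # 2
print(total_cost(actual, lenient))   # 10
```

Both predictions have the same accuracy, yet the lenient one is five times as costly, which is why plain accuracy is not a sufficient evaluation measure for this dataset.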

FICO Dataset

responsibly.dataset.build_FICO_dataset()

Build the FICO dataset.

Dataset of the TransUnion credit score (called TransRisk). The TransRisk score is in turn based on a proprietary model created by FICO, hence it is often referred to as a FICO score.

The data is aggregated, i.e., there is no outcome or prediction information per individual, only summary statistics for each FICO score and race/ethnicity group.

FICO key	Meaning
total	Total number of individuals
totals	Number of individuals per group
cdf	Cumulative distribution function of score per group
pdf	Probability distribution function of score per group
performance	Fraction of non-defaulters per score and group
base_rates	Base rate of non-defaulters per group
base_rate	Overall base rate of non-defaulters
proportions	Fraction of individuals per group
fpr	False Positive Rate by score as threshold per group
tpr	True Positive Rate by score as threshold per group
rocs	ROC per group
aucs	ROC AUC per group

Returns: Dictionary of various aggregated statistics of the FICO credit score.
Return type: dict
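
Because the data is aggregated, threshold-based rates have to be derived from the per-score distribution rather than from individual records. The sketch below shows one way the pdf and performance entries of a single group relate to its base rate, true positive rate and false positive rate. It rests on assumptions not stated in the source: a non-defaulter counts as the positive class, a score at or above the threshold counts as a positive prediction, and the three-score distribution is toy data. It is not the library's own computation.

```python
def rates_by_threshold(scores, pdf, performance):
    """Derive base rate, TPR and FPR per threshold for one group.

    scores      -- score values, sorted in ascending order
    pdf         -- fraction of the group at each score (sums to 1)
    performance -- fraction of non-defaulters at each score
    """
    # Probability mass of non-defaulters (positives) and defaulters (negatives).
    pos = [p * r for p, r in zip(pdf, performance)]
    neg = [p * (1 - r) for p, r in zip(pdf, performance)]

    base_rate = sum(pos)
    tpr, fpr = {}, {}
    for i, threshold in enumerate(scores):
        # Everyone with a score >= threshold is predicted positive.
        tpr[threshold] = sum(pos[i:]) / sum(pos)
        fpr[threshold] = sum(neg[i:]) / sum(neg)
    return base_rate, tpr, fpr


# Toy distribution for a single group (made-up numbers).
scores = [300, 500, 700]
pdf = [0.2, 0.3, 0.5]
performance = [0.1, 0.5, 0.9]

base_rate, tpr, fpr = rates_by_threshold(scores, pdf, performance)
print(round(base_rate, 2))  # 0.62
print({t: round(v, 3) for t, v in tpr.items()})
print({t: round(v, 3) for t, v in fpr.items()})
```

Raising the threshold lowers both rates. Plotting fpr against tpr over all thresholds gives the kind of per-group ROC curve that the rocs entry describes.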

References:

Based on code (MIT License) by Moritz Hardt from https://github.com/fairmlbook/fairmlbook.github.io
https://fairmlbook.org/demographic.html#case-study-credit-scoring
