Datasets
A collection of common benchmark datasets from fairness research.
Each dataset object contains a pandas.DataFrame, exposed as the df attribute, that holds the actual data.
The dataset object takes care of loading, preprocessing, and validating the data.
Preprocessing follows the standard practices associated with each dataset, taken either from its manual (e.g., README) or from how others have handled it in the literature.
See responsibly.dataset.Dataset for additional attributes and complete documentation.
Currently, these datasets are available:
- ProPublica recidivism/COMPAS dataset, see COMPASDataset
- Adult dataset, see AdultDataset
- German credit dataset, see GermanDataset
- FICO credit score dataset, see build_FICO_dataset()
Usage
>>> from responsibly.dataset import COMPASDataset
>>> compas_ds = COMPASDataset()
>>> print(compas_ds)
<ProPublica Recidivism/COMPAS Dataset. 6172 rows, 56 columns in
which {race, sex} are sensitive attributes>
>>> type(compas_ds.df)
<class 'pandas.core.frame.DataFrame'>
>>> compas_ds.df['race'].value_counts()
African-American 3175
Caucasian 2103
Hispanic 509
Other 343
Asian 31
Native American 11
Name: race, dtype: int64
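As a small consistency check on the output above, the per-group counts can be turned into shares of the whole dataset. This is a minimal sketch in plain Python: the counts are copied by hand from the output rather than read from compas_ds.df, and the names race_counts and shares are chosen here for illustration.

```python
# Counts copied by hand from the value_counts() output above.
race_counts = {
    "African-American": 3175,
    "Caucasian": 2103,
    "Hispanic": 509,
    "Other": 343,
    "Asian": 31,
    "Native American": 11,
}

# The counts should add up to the number of rows in the dataset summary.
total = sum(race_counts.values())

# Share of each group among all rows.
shares = {group: count / total for group, count in race_counts.items()}

print(total)  # 6172, matching the row count in the dataset summary
for group, share in shares.items():
    print(f"{group}: {share:.1%}")
```

With the DataFrame itself, compas_ds.df['race'].value_counts(normalize=True) should give the same shares directly.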
General Dataset
class responsibly.dataset.Dataset(target, sensitive_attributes, prediction=None)
Base class for datasets.
- Attributes
  - df - pandas.DataFrame that holds the actual data.
  - target - Column name of the variable to predict (ground truth)
  - sensitive_attributes - Column names of the sensitive attributes
  - prediction - Column name of the prediction (optional)
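To make the roles of these attributes concrete, here is a minimal sketch of a container that stores the same column names and checks that they exist in the data. It is a hypothetical stand-in written for illustration, not the library's Dataset class: ToyDataset, its rows attribute (a list of dicts playing the role of df), and the _validate helper are all assumptions, and the real class may validate differently.

```python
class ToyDataset:
    """Hypothetical stand-in for illustration; not responsibly's Dataset."""

    def __init__(self, rows, target, sensitive_attributes, prediction=None):
        self.rows = rows  # list of dicts; plays the role of the df attribute
        self.target = target
        self.sensitive_attributes = list(sensitive_attributes)
        self.prediction = prediction  # optional, may stay None
        self._validate()

    def _validate(self):
        """Check that every named column is present in the data."""
        columns = set(self.rows[0]) if self.rows else set()
        required = [self.target, *self.sensitive_attributes]
        if self.prediction is not None:
            required.append(self.prediction)
        missing = [name for name in required if name not in columns]
        if missing:
            raise ValueError(f"missing columns: {missing}")


# Made-up rows for illustration only.
rows = [
    {"race": "A", "sex": "F", "label": 1, "score": 0.8},
    {"race": "B", "sex": "M", "label": 0, "score": 0.3},
]
toy_ds = ToyDataset(
    rows,
    target="label",
    sensitive_attributes=["race", "sex"],
    prediction="score",
)
print(toy_ds.target, toy_ds.sensitive_attributes, toy_ds.prediction)
```

Passing a column name that is not in the rows, for example target="outcome", raises ValueError in this sketch.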
Available Datasets
class responsibly.dataset.COMPASDataset
ProPublica Recidivism/COMPAS Dataset.
See Dataset for a description of the arguments and attributes.
class responsibly.dataset.AdultDataset
Adult Dataset.
See Dataset for a description of the arguments and attributes.
class responsibly.dataset.GermanDataset
German Credit Dataset.
See Dataset for a description of the arguments and attributes.
- References:
https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
Kamiran, F., & Calders, T. (2009, February). Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication (pp. 1-6). IEEE. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.182.6067&rep=rep1&type=pdf
- Extra
This dataset requires use of a cost matrix (see below)
      1   2
    --------
1 |   0   1
2 |   5   0
(1 = Good, 2 = Bad)
The rows represent the actual classification and the columns the predicted classification. It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).
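The cost matrix can be applied to a set of predictions by summing the cost of each (actual, predicted) pair. The following is a minimal sketch in plain Python; the COST dictionary, the total_cost helper, and the example labels are made up for illustration and are not part of the library.

```python
# Cost matrix from the description above: keys are (actual, predicted),
# with 1 = Good and 2 = Bad.
COST = {
    (1, 1): 0,  # good customer classed as good
    (1, 2): 1,  # good customer classed as bad
    (2, 1): 5,  # bad customer classed as good
    (2, 2): 0,  # bad customer classed as bad
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))


# Made-up labels for illustration only.
actual = [1, 1, 2, 2, 1]
predicted = [1, 2, 1, 2, 1]
cost = total_cost(actual, predicted)
print(cost)  # 6: one good-as-bad error (1) plus one bad-as-good error (5)
```

Because the two error types are weighted differently, two classifiers with the same accuracy can have different total costs.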
FICO Dataset
responsibly.dataset.build_FICO_dataset()
Build the FICO dataset.
Dataset of TransUnion's credit score, called TransRisk. The TransRisk score is in turn based on a proprietary model created by FICO, so TransRisk scores are often referred to as FICO scores.
The data is aggregated, i.e., there is no outcome or prediction information per individual, only summary statistics for each FICO score and race/ethnicity group.
FICO key      Meaning
total         Total number of individuals
totals        Number of individuals per group
cdf           Cumulative distribution function of score per group
pdf           Probability distribution function of score per group
performance   Fraction of non-defaulters per score and group
base_rates    Base rate of non-defaulters per group
base_rate     The overall base rate of non-defaulters
proportions   Fraction of individuals per group
fpr           False Positive Rate by score as threshold per group
tpr           True Positive Rate by score as threshold per group
rocs          ROC per group
aucs          ROC AUC per group
- Returns
Dictionary of various aggregated statistics of the FICO credit score.
- Return type
dict
- References:
Based on code (MIT License) by Moritz Hardt from https://github.com/fairmlbook/fairmlbook.github.io
https://fairmlbook.org/demographic.html#case-study-credit-scoring
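Because the data is aggregated, threshold-based rates such as tpr and fpr have to be derived from per-score statistics instead of individual records. The sketch below shows one way to do that for a single group from a score distribution (pdf) and the fraction of non-defaulters per score (performance). It rests on assumptions that are not confirmed by this page: non-defaulters are treated as the positive class, and an applicant is accepted when the score is at or above the threshold. The function name and the numbers are made up for illustration, and the library's own computation may differ.

```python
def rates_at_threshold(scores, pdf, performance, threshold):
    """Return (tpr, fpr) for the rule "accept if score >= threshold".

    pdf[i] is the fraction of the group with scores[i];
    performance[i] is the fraction of non-defaulters at scores[i].
    """
    # Mass of non-defaulters (positives) and defaulters (negatives) overall.
    pos_total = sum(p * perf for p, perf in zip(pdf, performance))
    neg_total = sum(p * (1 - perf) for p, perf in zip(pdf, performance))

    # Mass of each class that falls at or above the threshold.
    pos_accepted = sum(
        p * perf
        for s, p, perf in zip(scores, pdf, performance)
        if s >= threshold
    )
    neg_accepted = sum(
        p * (1 - perf)
        for s, p, perf in zip(scores, pdf, performance)
        if s >= threshold
    )
    return pos_accepted / pos_total, neg_accepted / neg_total


# Made-up aggregated statistics for one group, for illustration only.
scores = [300, 500, 700]
pdf = [0.2, 0.5, 0.3]
performance = [0.1, 0.5, 0.9]

# Base rate of non-defaulters implied by the toy numbers.
base_rate = sum(p * perf for p, perf in zip(pdf, performance))

tpr, fpr = rates_at_threshold(scores, pdf, performance, threshold=500)
print(f"base rate={base_rate:.2f}, tpr={tpr:.3f}, fpr={fpr:.3f}")
```

Sweeping the threshold over all scores and collecting the (fpr, tpr) pairs traces out a ROC curve for the group, which is the kind of quantity the rocs and aucs entries summarize.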