terazi - AI Fairness for Doubly Imbalanced Data

A multi-criteria solution for training classifiers that balances fairness and performance

What sets terazi apart?

terazi is an AI fairness tool that finds an optimal model for a given dataset. What sets terazi apart from existing methods is that it specialises in doubly imbalanced datasets, where both the unfavourable label distribution and the privileged group distribution are skewed, rather than every subgroup having similar proportions. With its sampling algorithm and multi-criteria optimization, terazi ensures that classification performance is maximized under fair conditions.

Demo Datasets

BAF

Bank Account Fraud dataset (base version), generated from real-world data to preserve privacy and increase instance count (Jesus et al., 2022).

CCF

Credit Card Fraud dataset from Kaggle.

VIF

Vehicle Insurance Fraud dataset from Kaggle.

How does terazi work?

terazi works by sampling a dataset while searching for an optimal balance structure. Given the original data collection \(D\), the goal is to construct a sampled dataset \(D'\) with a certain target balance structure. The sampled collection \(D'\) is composed of the following four partitions:

  • \(p_f':\) privileged favourable samples
  • \(p_{uf}':\) privileged unfavourable samples
  • \(up_f':\) unprivileged favourable samples
  • \(up_{uf}':\) unprivileged unfavourable samples
The balance structure of the sampled set \(D'\) is controlled with three parameters, \(\alpha\), \(\beta\), and \(\gamma\). Each parameter takes a value in \([0, 1]\) and represents the distribution of a specific subgroup in \(D'\) (a sketch of the sampling scheme follows the list below). The subgroups they control are as follows:

  • Parameter \(\alpha:\) controls the unprivileged group rate within \(D'\).
  • Parameter \(\beta:\) controls the rate of unfavourably labelled instances within the unprivileged group.
  • Parameter \(\gamma:\) controls the rate of unfavourably labelled instances within the privileged group.
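
The following is a minimal sketch of this sampling scheme, assuming a pandas DataFrame with hypothetical label_col and group_col columns that hold favourable/unfavourable and privileged/unprivileged values; terazi's actual implementation and API may differ:

```python
import pandas as pd

def sample_balance(D, label_col, group_col, alpha, beta, gamma, n_total, seed=0):
    """Draw a sampled set D' of size n_total whose balance structure
    follows the alpha, beta and gamma parameters described above."""
    n_up = int(round(alpha * n_total))      # unprivileged group size in D'
    n_p = n_total - n_up                    # privileged group size in D'
    n_up_uf = int(round(beta * n_up))       # unfavourable count within the unprivileged group
    n_p_uf = int(round(gamma * n_p))        # unfavourable count within the privileged group

    # Target sizes for the four partitions of D'
    targets = {
        ("privileged", "favourable"): n_p - n_p_uf,        # p_f'
        ("privileged", "unfavourable"): n_p_uf,            # p_uf'
        ("unprivileged", "favourable"): n_up - n_up_uf,    # up_f'
        ("unprivileged", "unfavourable"): n_up_uf,         # up_uf'
    }

    parts = []
    for (group, label), n in targets.items():
        pool = D[(D[group_col] == group) & (D[label_col] == label)]
        # Oversample with replacement when a partition is smaller than its target
        parts.append(pool.sample(n=n, replace=len(pool) < n, random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffled D'
```
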
The sampling parameters are also controlled with a level option that sets the search depth and determines the interval at which their values are updated (a sketch of the resulting parameter grid follows the list below):
  • Level 0: Parameter values are selected at intervals of \(0.1\) within \([0, 1]\).
  • Level 1: Parameter values are selected at intervals of \(0.01\) within \([0, 1]\).
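
A minimal sketch of how the candidate parameter values could be enumerated at each level; the helper name and grid construction are illustrative rather than terazi's actual code:

```python
import numpy as np
from itertools import product

def parameter_grid(level):
    """Enumerate candidate (alpha, beta, gamma) triples in [0, 1]
    at the step size implied by the search-depth level."""
    step = 0.1 if level == 0 else 0.01
    n_steps = int(round(1.0 / step))
    values = np.round(np.linspace(0.0, 1.0, n_steps + 1), 2)
    return product(values, repeat=3)  # every (alpha, beta, gamma) combination
```

At level 0 this grid contains \(11^3 = 1331\) candidate combinations, while at level 1 it contains \(101^3 = 1030301\), so the deeper level is considerably more expensive to search.
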
Each sampled set \(D'\) is used to train the selected classifier, while optimizing performance according to the following loss function:
$$ \text{loss} = (1-\text{DI\_Ratio}) + (1-\text{MCC}) $$
where \(\text{DI\_Ratio}\) is the Disparate Impact Ratio metric, defined as:
$$ \text{DI\_Ratio} = \frac{P(L=\text{unfavourable} \mid G=\text{unprivileged})}{P(L=\text{unfavourable} \mid G=\text{privileged})} $$
and MCC is the Matthews Correlation Coefficient, a performance metric commonly used when the class labels are imbalanced, defined as:
$$ \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$
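
The following is a worked sketch of this objective, assuming binary encodings (1 = unfavourable label, 1 = unprivileged group) and using scikit-learn's matthews_corrcoef for MCC; the function names and encodings are illustrative rather than terazi's actual API:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def di_ratio(y_pred, group, unfavourable=1, unprivileged=1):
    """Disparate Impact Ratio: the unfavourable prediction rate in the
    unprivileged group divided by the same rate in the privileged group."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    up = group == unprivileged
    rate_unprivileged = np.mean(y_pred[up] == unfavourable)
    rate_privileged = np.mean(y_pred[~up] == unfavourable)
    return rate_unprivileged / rate_privileged

def terazi_loss(y_true, y_pred, group):
    """loss = (1 - DI_Ratio) + (1 - MCC), as defined above."""
    return (1 - di_ratio(y_pred, group)) + (1 - matthews_corrcoef(y_true, y_pred))
```

A \(\text{DI\_Ratio}\) close to \(1\) and an MCC close to \(1\) both drive the loss towards \(0\), so minimising this single objective jointly rewards fairness and classification performance.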