terazi works by sampling a dataset while searching for an optimal balance structure. Given the original data collection \(D\), the goal is to construct a sampled dataset \(D'\) which has a certain target balance structure. The sampled collection \(D'\) is composed of the following four partitions:
- \(p\_f':\) privileged favourable samples
- \(p\_uf':\) privileged unfavourable samples
- \(up\_f':\) unprivileged favourable samples
- \(up\_uf':\) unprivileged unfavourable samples
The balance structure of the sampled dataset \(D'\) is controlled by three parameters, \(\alpha\), \(\beta\), and \(\gamma\). Each parameter takes a value between \(0\) and \(1\) and represents the proportion of a specific subgroup in \(D'\). The subgroups they control are as follows:
- Parameter \(\alpha:\) controls the unprivileged group rate within \(D'\).
- Parameter \(\beta:\) controls the rate of unfavourably labelled instances within the unprivileged group.
- Parameter \(\gamma:\) controls the rate of unfavourably labelled instances within the privileged group.
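To make the roles of the three parameters concrete, the following minimal sketch derives the four partition sizes from a target sample size. The function name `partition_sizes`, the target size `n`, and the rounding strategy are illustrative assumptions, not part of terazi's API.

```python
def partition_sizes(n, alpha, beta, gamma):
    """Split a target sample size n into the four partition sizes.

    Hypothetical helper: terazi's own implementation may round or
    sample differently.
    """
    for name, value in (("alpha", alpha), ("beta", beta), ("gamma", gamma)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    n_up = round(n * alpha)        # unprivileged group: share alpha of D'
    n_p = n - n_up                 # privileged group: the remainder
    n_up_uf = round(n_up * beta)   # unfavourable share of the unprivileged group
    n_p_uf = round(n_p * gamma)    # unfavourable share of the privileged group

    return {
        "p_f": n_p - n_p_uf,
        "p_uf": n_p_uf,
        "up_f": n_up - n_up_uf,
        "up_uf": n_up_uf,
    }


sizes = partition_sizes(1000, alpha=0.5, beta=0.3, gamma=0.2)
print(sizes)
```

Computing the favourable counts as remainders keeps the four partitions summing to exactly `n`, whatever the rounding does.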
The search depth is set with a level option, which determines the step size used to enumerate the parameter values:
- Level 0: Parameter values are taken in steps of \(0.1\) between \(0\) and \(1\).
- Level 1: Parameter values are taken in steps of \(0.01\) between \(0\) and \(1\).
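The two levels can be pictured as grids of candidate values. The sketch below is illustrative only: the names `parameter_grid` and `candidate_settings` are hypothetical, and it assumes that both endpoints are included and that every \((\alpha, \beta, \gamma)\) combination is a candidate, neither of which is stated above.

```python
from itertools import product


def parameter_grid(level):
    """Return the candidate values for one parameter at a given level."""
    steps = {0: 10, 1: 100}[level]   # level 0 -> 0.1 steps, level 1 -> 0.01 steps
    # Dividing integers avoids the drift of repeatedly adding 0.1 or 0.01.
    return [i / steps for i in range(steps + 1)]


def candidate_settings(level):
    """Yield every (alpha, beta, gamma) triple on the grid (assumed exhaustive)."""
    values = parameter_grid(level)
    return product(values, repeat=3)


level0 = parameter_grid(0)
n_level0 = sum(1 for _ in candidate_settings(0))
print(level0)
print(n_level0)
```

Under these assumptions, level 0 gives 11 values per parameter and level 1 gives 101, so the finer level enlarges the search space by roughly three orders of magnitude.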
Each sampled dataset \(D'\) is used to train the selected classifier, and the search minimizes the following loss function:
$$ loss = (1-\text{DI_Ratio}) + (1-\text{MCC}) $$
where DI_Ratio is the Disparate Impact Ratio metric, defined as:
$$ \text{DI_Ratio} = \frac{P(L=\text{unfavourable} | G=\text{unprivileged})}{P(L=\text{unfavourable} | G=\text{privileged})} $$
and MCC is the Matthews correlation coefficient, a performance metric commonly used when the class labels are imbalanced, defined as:
$$ \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP) \cdot (TP + FN) \cdot (TN + FP) \cdot (TN + FN)}} $$
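As a rough illustration of how the loss combines the two metrics, the sketch below computes it from two group-conditional rates and a confusion matrix. The function names are hypothetical. It assumes that the ratio compares the unprivileged group's rate with the privileged group's rate, and that MCC is taken as \(0\) when its denominator vanishes (a common convention). It is not terazi's own code.

```python
import math


def di_ratio(rate_unprivileged, rate_privileged):
    """Ratio of the two group-conditional label rates."""
    if rate_privileged == 0:
        raise ValueError("privileged-group rate is zero; ratio is undefined")
    return rate_unprivileged / rate_privileged


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0   # assumed convention for a degenerate confusion matrix
    return (tp * tn - fp * fn) / denominator


def loss(rate_unprivileged, rate_privileged, tp, tn, fp, fn):
    """loss = (1 - DI_Ratio) + (1 - MCC), as defined above."""
    return (1 - di_ratio(rate_unprivileged, rate_privileged)) + (1 - mcc(tp, tn, fp, fn))


# A perfect classifier (MCC = 1) whose group rates differ (0.4 vs 0.5).
example = loss(0.4, 0.5, tp=50, tn=50, fp=0, fn=0)
print(example)
```

In this example the MCC term contributes nothing and the whole loss comes from the disparity term, so a classifier can be accurate and still be penalized.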