Jurity Public API

Binary Classification Metrics

class jurity.classification.BinaryClassificationMetrics

Bases: NamedTuple

class AUC

Bases: object

static get_score(actual: List | ndarray | Series, likelihoods: List | ndarray | Series, sample_weight: List | ndarray | Series | None = None) float

Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from predicted likelihoods.

Parameters:
  • actual (Union[List, np.ndarray, pd.Series]) – Binary ground truth (correct) labels (0/1).

  • likelihoods (Union[List, np.ndarray, pd.Series]) – Predicted likelihoods, as returned by a classifier.

  • sample_weight (Union[List, np.ndarray, pd.Series]) – Sample weights.

Return type:

ROC AUC score.
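
A minimal usage sketch based on the signature above (the class path comes from this page; the data values are illustrative):

import numpy as np

from jurity.classification import BinaryClassificationMetrics

# Binary ground-truth labels and classifier likelihoods (illustrative values).
actual = np.array([0, 1, 1, 0, 1, 0])
likelihoods = np.array([0.2, 0.9, 0.6, 0.4, 0.7, 0.1])

# get_score is documented as a static method, so it can be called on the class directly.
auc = BinaryClassificationMetrics.AUC.get_score(actual=actual, likelihoods=likelihoods)
print(auc)  # a float between 0 and 1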

class Accuracy

Bases: object

static get_score(actual: List | ndarray | Series, predicted: List | ndarray | Series, sample_weight: List | ndarray | Series | None = None) float

Calculates accuracy score as the fraction of correctly classified samples.

Parameters:
  • actual (Union[List, np.ndarray, pd.Series]) – Binary ground truth (correct) labels (0/1).

  • predicted (Union[List, np.ndarray, pd.Series]) – Binary predicted labels, as returned by a classifier (0/1).

  • sample_weight (Union[List, np.ndarray, pd.Series]) – Sample weights.

Return type:

Accuracy score.
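
A hedged usage sketch, also showing the sample_weight argument (values are illustrative):

import numpy as np

from jurity.classification import BinaryClassificationMetrics

actual = np.array([1, 0, 1, 1, 0])
predicted = np.array([1, 0, 0, 1, 1])

# Unweighted accuracy: fraction of correctly classified samples.
acc = BinaryClassificationMetrics.Accuracy.get_score(actual=actual, predicted=predicted)

# Weighted accuracy: each sample contributes in proportion to its weight.
weights = np.array([1.0, 1.0, 2.0, 1.0, 0.5])
weighted_acc = BinaryClassificationMetrics.Accuracy.get_score(
    actual=actual, predicted=predicted, sample_weight=weights)

print(acc, weighted_acc)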

class F1

Bases: object

static get_score(actual: List | ndarray | Series, predicted: List | ndarray | Series, sample_weight: List | ndarray | Series | None = None) float

Compute the F1 score, also known as balanced F-score or F-measure.

The F1 score is a weighted average of precision and recall, with equal relative contribution. The best value is 1 and the worst value is 0.

The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

Parameters:
  • actual (Union[List, np.ndarray, pd.Series]) – Binary ground truth (correct) labels (0/1).

  • predicted (Union[List, np.ndarray, pd.Series]) – Binary predicted labels, as returned by a classifier (0/1).

  • sample_weight (Union[List, np.ndarray, pd.Series]) – Sample weights.

Return type:

F1 score.
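
A short sketch that cross-checks the formula above against the Precision and Recall classes documented below (illustrative data; the two printed values should agree up to floating point):

import numpy as np

from jurity.classification import BinaryClassificationMetrics

actual = np.array([1, 1, 0, 1, 0, 0, 1])
predicted = np.array([1, 0, 0, 1, 1, 0, 1])

precision = BinaryClassificationMetrics.Precision.get_score(actual=actual, predicted=predicted)
recall = BinaryClassificationMetrics.Recall.get_score(actual=actual, predicted=predicted)
f1 = BinaryClassificationMetrics.F1.get_score(actual=actual, predicted=predicted)

# F1 should equal the harmonic mean of precision and recall.
print(f1, 2 * precision * recall / (precision + recall))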

class Precision

Bases: object

static get_score(actual: List | ndarray | Series, predicted: List | ndarray | Series, sample_weight: List | ndarray | Series | None = None) float

Calculates precision.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The best value is 1 and the worst value is 0.

Parameters:
  • actual (Union[List, np.ndarray, pd.Series]) – Binary ground truth (correct) labels (0/1).

  • predicted (Union[List, np.ndarray, pd.Series]) – Binary predicted labels, as returned by a classifier (0/1).

  • sample_weight (Union[List, np.ndarray, pd.Series]) – Sample weights.

Return type:

Precision score.

class Recall

Bases: object

static get_score(actual: List | ndarray | Series, predicted: List | ndarray | Series, sample_weight: List | ndarray | Series | None = None) float

Calculates recall.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The best value is 1 and the worst value is 0.

Parameters:
  • actual (Union[List, np.ndarray, pd.Series]) – Binary ground truth (correct) labels (0/1).

  • predicted (Union[List, np.ndarray, pd.Series]) – Binary predicted labels, as returned by a classifier (0/1).

  • sample_weight (Union[List, np.ndarray, pd.Series]) – Sample weights.

Return type:

Recall score.

Binary Fairness Metrics

class jurity.fairness.BinaryFairnessMetrics

Bases: NamedTuple

Class containing a variety of fairness metrics for binary classification.

class AverageOdds

Bases: _BaseBinaryFairness

static get_score(labels: List | ndarray | Series, predictions: List | ndarray | Series, is_member: List | ndarray | Series, membership_label: str | float | int = 1) float

The average odds difference is the average of the difference in FPR and the difference in TPR between group 1 and group 2.

\[\frac{1}{2} \left[(FPR_{D = \text{group 1}} - FPR_{D = \text{group 2}}) + (TPR_{D = \text{group 2}} - TPR_{D = \text{group 1}})\right]\]

If predictions within any group are homogeneous, some of the underlying rates (TPR, TNR, FPR, FNR) cannot be calculated; in that case, NaN is returned.

Parameters:
  • labels (Union[List, np.ndarray, pd.Series]) – Binary ground truth labels for the provided dataset (0/1).

  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • is_member (Union[List, np.ndarray, pd.Series]) – Binary membership labels (0/1).

  • membership_label (Union[str, float, int]) – Value indicating group membership. Default value is 1.

Return type:

Average odds difference between groups.
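
A hedged usage sketch based on the signature above (data values and group assignment are illustrative):

import numpy as np

from jurity.fairness import BinaryFairnessMetrics

labels      = np.array([1, 0, 1, 1, 0, 1, 0, 0])
predictions = np.array([1, 0, 0, 1, 1, 1, 0, 1])
is_member   = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 1 marks the protected group

score = BinaryFairnessMetrics.AverageOdds.get_score(
    labels=labels, predictions=predictions, is_member=is_member, membership_label=1)
print(score)  # average odds difference between the two groups (NaN if a rate is undefined)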

DisparateImpact

alias of BinaryDisparateImpact

class EqualOpportunity

Bases: _BaseBinaryFairness

static get_score(labels: List | ndarray | Series, predictions: List | ndarray | Series, is_member: List | ndarray | Series, membership_label: str | float | int = 1) float

Calculate the ratio of true positives to positive examples in the dataset, \(TPR = TP/P\), conditioned on a protected attribute.

Parameters:
  • labels (Union[List, np.ndarray, pd.Series]) – Binary ground truth labels for the provided dataset (0/1).

  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • is_member (Union[List, np.ndarray, pd.Series]) – Binary membership labels (0/1).

  • membership_label (Union[str, float, int]) – Value indicating group membership. Default value is 1.

Return type:

Equal opportunity difference between groups.

class FNRDifference

Bases: _BaseBinaryFairness

static get_score(labels: List | ndarray | Series, predictions: List | ndarray | Series, is_member: List | ndarray | Series, membership_label: str | float | int = 1) float

The equality (or lack thereof) of the false negative rates across groups is an important fairness metric. In practice, this metric is implemented as the difference between the false negative rate for group 1 and group 2.

\[E[d(X)=0 \mid Y=1, g(X)] = E[d(X)=0 \mid Y=1]\]
Parameters:
  • labels (Union[List, np.ndarray, pd.Series]) – Binary ground truth labels for the provided dataset (0/1).

  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • is_member (Union[List, np.ndarray, pd.Series]) – Binary membership labels (0/1).

  • membership_label (Union[str, float, int]) – Value indicating group membership. Default value is 1.

Return type:

False Negative Rate difference between groups.

class FORDifference

Bases: _BaseBinaryFairness

static get_score(labels: List | ndarray | Series, predictions: List | ndarray | Series, is_member: List | ndarray | Series, membership_label: str | float | int = 1) float

The equality (or lack thereof) of the false omission rates across groups is an important fairness metric. In practice, this metric is implemented as the difference between the false omission rate (the ratio of false negatives to predicted negatives) for group 1 and group 2,

\[FOR = FN/N, \text{ conditioned on a protected attribute.}\]
Parameters:
  • labels (Union[List, np.ndarray, pd.Series]) – Binary ground truth labels for the provided dataset (0/1).

  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • is_member (Union[List, np.ndarray, pd.Series]) – Binary membership labels (0/1).

  • membership_label (Union[str, float, int]) – Value indicating group membership. Default value is 1.

Return type:

False Omission Rate difference between groups.

class GeneralizedEntropyIndex(positive_label_name: float = 1)

Bases: _BaseBinaryFairness

get_score(labels: List | ndarray | Series, predictions: List | ndarray | Series, alpha: float = 2) float

Generalized entropy index is proposed as a unified individual and group fairness measure in [3]. With \(b_i = \hat{y}_i - y_i + 1\):

\[\begin{split}\mathcal{E}(\alpha) = \begin{cases} \frac{1}{n \alpha (\alpha-1)}\sum_{i=1}^n\left[\left(\frac{b_i}{\mu}\right)^\alpha - 1\right] & \alpha \ne 0, 1, \\ \frac{1}{n}\sum_{i=1}^n\frac{b_{i}}{\mu}\ln\frac{b_{i}}{\mu} & \alpha=1, \\ -\frac{1}{n}\sum_{i=1}^n\ln\frac{b_{i}}{\mu},& \alpha=0. \end{cases}\end{split}\]
Parameters:
  • labels (Union[List, np.ndarray, pd.Series]) – Binary ground truth labels for the provided dataset (0/1).

  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • alpha (float) – Parameter that regulates weight given to distances between values at different parts of the distribution. Default value is 2.

Returns:

Generalized Entropy Index of the classifier.
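
The definition above can be traced by hand. The NumPy lines below are a sketch of the \(\alpha \ne 0, 1\) branch for the default \(\alpha = 2\) (not Jurity's internal code), followed by the corresponding call to the documented class, which should give the same value:

import numpy as np

from jurity.fairness import BinaryFairnessMetrics

labels      = np.array([1, 0, 1, 1, 0, 1])
predictions = np.array([1, 0, 0, 1, 1, 1])
alpha = 2

# b_i = y_hat_i - y_i + 1, as defined above.
b = predictions - labels + 1
mu = b.mean()
n = len(b)

# alpha not in {0, 1} branch of the generalized entropy index.
gei_manual = ((b / mu) ** alpha - 1).sum() / (n * alpha * (alpha - 1))

# get_score is an instance method here, so the class is instantiated first.
gei = BinaryFairnessMetrics.GeneralizedEntropyIndex().get_score(labels, predictions, alpha=alpha)
print(gei_manual, gei)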

class PredictiveEquality

Bases: _BaseBinaryFairness

static get_score(labels: List | ndarray | Series, predictions: List | ndarray | Series, is_member: List | ndarray | Series, membership_label: str | float | int = 1) float

We define predictive equality as the situation in which the accuracy of decisions is equal across groups, as measured by the false positive rate (FPR).

Drawing the analogy of gender classification where race is the protected attribute, predictive equality requires that, across all race groups, the ratio of men incorrectly predicted to be women is the same.

More formally,

\[E[d(X) \mid Y=0, g(X)] = E[d(X) \mid Y=0]\]
Parameters:
  • labels (Union[List, np.ndarray, pd.Series]) – Binary ground truth labels for the provided dataset (0/1).

  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • is_member (Union[List, np.ndarray, pd.Series]) – Binary membership labels (0/1).

  • membership_label (Union[str, float, int]) – Value indicating group membership. Default value is 1.

Return type:

Predictive Equality difference between groups.

StatisticalParity

alias of BinaryStatisticalParity

class TheilIndex(positive_label_name: float = 1)

Bases: _BaseBinaryFairness

get_score(labels: List | ndarray | Series, predictions: List | ndarray | Series) float

The Theil index is the generalized entropy index with \(\alpha = 1\). See Generalized Entropy index.

Parameters:
  • labels (Union[List, np.ndarray, pd.Series]) – Binary ground truth labels for the provided dataset (0/1).

  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

Return type:

Theil Index of the classifier.

static get_all_scores(labels: List | ndarray | Series, predictions: List | ndarray | Series, is_member: List | ndarray | Series, membership_label: str | float | int = 1) DataFrame

Calculates and tabulates all of the fairness metric scores.

Parameters:
  • labels (Union[List, np.ndarray, pd.Series]) – Binary ground truth labels for the provided dataset (0/1).

  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • is_member (Union[List, np.ndarray, pd.Series]) – Binary membership labels (0/1).

  • membership_label (Union[str, float, int]) – Value indicating group membership. Default value is 1.

Return type:

Pandas data frame with all implemented binary fairness metrics.
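
A minimal sketch of the tabulated output, using the same illustrative inputs as the individual metrics above:

import numpy as np

from jurity.fairness import BinaryFairnessMetrics

labels      = np.array([1, 0, 1, 1, 0, 1, 0, 0])
predictions = np.array([1, 0, 0, 1, 1, 1, 0, 1])
is_member   = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# One row per implemented binary fairness metric.
df = BinaryFairnessMetrics.get_all_scores(
    labels=labels, predictions=predictions, is_member=is_member, membership_label=1)
print(df)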

Multiclass Fairness Metrics

class jurity.fairness.MultiClassFairnessMetrics

Bases: NamedTuple

Class containing a variety of fairness metrics for multi-class classification.

DisparateImpact

alias of MultiDisparateImpact

StatisticalParity

alias of MultiStatisticalParity

static get_all_scores(predictions: List | ndarray | Series, is_member: List | ndarray | Series, list_of_classes: List[str])

Calculates and tabulates all of the fairness metric scores.

Parameters:
  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • is_member (Union[List, np.ndarray, pd.Series]) – Binary membership labels (0/1).

  • list_of_classes (List[str]) – List with class labels.

Return type:

Pandas data frame with all implemented multi-class fairness metrics.
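
A hedged sketch based on the signature above; it assumes that predictions contain class labels drawn from list_of_classes (the data below is illustrative):

from jurity.fairness import MultiClassFairnessMetrics

predictions     = ["a", "b", "a", "c", "b", "a", "c", "b"]
is_member       = [0, 0, 0, 0, 1, 1, 1, 1]
list_of_classes = ["a", "b", "c"]

# Tabulates the multi-class fairness metrics over the given classes.
df = MultiClassFairnessMetrics.get_all_scores(
    predictions=predictions, is_member=is_member, list_of_classes=list_of_classes)
print(df)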

Binary Bias Mitigation

class jurity.mitigation.BinaryMitigation

Bases: NamedTuple

Class containing methods for bias mitigation in binary classification tasks.

class EqualizedOdds(seed=1)

Bases: _BaseMitigation

fit(labels: List | ndarray | Series, predictions: List | ndarray | Series, likelihoods: List | ndarray | Series, is_member: List | ndarray | Series)

Idea: Imagine two groups have different ROC curves. Find the convex hull such that any FPR, TPR pair can be satisfied by either protected-group-conditional predictor. This might not be possible without randomization [4].

The output of this optimization is a tuple of four probabilities of flipping the likelihood of a positive prediction to achieve equal FPR & TPR across two groups. We can then apply these learned mixing rates on new unseen data to achieve fairer distributions of outcomes.

Parameters:
  • labels (Union[List, np.ndarray, pd.Series]) – Binary ground truth labels for the provided dataset (0/1).

  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • likelihoods (Union[List, np.ndarray, pd.Series]) – Scores between 0 and 1 from some black-box classifier.

  • is_member (Union[List, np.ndarray, pd.Series]) – Binary membership labels (0/1).

Return type:

None.

References

Hardt, M., Price, E., and Srebro, N. “Equality of Opportunity in Supervised Learning.” Advances in Neural Information Processing Systems 29, 2016.

fit_transform(labels: List | ndarray | Series, predictions: List | ndarray | Series, likelihoods: List | ndarray | Series, is_member: List | ndarray | Series) Tuple[ndarray, ndarray]

Apply fit and transform methods on the current dataset.

Parameters:
  • labels (Union[List, np.ndarray, pd.Series]) – Binary ground truth labels for the provided dataset (0/1).

  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • likelihoods (Union[List, np.ndarray, pd.Series]) – Scores between 0 and 1 from some black-box classifier.

  • is_member (Union[List, np.ndarray, pd.Series]) – Binary membership labels (0/1).

Returns:

  • fair_predictions (np.ndarray) – Fairer predictions with closely matching FPR & TPR across groups

  • fair_likelihoods (np.ndarray) – Fairer likelihoods with closely matching FPR & TPR across groups

transform(predictions: List | ndarray | Series, likelihoods: List | ndarray | Series, is_member: List | ndarray | Series) Tuple[ndarray, ndarray]

Apply fairness probabilistic mixing rates to a new dataset.

The idea here is to probabilistically flip a subset of likelihoods and labels in each group based on learned mixing rates so that we achieve fairer distribution of outcomes.

There is a trade-off between fairness and accuracy of a classifier. In general, repairing fairness metrics results in lower accuracy, but the relationship is non-linear and data dependent.

Parameters:
  • predictions (Union[List, np.ndarray, pd.Series]) – Binary predictions from some black-box classifier (0/1).

  • likelihoods (Union[List, np.ndarray, pd.Series]) – Scores between 0 and 1 from some black-box classifier.

  • is_member (Union[List, np.ndarray, pd.Series]) – Binary membership labels (0/1).

Returns:

  • fair_predictions (np.ndarray) – Fairer predictions with closely matching FPR & TPR across groups

  • fair_likelihoods (np.ndarray) – Fairer likelihoods with closely matching FPR & TPR across groups
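
A hedged end-to-end sketch of the fit/transform workflow documented above; the arrays are illustrative stand-ins for a real classifier's outputs, and for brevity the mixing rates are applied back to the same data they were learned on:

import numpy as np

from jurity.mitigation import BinaryMitigation

labels      = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
predictions = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 0])
likelihoods = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.7, 0.1, 0.6, 0.8, 0.3])
is_member   = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

mitigation = BinaryMitigation.EqualizedOdds(seed=1)

# Learn the group-specific mixing rates and apply them in one step.
fair_predictions, fair_likelihoods = mitigation.fit_transform(
    labels, predictions, likelihoods, is_member)

# Alternatively, fit on labeled data and transform new, unseen data later
# (new_predictions, new_likelihoods, new_is_member are hypothetical placeholders):
# mitigation.fit(labels, predictions, likelihoods, is_member)
# fair_predictions, fair_likelihoods = mitigation.transform(new_predictions, new_likelihoods, new_is_member)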

Binary Recommenders Metrics

class jurity.recommenders.BinaryRecoMetrics

Bases: NamedTuple

class AUC(click_column: str, k: int | None = None, user_id_column: str = 'user_id', item_id_column: str = 'item_id')

Bases: _BaseRecommenders

Area-Under-the-Curve

Calculates the AUC using a direct matching method. That is, AUC is calculated for instances where the actual item the user has seen matches one of the top-k recommendations.

get_score(actual_results: DataFrame, predicted_results: DataFrame, batch_accumulate: bool = False, return_extended_results: bool = False) float | dict | Tuple[float, float] | Tuple[dict, dict]

Evaluates the current metric on the given data.

There are 4 scenarios controlled by the batch_accumulate and return_extended_results parameters:

  1. Calculating the metric for the whole data:

This is the default method, which assumes you are operating on the full data and you want to get the metric by itself. Returns float.

print(auc.get_score(actual_responses, recommendations))
>>> 0.68
  2. Calculating the extended results for the whole data:

This assumes you are operating on the full data and you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns dict.

print(auc.get_score(actual_responses, recommendations, return_extended_results=True))
>>> {'auc': 0.68, 'support': 122}
  3. Calculating the metric across multiple batches:

This assumes that you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes that you want to get the metric by itself. Returns Tuple[float, float].

for actual_responses_batch, recommendations_batch in ...:
    auc_batch, auc_acc = auc.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True)
    print(f'AUC for this batch: {auc_batch} Overall AUC: {auc_acc}')
    >>> AUC for this batch: 0.65 Overall AUC: 0.68
  4. Calculating the extended results across multiple batches:

This assumes you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns Tuple[dict, dict].

for actual_responses_batch, recommendations_batch in ...:
    auc_batch, auc_acc = auc.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True,
                                       return_extended_results=True)
    print(f'AUC for this batch: {auc_batch} Overall AUC: {auc_acc}')
    >>> AUC for this batch: {'auc': 0.65, 'support': 12} Overall AUC: {'auc': 0.68, 'support': 122}
Parameters:
  • actual_results (pd.DataFrame) – A pandas DataFrame for the ground truth user item interaction data, captured from historical logs. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • predicted_results (pd.DataFrame) – A pandas DataFrame for the recommended user item interaction data, captured from a recommendation algorithm. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • batch_accumulate (bool) – If specified, this parameter allows you to pass in minibatches of results and accumulate the metric correctly across the batches. This reduces the memory footprint and integrates easily with batched training. If specified, the get_score function will return a tuple of batch results and accumulated results.

  • return_extended_results (bool) – Whether the extended results such as the support should also be returned. If specified, the returned results will be of type dict. AUC currently returns auc and the support used to calculate AUC.

Returns:

metric – The averaged result(s). The return type is determined by the batch_accumulate and return_extended_results parameters. See the examples above.

Return type:

Union[float, dict, Tuple[float, float], Tuple[dict, dict]]
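
A hedged sketch of how the two DataFrames described above might be laid out and scored; the column names user_id, item_id, and click are illustrative (click is passed as click_column):

import pandas as pd

from jurity.recommenders import BinaryRecoMetrics

# Logged interactions: one row per (user, item) with an observed click (0/1).
actual_responses = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "item_id": [10, 11, 10, 12, 11],
    "click":   [1, 0, 0, 1, 1],
})

# Recommendations: one row per (user, item) with the model's click likelihood.
recommendations = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "item_id": [10, 11, 10, 12, 11],
    "click":   [0.8, 0.3, 0.2, 0.7, 0.9],
})

auc = BinaryRecoMetrics.AUC(click_column="click", k=2)
print(auc.get_score(actual_responses, recommendations))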

class CTR(click_column: str, k: int | None = None, user_id_column: str = 'user_id', item_id_column: str = 'item_id', value_column: str | None = None, estimation: str = 'matching', propensity_column: str | None = 'propensity')

Bases: _BaseRecommenders

Click-through rate

Three supported estimation methods:

1. Matching: Calculates the CTR using a direct matching method. That is, CTR is only calculated for instances where the actual item the user has seen matches the recommendation.

2. Inverse Propensity Score (IPS): Calculates the IPS, an estimate of CTR with a weighted correction based on how likely an item was to be recommended by the historic policy if the user saw the item in the historic data.

\[IPS = \frac{1}{n} \sum r_a \times \frac{I(\hat{a} = a)}{p(a|x,h)}\]

In this equation:

  • \(n\) is the total size of the test data

  • \(r_a\) is the observed reward

  • \(\hat{a}\) is the recommended item

  • \(I(\hat{a} = a)\) is a boolean of whether the user-item pair has historic data

  • \(p(a|x,h)\) is the probability of the item being recommended for the test context given the historic data

3. Doubly Robust Estimation (DR): Calculates the DR, an estimate of CTR that combines the directly predicted values with a correction based on how likely an item was to be recommended by the historic policy if the user saw the item in the historic data.

\[DR = \frac{1}{n} \sum \left(\hat{r}_a + \frac{(r_a - \hat{r}_a)\, I(\hat{a} = a)}{p(a|x,h)}\right)\]

In this equation:

  • \(n\) is the total size of the test data

  • \(r_a\) is the observed reward

  • \(\hat{r_a}\) is the predicted reward

  • \(\hat{a}\) is the recommended item

  • \(I(\hat{a} = a)\) is a boolean of whether the user-item pair has historic data

  • \(p(a|x,h)\) is the probability of the item being recommended for the test context given the historic data

At a high level, doubly robust estimation combines a direct estimate with an IPS-like correction if historic data is available. If historic data is not available, the second term is 0 and only the predicted reward is used for the user-item pair.

IPS and DR implementations are based on: Dudík, Miroslav, John Langford, and Lihong Li. “Doubly robust policy evaluation and learning.” Proceedings of the 28th International Conference on Machine Learning. 2011. Available as arXiv preprint arXiv:1103.4601
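
Both estimators above reduce to simple averages once the rewards, match indicators, and propensities are in hand. The NumPy lines below are a sketch of the two formulas (not Jurity's internal implementation), with r the observed reward, r_hat a hypothetical predicted reward, matched the indicator \(I(\hat{a} = a)\), and p the propensity \(p(a|x,h)\):

import numpy as np

r       = np.array([1.0, 0.0, 1.0, 0.0, 1.0])    # observed rewards r_a
r_hat   = np.array([0.6, 0.1, 0.7, 0.3, 0.5])    # predicted rewards (used by DR)
matched = np.array([1, 0, 1, 1, 0])              # I(a_hat == a)
p       = np.array([0.5, 0.3, 0.25, 0.2, 0.4])   # p(a | x, h)

ips = np.mean(r * matched / p)
dr = np.mean(r_hat + (r - r_hat) * matched / p)
print(ips, dr)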

get_score(actual_results: DataFrame, predicted_results: DataFrame, batch_accumulate: bool = False, return_extended_results: bool = False) float | dict | Tuple[float, float] | Tuple[dict, dict]

Evaluates the current metric on the given data.

There are 4 scenarios controlled by the batch_accumulate and return_extended_results parameters:

  1. Calculating the metric for the whole data:

This is the default method, which assumes you are operating on the full data and you want to get the metric by itself. Returns float.

print(ctr.get_score(actual_responses_batch, recommendations_batch))
>>> 0.316
  2. Calculating the extended results for the whole data:

This assumes you are operating on the full data and you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns dict.

print(ctr.get_score(actual_responses_batch, recommendations_batch, return_extended_results=True))
>>> {'ctr': 0.316, 'support': 122}
  3. Calculating the metric across multiple batches:

This assumes that you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes that you want to get the metric by itself. Returns Tuple[float, float].

for actual_responses_batch, recommendations_batch in ...:
    ctr_batch, ctr_acc = ctr.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True)
    print(f'CTR for this batch: {ctr_batch} Overall CTR: {ctr_acc}')
    >>> CTR for this batch: 0.453 Overall CTR: 0.316
  4. Calculating the extended results across multiple batches:

This assumes you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns Tuple[dict, dict].

for actual_responses_batch, recommendations_batch in ...:
    ctr_batch, ctr_acc = ctr.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True, return_extended_results=True)
    print(f'CTR for this batch: {ctr_batch} Overall CTR: {ctr_acc}')
    >>> CTR for this batch: {'ctr': 0.453, 'support': 12} Overall CTR: {'ctr': 0.316, 'support': 122}
Parameters:
  • actual_results (pd.DataFrame) – A pandas DataFrame for the ground truth user item interaction data, captured from historical logs. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • predicted_results (pd.DataFrame) – A pandas DataFrame for the recommended user item interaction data, captured from a recommendation algorithm. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • batch_accumulate (bool) – If specified, this parameter allows you to pass in minibatches of results and accumulate the metric correctly across the batches. This reduces the memory footprint and integrates easily with batched training. If specified, the get_score function will return a tuple of batch results and accumulated results.

  • return_extended_results (bool) – Whether the extended results such as the support should also be returned. If specified, the returned results will be of type dict. CTR currently returns ctr and the support used to calculate CTR.

Returns:

metric – The averaged result(s). The return type is determined by the batch_accumulate and return_extended_results parameters. See the examples above.

Return type:

Union[float, dict, Tuple[float, float], Tuple[dict, dict]]

Ranking Recommenders Metrics

class jurity.recommenders.RankingRecoMetrics

Bases: NamedTuple

class MAP(click_column, k: int | None = None, user_id_column: str = 'user_id', item_id_column: str = 'item_id')

Bases: _BaseRecommenders

Mean Average Precision

\[MAP@k = \frac{1}{\left | A \right |} \sum_{i=1}^{\left | A \right |} \frac{1}{\min(k,\left | A_i \right |)}\sum_{n=1}^k Precision_i(n) \times rel(P_{i,n})\]

Intuitively, MAP measures how precise the recommendations are while taking the ranking of the recommendations into account.
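
To make the formula concrete, here is a hand-rolled toy computation of MAP@k (this is a sketch of the definition above, not Jurity's implementation; use get_score on DataFrames in practice):

# Relevant items per user and the ranked recommendations shown to them.
actual    = [{1, 3}, {2}]
predicted = [[1, 2, 3], [1, 3, 2]]
k = 3

def average_precision_at_k(relevant, ranked, k):
    hits, score = 0, 0.0
    for n, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / n  # Precision_i(n) * rel(P_{i,n})
    return score / min(k, len(relevant))

map_at_k = sum(average_precision_at_k(a, p, k) for a, p in zip(actual, predicted)) / len(actual)
print(map_at_k)  # 7/12 for this toy data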

get_score(actual_results: DataFrame, predicted_results: DataFrame, batch_accumulate: bool = False, return_extended_results: bool = False) float | dict | Tuple[float, float] | Tuple[dict, dict]

Evaluates the current metric on the given data.

There are 4 scenarios controlled by the batch_accumulate and return_extended_results parameters:

  1. Calculating the metric for the whole data:

This is the default method, which assumes you are operating on the full data and you want to get the metric by itself. Returns float.

print(ctr.get_score(actual_responses_batch, recommendations_batch))
>>> 0.316
  2. Calculating the extended results for the whole data:

This assumes you are operating on the full data and you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns dict.

print(ctr.get_score(actual_responses_batch, recommendations_batch, return_extended_results=True))
>>> {'ctr': 0.316, 'support': 122}
  3. Calculating the metric across multiple batches:

This assumes that you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes that you want to get the metric by itself. Returns Tuple[float, float].

for actual_responses_batch, recommendations_batch in ...:
    ctr_batch, ctr_acc = ctr.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True)
    print(f'CTR for this batch: {ctr_batch} Overall CTR: {ctr_acc}')
    >>> CTR for this batch: 0.453 Overall CTR: 0.316
  4. Calculating the extended results across multiple batches:

This assumes you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns Tuple[dict, dict].

for actual_responses_batch, recommendations_batch in ...:
    ctr_batch, ctr_acc = ctr.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True, return_extended_results=True)
    print(f'CTR for this batch: {ctr_batch} Overall CTR: {ctr_acc}')
    >>> CTR for this batch: {'ctr': 0.453, 'support': 12} Overall CTR: {'ctr': 0.316, 'support': 122}
Parameters:
  • actual_results (pd.DataFrame) – A pandas DataFrame for the ground truth user item interaction data, captured from historical logs. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • predicted_results (pd.DataFrame) – A pandas DataFrame for the recommended user item interaction data, captured from a recommendation algorithm. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • batch_accumulate (bool) – If specified, this parameter allows you to pass in minibatches of results and accumulate the metric correctly across the batches. This reduces the memory footprint and integrates easily with batched training. If specified, the get_score function will return a tuple of batch results and accumulated results.

  • return_extended_results (bool) – Whether the extended results such as the support should also be returned. If specified, the returned results will be of type dict. MAP currently returns map and the support used to calculate MAP.

Returns:

metric – The averaged result(s). The return type is determined by the batch_accumulate and return_extended_results parameters. See the examples above.

Return type:

Union[float, dict, Tuple[float, float], Tuple[dict, dict]]

class NDCG(click_column, k: int | None = None, user_id_column: str = 'user_id', item_id_column: str = 'item_id')

Bases: _BaseRecommenders

Normalized Discounted Cumulative Gain

NDCG measures the ranking of the relevant items with a non-linear, discounted (log2) score per rank. NDCG is normalized such that the scores are between 0 and 1.

\[NDCG@k = \frac{1}{\left | A \right |} \sum_{i=1}^{\left | A \right |} \frac {\sum_{r=1}^{\left | P_i \right |} \frac{rel(P_{i,r})}{log_2(r+1)}}{\sum_{r=1}^{\left | A_i \right |} \frac{1}{log_2(r+1)}}\]
get_score(actual_results: DataFrame, predicted_results: DataFrame, batch_accumulate: bool = False, return_extended_results: bool = False) float | dict | Tuple[float, float] | Tuple[dict, dict]

Evaluates the current metric on the given data.

There are 4 scenarios controlled by the batch_accumulate and return_extended_results parameters:

  1. Calculating the metric for the whole data:

This is the default method, which assumes you are operating on the full data and you want to get the metric by itself. Returns float.

print(ctr.get_score(actual_responses_batch, recommendations_batch))
>>> 0.316
  2. Calculating the extended results for the whole data:

This assumes you are operating on the full data and you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns dict.

print(ctr.get_score(actual_responses_batch, recommendations_batch, return_extended_results=True))
>>> {'ctr': 0.316, 'support': 122}
  3. Calculating the metric across multiple batches:

This assumes that you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes that you want to get the metric by itself. Returns Tuple[float, float].

for actual_responses_batch, recommendations_batch in ...:
    ctr_batch, ctr_acc = ctr.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True)
    print(f'CTR for this batch: {ctr_batch} Overall CTR: {ctr_acc}')
    >>> CTR for this batch: 0.453 Overall CTR: 0.316
  4. Calculating the extended results across multiple batches:

This assumes you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns Tuple[dict, dict].

for actual_responses_batch, recommendations_batch in ...:
    ctr_batch, ctr_acc = ctr.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True, return_extended_results=True)
    print(f'CTR for this batch: {ctr_batch} Overall CTR: {ctr_acc}')
    >>> CTR for this batch: {'ctr': 0.453, 'support': 12} Overall CTR: {'ctr': 0.316, 'support': 122}
Parameters:
  • actual_results (pd.DataFrame) – A pandas DataFrame for the ground truth user item interaction data, captured from historical logs. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • predicted_results (pd.DataFrame) – A pandas DataFrame for the recommended user item interaction data, captured from a recommendation algorithm. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • batch_accumulate (bool) – If specified, this parameter allows you to pass in minibatches of results and accumulate the metric correctly across the batches. This reduces the memory footprint and integrates easily with batched training. If specified, the get_score function will return a tuple of batch results and accumulated results.

  • return_extended_results (bool) – Whether the extended results such as the support should also be returned. If specified, the returned results will be of type dict. NDCG currently returns ndcg and the support used to calculate NDCG.

Returns:

metric – The averaged result(s). The return type is determined by the batch_accumulate and return_extended_results parameters. See the examples above.

Return type:

Union[float, dict, Tuple[float, float], Tuple[dict, dict]]

class Precision(click_column, k: int | None = None, user_id_column: str = 'user_id', item_id_column: str = 'item_id')

Bases: _BaseRecommenders

Precision@k

Precision@k measures the precision of the recommendations when only k recommendations are made to the user. That is, it measures the ratio of recommendations among the top k items that are relevant.

\[Precision@k = \frac{1}{\left | A \cap P \right |}\sum_{i=1}^{\left | A \cap P \right |} \frac{\left | A_i \cap P_i[1:k] \right |}{\left | P_i[1:k] \right |}\]
get_score(actual_results: DataFrame, predicted_results: DataFrame, batch_accumulate: bool = False, return_extended_results: bool = False) float | dict | Tuple[float, float] | Tuple[dict, dict]

Evaluates the current metric on the given data.

There are 4 scenarios controlled by the batch_accumulate and return_extended_results parameters:

  1. Calculating the metric for the whole data:

This is the default method, which assumes you are operating on the full data and you want to get the metric by itself. Returns float.

print(ctr.get_score(actual_responses_batch, recommendations_batch))
>>> 0.316
  2. Calculating the extended results for the whole data:

This assumes you are operating on the full data and you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns dict.

print(ctr.get_score(actual_responses_batch, recommendations_batch, return_extended_results=True))
>>> {'ctr': 0.316, 'support': 122}
  3. Calculating the metric across multiple batches:

This assumes that you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes that you want to get the metric by itself. Returns Tuple[float, float].

for actual_responses_batch, recommendations_batch in ...:
    ctr_batch, ctr_acc = ctr.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True)
    print(f'CTR for this batch: {ctr_batch} Overall CTR: {ctr_acc}')
    >>> CTR for this batch: 0.453 Overall CTR: 0.316
  4. Calculating the extended results across multiple batches:

This assumes you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns Tuple[dict, dict].

for actual_responses_batch, recommendations_batch in ...:
    ctr_batch, ctr_acc = ctr.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True, return_extended_results=True)
    print(f'CTR for this batch: {ctr_batch} Overall CTR: {ctr_acc}')
    >>> CTR for this batch: {'ctr': 0.453, 'support': 12} Overall CTR: {'ctr': 0.316, 'support': 122}
Parameters:
  • actual_results (pd.DataFrame) – A pandas DataFrame for the ground truth user item interaction data, captured from historical logs. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • predicted_results (pd.DataFrame) – A pandas DataFrame for the recommended user item interaction data, captured from a recommendation algorithm. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • batch_accumulate (bool) – If specified, this parameter allows you to pass in minibatches of results and accumulate the metric correctly across the batches. This reduces the memory footprint and integrates easily with batched training. If specified, the get_score function will return a tuple of batch results and accumulated results.

  • return_extended_results (bool) – Whether the extended results such as the support should also be returned. If specified, the returned results will be of type dict. Precision currently returns precision and the support used to calculate Precision.

Returns:

metric – The averaged result(s). The return type is determined by the batch_accumulate and return_extended_results parameters. See the examples above.

Return type:

Union[float, dict, Tuple[float, float], Tuple[dict, dict]]

class Recall(click_column, k: int | None = None, user_id_column: str = 'user_id', item_id_column: str = 'item_id')

Bases: _BaseRecommenders

Recall@k

Recall@k measures the recall of the recommendations when only k recommendations are made to the user. That is, it measures the ratio of relevant items that were among the top k recommendations.

\[Recall@k = \frac{1}{\left | A \right |}\sum_{i=1}^{\left | A \right |} \frac{\left | A_i \cap P_i[1:k] \right |}{\left | A_i \right |}\]
get_score(actual_results: DataFrame, predicted_results: DataFrame, batch_accumulate: bool = False, return_extended_results: bool = False) float | dict | Tuple[float, float] | Tuple[dict, dict]

Evaluates the current metric on the given data.

There are 4 scenarios controlled by the batch_accumulate and return_extended_results parameters:

  1. Calculating the metric for the whole data:

This is the default method, which assumes you are operating on the full data and you want to get the metric by itself. Returns float.

print(ctr.get_score(actual_responses_batch, recommendations_batch))
>>> 0.316
  2. Calculating the extended results for the whole data:

This assumes you are operating on the full data and you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns dict.

print(ctr.get_score(actual_responses_batch, recommendations_batch, return_extended_results=True))
>>> {'ctr': 0.316, 'support': 122}
  3. Calculating the metric across multiple batches:

This assumes that you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes that you want to get the metric by itself. Returns Tuple[float, float].

for actual_responses_batch, recommendations_batch in ...:
    ctr_batch, ctr_acc = ctr.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True)
    print(f'CTR for this batch: {ctr_batch} Overall CTR: {ctr_acc}')
    >>> CTR for this batch: 0.453 Overall CTR: 0.316
  4. Calculating the extended results across multiple batches:

This assumes you are operating on batched data, and will therefore call this method multiple times for each batch. It also assumes you want to get the auxiliary information such as the support in addition to the metric. The information returned depends on the metric. Returns Tuple[dict, dict].

for actual_responses_batch, recommendations_batch in ...:
    ctr_batch, ctr_acc = ctr.get_score(actual_responses_batch, recommendations_batch, batch_accumulate=True, return_extended_results=True)
    print(f'CTR for this batch: {ctr_batch} Overall CTR: {ctr_acc}')
    >>> CTR for this batch: {'ctr': 0.453, 'support': 12} Overall CTR: {'ctr': 0.316, 'support': 122}
Parameters:
  • actual_results (pd.DataFrame) – A pandas DataFrame for the ground truth user item interaction data, captured from historical logs. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • predicted_results (pd.DataFrame) – A pandas DataFrame for the recommended user item interaction data, captured from a recommendation algorithm. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • batch_accumulate (bool) – If specified, this parameter allows you to pass in minibatches of results and accumulate the metric correctly across the batches. This reduces the memory footprint and integrates easily with batched training. If specified, the get_score function will return a tuple of batch results and accumulated results.

  • return_extended_results (bool) – Whether the extended results such as the support should also be returned. If specified, the returned results will be of type dict. Recall currently returns recall and the support used to calculate Recall.

Returns:

metric – The averaged result(s). The return type is determined by the batch_accumulate and return_extended_results parameters. See the examples above.

Return type:

Union[float, dict, Tuple[float, float], Tuple[dict, dict]]

Diversity Recommenders Metrics

class jurity.recommenders.DiversityRecoMetrics

Bases: NamedTuple

class InterListDiversity(click_column, k: int | None = None, user_id_column: str = 'user_id', item_id_column: str = 'item_id', user_sample_size: int | float = 10000, seed: int = 1, metric: str | Callable = 'cosine', num_runs: int = 10, n_jobs: int = 1, working_memory: int | None = None)

Bases: object

Inter-List Diversity@k

Inter-List Diversity@k measures the inter-list diversity of the recommendations when only k recommendations are made to the user. It measures how different users' lists of recommendations are from each other. This metric has a range in \([0, 1]\). The higher this metric is, the more diversified the lists of items recommended to different users are. Let \(U\) denote the set of \(N\) unique users, \(u_i\), \(u_j \in U\) denote the i-th and j-th user in the user set, \(i, j \in \{1,2,\cdots,N\}\). \(R_{u_i}\) is the binary indicator vector representing provided recommendations for \(u_i\). \(I\) is the set of all unique user pairs, \(\forall~i<j, \{u_i, u_j\} \in I\).

\[Inter \mbox{-} list~diversity = \frac{\sum_{i,j, \{u_i, u_j\} \in I}(cosine\_distance(R_{u_i}, R_{u_j}))}{|I|}\]

By default, the reported metric is averaged over num_runs (default=10) evaluations, each using a sample of user_sample_size (default=10000) users, to reduce computation while providing a close approximation of the metric. When user_sample_size=None, all users are used in the evaluation.
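
A hedged usage sketch based on the constructor above and the get_score signature below; the column names are illustrative, user_sample_size=None is used so that this tiny example evaluates all users, and None is passed as a placeholder for the ignored actual_results argument:

import pandas as pd

from jurity.recommenders import DiversityRecoMetrics

recommendations = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "item_id": [10, 11, 10, 12, 11, 13],
    "click":   [0.9, 0.8, 0.7, 0.6, 0.9, 0.5],
})

metric = DiversityRecoMetrics.InterListDiversity(
    click_column="click", k=2, user_sample_size=None)

# actual_results is documented as ignored for this metric.
print(metric.get_score(actual_results=None, predicted_results=recommendations))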

get_score(actual_results: DataFrame, predicted_results: DataFrame, batch_accumulate: bool = False, return_extended_results: bool = False) float | dict

Evaluates the current metric on the given data.

Parameters:
  • actual_results (Ignored.) – Ignored when calculating Inter-List Diversity; the parameter is kept to keep the API consistent across recommender metrics.

  • predicted_results (pd.DataFrame) – A pandas DataFrame for the recommended user item interaction data, captured from a recommendation algorithm. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • batch_accumulate (bool) – Should not be set to True for Inter-List Diversity; the parameter is kept to keep the API consistent across recommender metrics.

  • return_extended_results (bool) – Whether the extended results such as the support should also be returned. If specified, the returned results will be of type dict. Inter-List Diversity currently returns Inter-List Diversity and the support, which is the number of unique users used to calculate it.

Returns:

metric – The averaged result(s). The return type is determined by the return_extended_results parameter.

Return type:

Union[float, dict]

class IntraListDiversity(item_features: DataFrame, click_column, k: int | None = None, user_id_column: str = 'user_id', item_id_column: str = 'item_id', user_sample_size: int | float = 10000, seed: int = 1, metric: str | Callable = 'cosine', n_jobs: int = 1, num_runs: int = 10)

Bases: _BaseRecommenders

Intra-List Diversity@k

Intra-List Diversity@k measures the intra-list diversity of the recommendations when only k recommendations are made to the user. Given a list of items recommended to one user and the item features, the averaged pairwise cosine distance of those items is calculated. The results from all users are then averaged as the metric Intra-List Diversity@k. This metric has a range in \([0, 1]\). The higher this metric is, the more diversified the items recommended to each user are. Let \(U\) denote the set of \(N\) unique users, \(u_i\) denote the i-th user in the user set, \(i \in \{1,2,\cdots,N\}\). \(v_p^{u_i}\), \(v_q^{u_i}\) are the item features of the p-th and q-th item in the list of items recommended to \(u_i\), \(p, q \in \{0,1,\cdots,k-1\}\). \(I^{u_i}\) is the set of all unique pairs of item indices for \(u_i\), \(\forall~p<q, \{p, q\} \in I^{u_i}\).

\[Intra\mbox{-} list~diversity = \frac{1}{N}\sum_{i=1}^N \frac{\sum_{p, q, \{p, q\} \in I^{u_i}}(cosine\_distance(v_p^{u_i}, v_q^{u_i}))}{|I^{u_i}|}\]

By default, the reported metric is averaged over num_runs (default=10) evaluations, each using a sample of user_sample_size (default=10000) users, to reduce computation while providing a close approximation of the metric. When user_sample_size=None, all users are used in the evaluation.
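
A hedged sketch along the same lines as Inter-List Diversity; the exact layout expected for item_features (an item_id column versus an item-id index) is an assumption here, as are the column names:

import pandas as pd

from jurity.recommenders import DiversityRecoMetrics

# Illustrative item features (e.g. embeddings), one row per item.
item_features = pd.DataFrame({
    "item_id": [10, 11, 12, 13],
    "feat_1":  [1.0, 0.0, 0.5, 0.2],
    "feat_2":  [0.0, 1.0, 0.5, 0.8],
})

recommendations = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "item_id": [10, 11, 12, 13],
    "click":   [0.9, 0.8, 0.7, 0.6],
})

metric = DiversityRecoMetrics.IntraListDiversity(
    item_features, click_column="click", k=2, user_sample_size=None)

# actual_results is not used by this metric; None is passed as a placeholder.
print(metric.get_score(actual_results=None, predicted_results=recommendations))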

get_score(actual_results: DataFrame, predicted_results: DataFrame, batch_accumulate: bool = False, return_extended_results: bool = False) float | dict

Evaluates the current metric on the given data.

Parameters:
  • predicted_results (pd.DataFrame) – A pandas DataFrame for the recommended user item interaction data, captured from a recommendation algorithm. The DataFrame should contain a minimum of two columns, including self._user_id_column, self._item_id_column, and anything else the metric may need. Each row contains the interaction of one user with one item, and the scores associated with this interaction. There can be multiple interactions per user, and there can be multiple users per DataFrame. However, the interactions for a specific user must be contained within a single DataFrame.

  • batch_accumulate (bool) – Should not be set to True for Intra-List Diversity; the parameter is kept to keep the API consistent across recommender metrics.

  • return_extended_results (bool) – Whether the extended results such as the support should also be returned. If specified, the returned results will be of type dict. Intra-List Diversity currently returns Intra-List Diversity and the support, which is the number of unique users used to calculate it.

Returns:

metric – The averaged result(s). The return type is determined by the return_extended_results parameter.

Return type:

Union[float, dict]