MABWiser Public API

base_mab

This module defines the abstract base class for contextual multi-armed bandit algorithms.

class mabwiser.base_mab.BaseMAB(rng: _BaseRNG, arms: List[Arm], n_jobs: int, backend: str | None = None)

Bases: object

Abstract base class for multi-armed bandits.

This module is not intended to be used directly; instead, it declares the basic skeleton of multi-armed bandits together with a set of parameters that are common to every bandit algorithm.

It declares abstract methods that sub-classes can override to implement specific bandit policies using:

  • __init__ constructor to initialize the bandit

  • add_arm method to add a new arm

  • fit method for training

  • partial_fit method for online learning

  • predict_expectations method to retrieve the expectation of each arm

  • predict method for testing to retrieve the best arm based on the policy

  • remove_arm method for removing an arm

  • warm_start method for warm starting untrained (cold) arms

rng

The random number generator.

Type:

_BaseRNG

arms

The list of all arms.

Type:

List

n_jobs

This is used to specify how many concurrent processes/threads should be used for parallelized routines. Default value is set to 1. If set to -1, all CPUs are used. If set to -2, all CPUs but one are used, and so on.

Type:

int

backend

Specify a parallelization backend implementation supported in the joblib library. Supported options are:

  • “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.

  • “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.

  • “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.

Default value is None. In this case the default backend selected by joblib will be used.

Type:

str, optional

arm_to_expectation

The dictionary of arms (keys) to their expected rewards (values).

Type:

Dict[Arm, float]

arm_to_status

The dictionary of arms (keys) to their status (values), where the status consists of:

  • is_trained, which indicates whether an arm was fit or partial_fit;

  • is_warm, which indicates whether an arm was warm started, and therefore has a trained model associated;

  • warm_started_by, which indicates the arm that originally warm started this arm.

Arms that were initially warm-started and then updated with partial_fit retain is_warm as True with the relevant warm_started_by arm for tracking purposes.

Type:

Dict[Arm, dict]
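
For illustration only, a status mapping for two arms might look like the sketch below. The field names follow the description above; the exact layout may vary by release.

# Illustrative shape of arm_to_status, based on the field names described above.
arm_to_status = {
    'Arm1': {'is_trained': True,  'is_warm': False, 'warm_started_by': None},    # fit directly
    'Arm2': {'is_trained': False, 'is_warm': True,  'warm_started_by': 'Arm1'},  # copied from Arm1
}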

add_arm(arm: Arm, binarizer: Callable | None = None) None

Introduces a new arm to the bandit.

Adds the new arm with zero expectations and calls the _uptake_new_arm() function of the sub-class.

property cold_arms: List[Arm]

List of cold arms

abstract fit(decisions: ndarray, rewards: ndarray, contexts: ndarray | None = None) None

Abstract method.

Fits the multi-armed bandit to the given decision and reward history and corresponding contexts if any.

abstract partial_fit(decisions: ndarray, rewards: ndarray, contexts: ndarray | None = None) None

Abstract method.

Updates the multi-armed bandit with the given decision and reward history and corresponding contexts if any.

abstract predict(contexts: ndarray | None = None) Arm | List[Arm]

Abstract method.

Returns the predicted arm.

abstract predict_expectations(contexts: ndarray | None = None) Dict[Arm, int | float] | List[Dict[Arm, int | float]]

Abstract method.

Returns a dictionary from arms (keys) to their expected rewards (values).

remove_arm(arm: Arm) None

Removes arm from the bandit.

property trained_arms: List[Arm]

List of trained arms.

Arms for which at least one decision has been observed are deemed trained.

abstract warm_start(arm_to_features: Dict[Arm, List[int | float]], distance_quantile: float) None

Abstract method.

Warm starts cold arms using similar warm arms based on distances between arm features. Only implemented for Learning Policies that make use of the _warm_start method to copy arm information.
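
To make the skeleton concrete, here is a minimal sketch of a context-free subclass that tracks running mean rewards. It assumes the constructor signature shown above and that the abstract methods listed here are the only ones that must be overridden; a given MABWiser release may declare additional internal hooks (such as the _uptake_new_arm mentioned under add_arm) that also need implementations, so check the installed version before relying on this.

from typing import Dict, List, Optional

import numpy as np

from mabwiser.base_mab import BaseMAB
from mabwiser.utils import Arm, Num


class MeanReward(BaseMAB):
    """Sketch of a greedy bandit that predicts the arm with the highest mean reward."""

    def __init__(self, rng, arms: List[Arm], n_jobs: int = 1, backend: Optional[str] = None):
        # rng is the _BaseRNG instance, e.g. from mabwiser.utils.create_rng(seed).
        super().__init__(rng, arms, n_jobs, backend)
        self.arm_to_expectation = dict.fromkeys(arms, 0.0)
        self._counts = dict.fromkeys(arms, 0)

    def fit(self, decisions: np.ndarray, rewards: np.ndarray,
            contexts: Optional[np.ndarray] = None) -> None:
        # Reset the expectations, then reuse the online update below.
        self.arm_to_expectation = dict.fromkeys(self.arms, 0.0)
        self._counts = dict.fromkeys(self.arms, 0)
        self.partial_fit(decisions, rewards, contexts)

    def partial_fit(self, decisions: np.ndarray, rewards: np.ndarray,
                    contexts: Optional[np.ndarray] = None) -> None:
        # Incrementally update the running mean reward of each decided arm.
        for arm, reward in zip(decisions, rewards):
            self._counts[arm] += 1
            delta = reward - self.arm_to_expectation[arm]
            self.arm_to_expectation[arm] += delta / self._counts[arm]

    def predict(self, contexts: Optional[np.ndarray] = None) -> Arm:
        return max(self.arm_to_expectation, key=self.arm_to_expectation.get)

    def predict_expectations(self, contexts: Optional[np.ndarray] = None) -> Dict[Arm, Num]:
        return dict(self.arm_to_expectation)

    def warm_start(self, arm_to_features: Dict[Arm, List[Num]], distance_quantile: float) -> None:
        pass  # context-free sketch: no per-arm model to copy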

mab

This module defines the public interface of the MABWiser Library providing access to the following modules:

  • MAB

  • LearningPolicy

  • NeighborhoodPolicy

class mabwiser.mab.LearningPolicy

Bases: NamedTuple

class EpsilonGreedy(epsilon: int | float = 0.1)

Bases: NamedTuple

Epsilon Greedy Learning Policy.

This policy selects the arm with the highest expected reward with probability 1 - \(\epsilon\), and with probability \(\epsilon\) it selects an arm at random for exploration.

epsilon

The probability of selecting a random arm for exploration. Integer or float. Must be between 0 and 1. Default value is 0.1.

Type:

Num

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
epsilon: int | float

Alias for field number 0

class LinGreedy(epsilon: int | float = 0.1, l2_lambda: int | float = 1.0, scale: bool = False)

Bases: NamedTuple

LinGreedy Learning Policy.

This policy trains a ridge regression for each arm. Then, given a context, it predicts a regression value. This policy selects the arm with the highest regression value with probability 1 - \(\epsilon\), and with probability \(\epsilon\) it selects an arm at random for exploration.

epsilon

The probability of selecting a random arm for exploration. Integer or float. Must be between 0 and 1. Default value is 0.1.

Type:

Num

l2_lambda

The regularization strength. Integer or float. Cannot be negative. Default value is 1.0.

Type:

Num

scale

Whether to scale features to have zero mean and unit variance. Uses StandardScaler in sklearn.preprocessing. Default value is False.

Type:

bool

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinGreedy(epsilon=0.5))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
epsilon: int | float

Alias for field number 0

l2_lambda: int | float

Alias for field number 1

scale: bool

Alias for field number 2

class LinTS(alpha: int | float = 1.0, l2_lambda: int | float = 1.0, scale: bool = False)

Bases: NamedTuple

LinTS Learning Policy

For each arm LinTS trains a ridge regression and creates a multivariate normal distribution for the coefficients using the calculated coefficients as the mean and the covariance as:

\[\alpha^{2} (x_i^{T}x_i + \lambda * I_d)^{-1}\]

The normal distribution is randomly sampled to obtain expected coefficients for the ridge regression for each prediction.

\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.

The multivariate normal distribution uses Cholesky decomposition to guarantee deterministic behavior. This method requires that the covariance is a positive definite matrix. To ensure this is the case, alpha and l2_lambda are required to be greater than zero.

alpha

The multiplier to determine the degree of exploration. Integer or float. Must be greater than zero. Default value is 1.0.

Type:

Num

l2_lambda

The regularization strength. Integer or float. Must be greater than zero. Default value is 1.0.

Type:

Num

scale

Whether to scale features to have zero mean and unit variance. Uses StandardScaler in sklearn.preprocessing. Default value is False.

Type:

bool

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinTS(alpha=0.25))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
alpha: int | float

Alias for field number 0

l2_lambda: int | float

Alias for field number 1

scale: bool

Alias for field number 2

class LinUCB(alpha: int | float = 1.0, l2_lambda: int | float = 1.0, scale: bool = False)

Bases: NamedTuple

LinUCB Learning Policy.

This policy trains a ridge regression for each arm. Then, given a context, it predicts a regression value and calculates the upper confidence bound of that prediction. The arm with the highest upper bound is selected.

The UCB for each arm is calculated as:

\[UCB = x_i \beta + \alpha \sqrt{(x_i^{T}x_i + \lambda * I_d)^{-1}x_i}\]

Where \(\beta\) is the matrix of the ridge regression coefficients, \(\lambda\) is the regularization strength, and \(I_d\) is a \(d \times d\) identity matrix, where \(d\) is the number of features in the context data.

\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.

alpha

The parameter to control the exploration. Integer or float. Cannot be negative. Default value is 1.0.

Type:

Num

l2_lambda

The regularization strength. Integer or float. Cannot be negative. Default value is 1.0.

Type:

Num

scale

Whether to scale features to have zero mean and unit variance. Uses StandardScaler in sklearn.preprocessing. Default value is False.

Type:

bool

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinUCB(alpha=1.25))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
alpha: int | float

Alias for field number 0

l2_lambda: int | float

Alias for field number 1

scale: bool

Alias for field number 2

class Popularity

Bases: NamedTuple

Randomized Popularity Learning Policy.

Returns a randomized popular arm for each prediction. The probability of selecting each arm is weighted by its mean reward. It assumes that the rewards are non-negative.

The probability of selection is calculated as:

\[P(arm) = \frac{ \mu_i } { \Sigma{ \mu } }\]

where \(\mu_i\) is the mean reward for that arm.

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Popularity())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
class Random

Bases: NamedTuple

Random Learning Policy.

Returns a random arm for each prediction. Each arm is selected uniformly at random.

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Random())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
class Softmax(tau: int | float = 1)

Bases: NamedTuple

Softmax Learning Policy.

This policy selects each arm with a probability proportionate to its average reward. The selection probability of each arm is calculated with a softmax function over the mean rewards:

\[P(arm) = \frac{ e ^ \frac{\mu_i - \max{\mu}}{ \tau } } { \Sigma{e ^ \frac{\mu - \max{\mu}}{ \tau }} }\]

where \(\mu_i\) is the mean reward for that arm and \(\tau\) is the “temperature” to determine the degree of exploration.

tau

The temperature to control the exploration. Integer or float. Must be greater than zero. Default value is 1.

Type:

Num

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Softmax(tau=1))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
tau: int | float

Alias for field number 0

class ThompsonSampling(binarizer: Callable | None = None)

Bases: NamedTuple

Thompson Sampling Learning Policy.

This policy creates a beta distribution for each arm and then randomly samples from these distributions. The arm with the highest sample value is selected.

Notice that rewards must be binary to create beta distributions. If rewards are not binary, see the binarizer function.

binarizer

If rewards are not binary, a binarizer function is required. Given an arm decision and its corresponding reward, the binarizer function returns True/False or 0/1 to denote whether the decision counts as a success, i.e., True/1 based on the reward or False/0 otherwise.

The function signature of the binarizer is:

binarize(arm: Arm, reward: Num) -> True/False or 0/1

Type:

Callable

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [1, 1, 1, 0]
>>> mab = MAB(list_of_arms, LearningPolicy.ThompsonSampling())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> arm_to_threshold = {'Arm1':10, 'Arm2':10}
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [10, 20, 15, 7]
>>> def binarize(arm, reward): return reward > arm_to_threshold[arm]
>>> mab = MAB(list_of_arms, LearningPolicy.ThompsonSampling(binarizer=binarize))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
binarizer: Callable

Alias for field number 0

class UCB1(alpha: int | float = 1)

Bases: NamedTuple

Upper Confidence Bound1 Learning Policy.

This policy calculates an upper confidence bound for the mean reward of each arm. It greedily selects the arm with the highest upper confidence bound.

The UCB for each arm is calculated as:

\[UCB = \mu_i + \alpha \times \sqrt{\frac{2 \times \log(N)}{n_i}}\]

Where \(\mu_i\) is the mean for that arm, \(N\) is the total number of trials, and \(n_i\) is the number of times the arm has been selected.

\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.

alpha

The parameter to control the exploration. Integer or float. Cannot be negative. Default value is 1.

Type:

Num

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.UCB1(alpha=1.25))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
alpha: int | float

Alias for field number 0

class mabwiser.mab.MAB(arms: List[Arm], learning_policy: LearningPolicyType, neighborhood_policy: NeighborhoodPolicyType | None = None, seed: int = 123456, n_jobs: int = 1, backend: str | None = None)

Bases: object

MABWiser: Contextual Multi-Armed Bandit Library

MABWiser is a research library for fast prototyping of multi-armed bandit algorithms. It supports context-free, parametric and non-parametric contextual bandit models.

arms

The list of all the arms available for decisions. Arms can be integers, strings, etc.

Type:

list

learning_policy

The learning policy.

Type:

LearningPolicyType

neighborhood_policy

The neighborhood policy.

Type:

NeighborhoodPolicyType

is_contextual

True if a contextual policy is given, False otherwise. This is a read-only data field.

Type:

bool

seed

The random seed to initialize the internal random number generator. This is a read-only data field.

Type:

numbers.Rational

n_jobs

This is used to specify how many concurrent processes/threads should be used for parallelized routines. Default value is set to 1. If set to -1, all CPUs are used. If set to -2, all CPUs but one are used, and so on.

Type:

int

backend

Specify a parallelization backend implementation supported in the joblib library. Supported options are:

  • “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.

  • “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.

  • “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.

Default value is None. In this case the default backend selected by joblib will be used.

Type:

str, optional

Examples

>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
>>> mab.add_arm('Arm3')
>>> mab.partial_fit(['Arm3'], [30])
>>> mab.predict()
'Arm3'
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1', 'Arm2']
>>> rewards = [20, 17, 25, 9, 11]
>>> contexts = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 1]]
>>> contextual_mab = MAB(arms, LearningPolicy.EpsilonGreedy(), NeighborhoodPolicy.KNearest(k=3))
>>> contextual_mab.fit(decisions, rewards, contexts)
>>> contextual_mab.predict([[1, 1, 0], [1, 1, 1], [0, 1, 0]])
['Arm2', 'Arm2', 'Arm2']
>>> contextual_mab.add_arm('Arm3')
>>> contextual_mab.partial_fit(['Arm3'], [30], [[1, 1, 1]])
>>> contextual_mab.predict([[1, 1, 1]])
'Arm3'
add_arm(arm: Arm, binarizer: Callable | None = None) None

Adds an arm to the list of arms.

Incorporates the arm into the learning and neighborhood policies with no training data.

Parameters:
  • arm (Arm) – The new arm to be added.

  • binarizer (Callable) – The new binarizer function for Thompson Sampling.

Return type:

No return.

Raises:
  • TypeError – For ThompsonSampling, binarizer must be a callable function.

  • ValueError – A binarizer function was provided but the learning policy is not Thompson Sampling.

  • ValueError – The arm already exists.

  • ValueError – The arm is None.

  • ValueError – The arm is NaN.

  • ValueError – The arm is Infinity.

property cold_arms: List[Arm]

fit(decisions: List[Arm] | ndarray | Series, rewards: List[int | float] | ndarray | Series, contexts: None | List[List[int | float]] | ndarray | Series | DataFrame = None) None

Fits the multi-armed bandit to the given decisions, their corresponding rewards and contexts, if any.

Validates arguments and raises exceptions in case there are violations.

This function makes the following assumptions:
  • each decision corresponds to an arm of the bandit.

  • there are no None, NaN, or Infinity values in the contexts.

Parameters:
  • decisions (Union[List[Arm], np.ndarray, pd.Series]) – The decisions that are made.

  • rewards (Union[List[Num], np.ndarray, pd.Series]) – The rewards that are received corresponding to the decisions.

  • contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None, i.e., no contexts.

Return type:

No return.

Raises:
  • TypeError – Decisions and rewards are not given as list, numpy array or pandas series.

  • TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.

  • ValueError – Length mismatch between decisions, rewards, and contexts.

  • ValueError – Fitting contexts data when there is no contextual policy.

  • ValueError – Contextual policy when fitting no contexts data.

  • ValueError – Rewards contain None, NaN, or Infinity.

property learning_policy

Creates named tuple of the learning policy based on the implementor.

Return type:

The learning policy.

Raises:

NotImplementedError – MAB learning_policy property not implemented for this learning policy.

property neighborhood_policy

Creates named tuple of the neighborhood policy based on the implementor.

Return type:

The neighborhood policy.

partial_fit(decisions: List[Arm] | ndarray | Series, rewards: List[int | float] | ndarray | Series, contexts: None | List[List[int | float]] | ndarray | Series | DataFrame = None) None

Updates the multi-armed bandit with the given decisions, their corresponding rewards and contexts, if any.

Validates arguments and raises exceptions in case there are violations.

This function makes the following assumptions:
  • each decision corresponds to an arm of the bandit.

  • there are no None, NaN, or Infinity values in the contexts.

Parameters:
  • decisions (Union[List[Arm], np.ndarray, pd.Series]) – The decisions that are made.

  • rewards (Union[List[Num], np.ndarray, pd.Series]) – The rewards that are received corresponding to the decisions.

  • contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None, i.e., no contexts.

Return type:

No return.

Raises:
  • TypeError – Decisions and rewards are not given as list, numpy array or pandas series.

  • TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.

  • ValueError – Length mismatch between decisions, rewards, and contexts.

  • ValueError – Fitting contexts data when there is no contextual policy.

  • ValueError – Contextual policy when fitting no contexts data.

  • ValueError – Rewards contain None, NaN, or Infinity.

predict(contexts: None | List[int | float] | List[List[int | float]] | ndarray | Series | DataFrame = None) Arm | List[Arm]

Returns the “best” arm (or arms list if multiple contexts are given) based on the expected reward.

Which arm is best depends on the specified learning policy. Contextual learning policies and neighborhood policies require contexts data in training. In testing, they return the best arm given new context(s).

Parameters:

contexts (Union[None, List[Num], List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context for the expected rewards. Default value is None. If contexts is not None for context-free bandits, the predictions returned will be a list of the same length as contexts.

Return type:

The recommended arm or recommended arms list.

Raises:
  • TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.

  • ValueError – Prediction with context policy requires context data.

predict_expectations(contexts: None | List[int | float] | List[List[int | float]] | ndarray | Series | DataFrame = None) Dict[Arm, int | float] | List[Dict[Arm, int | float]]

Returns a dictionary of arms (key) to their expected rewards (value).

Contextual learning policies and neighborhood policies require contexts data for expected rewards.

Parameters:

contexts (Union[None, List[Num], List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context for the expected rewards. Default value is None. If contexts is not None for context-free bandits, the predicted expectations returned will be a list of the same length as contexts.

Return type:

The dictionary of arms (key) to their expected rewards (value), or a list of such dictionaries.

Raises:
  • TypeError – Contexts is not given as None, list, numpy array or pandas data frames.

  • ValueError – Prediction with context policy requires context data.
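
A brief, hedged usage sketch of predict_expectations for a context-free bandit follows. It reuses the toy data from the examples above; the resulting numbers are not verified outputs.

from mabwiser.mab import MAB, LearningPolicy

arms = ['Arm1', 'Arm2']
decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
rewards = [20, 17, 25, 9]

mab = MAB(arms, LearningPolicy.UCB1(alpha=1.25), seed=123456)
mab.fit(decisions, rewards)

# Context-free bandit: a single dictionary of arm -> expected reward is returned.
expectations = mab.predict_expectations()
best_arm = max(expectations, key=expectations.get)  # the arm predict() would typically recommend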

remove_arm(arm: Arm) None

Removes an arm from the list of arms.

Parameters:

arm (Arm) – The existing arm to be removed.

Return type:

No return.

Raises:
  • ValueError – The arm does not exist.

  • ValueError – The arm is None.

  • ValueError – The arm is NaN.

  • ValueError – The arm is Infinity.

warm_start(arm_to_features: Dict[Arm, List[int | float]], distance_quantile: float) None

Warm-start untrained (cold) arms of the multi-armed bandit.

Validates arguments and raises exceptions in case there are violations.

The warm-start procedure depends on the learning and neighborhood policy. Note that for certain neighborhood policies (e.g., LSHNearest, KNearest, Radius) warm start can only be performed after the nearest neighbors have been determined in the “predict” step. Accordingly, warm start has to be executed for each context being predicted, which is computationally expensive.

Parameters:
  • arm_to_features (Dict[Arm, List[Num]]) – Numeric representation for each arm.

  • distance_quantile (float) – Value between 0 and 1 used to determine if an item can be warm started or not using closest item. All cold items will be warm started if 1 and none will be warm started if 0.

Return type:

No return.

Raises:
  • TypeError – Arm features are not given as a dictionary.

  • TypeError – Distance quantile is not given as a float.

  • ValueError – Distance quantile is not between 0 and 1.

  • ValueError – The arms in arm_to_features do not match arms.
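
A hedged sketch of a warm-start call follows; the arm feature vectors are invented for illustration.

from mabwiser.mab import MAB, LearningPolicy

arms = ['Arm1', 'Arm2', 'Arm3']
decisions = ['Arm1', 'Arm1', 'Arm2']
rewards = [20, 17, 25]

mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=123456)
mab.fit(decisions, rewards)

print(mab.cold_arms)  # 'Arm3' has no observations, so it is a cold arm

# Copy information from the closest trained arm in feature space.
mab.warm_start(
    arm_to_features={'Arm1': [0.0, 1.0], 'Arm2': [1.0, 0.0], 'Arm3': [0.1, 0.9]},
    distance_quantile=0.75,
)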

class mabwiser.mab.NeighborhoodPolicy

Bases: NamedTuple

class Clusters(n_clusters: int | float = 2, is_minibatch: bool = False)

Bases: NamedTuple

Clusters Neighborhood Policy.

Clusters is a k-means clustering approach that uses the observations from the closest cluster with a learning policy. Supports KMeans and MiniBatchKMeans.

n_clusters

The number of clusters. Integer. Must be at least 2. Default value is 2.

Type:

Num

is_minibatch

Boolean flag to use MiniBatchKMeans or not. Default value is False.

Type:

bool

Example

>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.Clusters(3))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
is_minibatch: bool

Alias for field number 1

n_clusters: int | float

Alias for field number 0

class KNearest(k: int = 1, metric: str = 'euclidean')

Bases: NamedTuple

KNearest Neighborhood Policy.

KNearest is a nearest neighbors approach that selects the k-nearest observations to be used with a learning policy.

k

The number of neighbors to select. Integer value. Must be greater than zero. Default value is 1.

Type:

int

metric

The metric used to calculate distance. Accepts any of the metrics supported by scipy.spatial.distance.cdist. Default value is Euclidean distance.

Type:

str

Example

>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.KNearest(2, "euclidean"))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[1, 1]
k: int

Alias for field number 0

metric: str

Alias for field number 1

class LSHNearest(n_dimensions: int = 5, n_tables: int = 3, no_nhood_prob_of_arm: List | None = None)

Bases: NamedTuple

Locality-Sensitive Hashing Approximate Nearest Neighbors Policy.

LSHNearest is a nearest neighbors approach that uses locality sensitive hashing with a simhash to select observations to be used with a learning policy.

For the simhash, contexts are projected onto a hyperplane of n_context_cols x n_dimensions and each column of the hyperplane is evaluated for its sign, giving an ordered array of binary values. This is converted to a base 10 integer used as the hash code to assign the context to a hash table. This process is repeated for a specified number of hash tables, where each has a unique, randomly-generated hyperplane. To select the neighbors for a context, the hash code is calculated for each hash table and any contexts with the same hashes are selected as the neighbors.

As with the radius or k value for other nearest neighbors algorithms, selecting the best number of dimensions and tables requires tuning. For the dimensions, a good starting point is the log of the square root of the number of rows in the training data, which gives roughly sqrt(n_rows) possible hash codes per table.
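
The sketch below illustrates that starting point, assuming a base-2 logarithm (hash codes are binary, so each table has 2 ** n_dimensions possible codes).

import numpy as np

n_rows = 10_000                                        # rows in the training data
n_dimensions = max(1, int(np.log2(np.sqrt(n_rows))))   # ~6, so 2**6 = 64 codes, on the order of sqrt(10_000) = 100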

The number of dimensions and number of tables have inverse effects from each other on the number of empty neighborhoods and average neighborhood size. Increasing the dimensionality decreases the number of collisions, which increases the precision of the approximate neighborhood but also potentially increases the number of empty neighborhoods. Increasing the number of hash tables increases the likelihood of capturing neighbors the other random hyperplanes miss and increases the average neighborhood size. It should be noted that the fit operation is O(2**n_dimensions).

n_dimensions

The number of dimensions to use for the hyperplane. Integer value. Must be greater than zero. Default value is 5.

Type:

int

n_tables

The number of hash tables. Integer value. Must be greater than zero. Default value is 3.

Type:

int

no_nhood_prob_of_arm

The probabilities associated with each arm. Used to select random arm if context has no neighbors. If not given, a uniform random distribution over all arms is assumed. The probabilities should sum up to 1.

Type:

None or List

Example

>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.LSHNearest(5, 3))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
n_dimensions: int

Alias for field number 0

n_tables: int

Alias for field number 1

no_nhood_prob_of_arm: List | None

Alias for field number 2

class Radius(radius: int | float = 0.05, metric: str = 'euclidean', no_nhood_prob_of_arm: List | None = None)

Bases: NamedTuple

Radius Neighborhood Policy.

Radius is a nearest neighbors approach that selects the observations within a given radius to be used with a learning policy.

radius

The maximum distance within which to select observations. Integer or Float. Must be greater than zero. Default value is 0.05.

Type:

Num

metric

The metric used to calculate distance. Accepts any of the metrics supported by scipy.spatial.distance.cdist. Default value is Euclidean distance.

Type:

str

no_nhood_prob_of_arm

The probabilities associated with each arm. Used to select random arm if context has no neighbors. If not given, a uniform random distribution over all arms is assumed. The probabilities should sum up to 1.

Type:

None or List

Example

>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.Radius(2, "euclidean"))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
metric: str

Alias for field number 1

no_nhood_prob_of_arm: List | None

Alias for field number 2

radius: int | float

Alias for field number 0

class TreeBandit(tree_parameters: Dict = {})

Bases: NamedTuple

TreeBandit Neighborhood Policy.

This policy fits a decision tree for each arm using context history. It uses the leaves of these trees to partition the context space into regions and keeps a list of rewards for each leaf. To predict, it receives a context vector and goes to the corresponding leaf at each arm’s tree and applies the given context-free MAB learning policy to predict expectations and choose an arm.

The TreeBandit neighborhood policy is compatible with the following context-free learning policies only: EpsilonGreedy, ThompsonSampling and UCB1.

The TreeBandit neighborhood policy is a modified version of the TreeHeuristic algorithm presented in: Adam N. Elmachtoub, Ryan McNellis, Sechan Oh, Marek Petrik A Practical Method for Solving Contextual Bandit Problems Using Decision Trees, UAI 2017

tree_parameters

Parameters of the decision tree. The keys must match the parameters of sklearn.tree.DecisionTreeRegressor. When a parameter is not given, the default parameters from sklearn.tree.DecisionTreeRegressor will be chosen. Default value is an empty dictionary.

Type:

Dict, **kwarg

Example

>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.TreeBandit())
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
tree_parameters: Dict

Alias for field number 0
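
For example, decision-tree parameters can be forwarded to the per-arm trees as shown below. The keys are standard sklearn.tree.DecisionTreeRegressor arguments; the specific values are illustrative.

from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

mab = MAB(
    ['Arm1', 'Arm2'],
    LearningPolicy.EpsilonGreedy(epsilon=0),
    NeighborhoodPolicy.TreeBandit(tree_parameters={'max_depth': 3, 'min_samples_leaf': 2}),
)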

simulator

This module provides a simulation utility for comparing algorithms and hyper-parameter tuning.

class mabwiser.simulator.Simulator(bandits: List[tuple], decisions: List[Arm] | ndarray | Series, rewards: List[int | float] | ndarray | Series, contexts: None | List[List[int | float]] | ndarray | Series | DataFrame = None, scaler: callable | None = None, test_size: float = 0.3, is_ordered: bool = False, batch_size: int = 0, evaluator: callable = <function default_evaluator>, seed: int = 123456, is_quick: bool = False, log_file: str | None = None, log_format: str = '%(asctime)s %(levelname)s %(message)s')

Bases: object

Multi-Armed Bandit Simulator.

This utility runs a simulation using historic data and a collection of multi-armed bandits from the MABWiser library, or custom bandits that extend the BaseMAB class in MABWiser.

It can be used to run a simple simulation with a single bandit or to compare multiple bandits for policy selection, hyper-parameter tuning, etc.

Nearest Neighbor bandits that use the default Radius and KNearest implementations from MABWiser are converted to custom versions that share distance calculations to speed up the simulation. These custom versions also track statistics about the neighborhoods that can be used in evaluation.

The results can be accessed as the arms_to_stats, model_to_predictions, model_to_confusion_matrices, and models_to_evaluations properties.

When using partial fitting, an additional confusion matrix is calculated for all predictions after all of the batches are processed.

A log of the simulation tracks the experiment progress.

bandits

A list of tuples of the name of each bandit and the bandit object.

Type:

list[(str, bandit)]

decisions

The complete decision history to be used in train and test.

Type:

array

rewards

The complete reward history to be used in train and test.

Type:

array

contexts

The complete context history to be used in train and test.

Type:

array

scaler

A scaler object from sklearn.preprocessing.

Type:

scaler

test_size

The size of the test set

Type:

float

is_ordered

Whether to use a chronological division for the train-test split. If false, uses sklearn’s train_test_split.

Type:

bool

batch_size

The size of each batch for online learning.

Type:

int

evaluator

The function for evaluating the bandits. Values are stored in bandit_to_arm_to_stats_avg. Must have the function signature function(arms_to_stats_train: dictionary, predictions: list, decisions: np.ndarray, rewards: np.ndarray, metric: str).

Type:

callable

is_quick

Flag to skip neighborhood statistics.

Type:

bool

logger

The logger object.

Type:

Logger

arms

The list of arms used by the bandits.

Type:

list

arm_to_stats_total

Descriptive statistics for the complete data set.

Type:

dict

arm_to_stats_train

Descriptive statistics for the training data.

Type:

dict

arm_to_stats_test

Descriptive statistics for the test data.

Type:

dict

bandit_to_arm_to_stats_avg

Descriptive statistics for the predictions made by each bandit based on means from training data.

Type:

dict

bandit_to_arm_to_stats_min

Descriptive statistics for the predictions made by each bandit based on minimums from training data.

Type:

dict

bandit_to_arm_to_stats_max

Descriptive statistics for the predictions made by each bandit based on maximums from training data.

Type:

dict

bandit_to_confusion_matrices

The confusion matrices for each bandit.

Type:

dict

bandit_to_predictions

The prediction for each item in the test set for each bandit.

Type:

dict

bandit_to_expectations

The arm_to_expectations for each item in the test set for each bandit. For context-free bandits, there is a single dictionary for each batch.

Type:

dict

bandit_to_neighborhood_size

The number of neighbors in each neighborhood for each row in the test set. Calculated when using a Radius neighborhood policy, or a custom class that inherits from it. Not calculated when is_quick is True.

Type:

dict

bandit_to_arm_to_stats_neighborhoods

The arm_to_stats for each neighborhood for each row in the test set. Calculated when using Radius or KNearest, or a custom class that inherits from one of them. Not calculated when is_quick is True.

Type:

dict

test_indices

The indices of the rows in the test set. If input was not zero-indexed, these will reflect their position in the input rather than actual index.

Type:

list

Example

>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab1 = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab2 = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.30), seed=123456)
>>> bandits = [('EG 25%', mab1), ('EG 30%', mab2)]
>>> offline_sim = Simulator(bandits, decisions, rewards, test_size=0.5, batch_size=0)
>>> offline_sim.run()
>>> offline_sim.bandit_to_arm_to_stats_avg['EG 30%']['Arm1']
{'count': 1, 'sum': 9, 'min': 9, 'max': 9, 'mean': 9.0, 'std': 0.0}
get_arm_stats(decisions: ndarray, rewards: ndarray) dict

Calculates descriptive statistics for each arm in the provided data set.

Parameters:
  • decisions (np.ndarray) – The decisions to filter the rewards.

  • rewards (np.ndarray) – The rewards to get statistics about.

Returns:

  • Arm_to_stats dictionary.

  • Dictionary has the format {arm: {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}}

static get_stats(rewards: ndarray) dict

Calculates descriptive statistics for the given array of rewards.

Parameters:

rewards (np.ndarray) – Array of rewards for a single arm.

Returns:

  • A dictionary of descriptive statistics.

  • Dictionary has the format {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}

plot(metric: str = 'avg', is_per_arm: bool = False) None

Generates a plot of the cumulative sum of the rewards for each bandit. Simulation must be run before calling this method.

Parameters:
  • metric (str) – The bandit_to_arm_to_stats to use to generate the plot. Must be ‘avg’, ‘min’, or ‘max’.

  • is_per_arm (bool) – Whether to plot each arm separately or use an aggregate statistic.

Raises:
  • AssertionError – Descriptive statistics for predictions are missing.

  • TypeError – Metric must be a string.

  • TypeError – The per_arm flag must be a boolean.

  • ValueError – The metric must be one of avg, min or max.

Return type:

None
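
For instance, after running the simulation from the class example above, a per-arm plot of the ‘avg’ statistics could be generated as follows.

# Plot cumulative rewards per arm using the 'avg' statistics;
# offline_sim is the Simulator instance built and run in the example above.
offline_sim.plot(metric='avg', is_per_arm=True)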

run() None

Runs the simulator.

Runs a simulation concurrently for all bandits in the bandits list.

Return type:

None

mabwiser.simulator.default_evaluator(arms: List[Arm], decisions: ndarray, rewards: ndarray, predictions: List[Arm], arm_to_stats: dict, stat: str, start_index: int, nn: bool = False) dict

Default evaluation function.

Calculates predicted rewards for the test batch based on predicted arms. When the predicted arm is the same as the historic decision, the historic reward is used. When the predicted arm is different, the mean, min or max reward from the training data is used. If using Radius or KNearest neighborhood policy, the statistics from the neighborhood are used instead of the entire training set.

The simulator supports custom evaluation functions, but they must have this signature to work with the simulation pipeline.

Parameters:
  • arms (list) – The list of arms.

  • decisions (np.ndarray) – The historic decisions for the batch being evaluated.

  • rewards (np.ndarray) – The historic rewards for the batch being evaluated.

  • predictions (list) – The predictions for the batch being evaluated.

  • arm_to_stats (dict) – The dictionary of descriptive statistics for each arm to use in evaluation.

  • stat (str) – Which metric from arm_to_stats to use. Takes the values ‘min’, ‘max’, ‘mean’.

  • start_index (int) – The index of the first row in the batch. For offline simulations it is 0. For online simulations it is batch size * batch number. Used to select the correct index from arm_to_stats if there are separate entries for each row in the test set.

  • nn (bool) – Whether the results are from one of the simulator custom nearest neighbors implementations.

Returns:

  • An arm_to_stats dictionary for the predictions in the batch.

  • Dictionary has the format {arm: {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}}
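
The sketch below shows a custom evaluator with the required signature. It is a simplified stand-in for illustration: it replays the historic reward when the prediction matches the decision, falls back to the chosen training statistic otherwise, and ignores the start_index and nn arguments that the built-in evaluator uses for row-level neighborhood statistics.

from typing import List

import numpy as np

from mabwiser.utils import Arm


def simple_replay_evaluator(arms: List[Arm], decisions: np.ndarray, rewards: np.ndarray,
                            predictions: List[Arm], arm_to_stats: dict, stat: str,
                            start_index: int, nn: bool = False) -> dict:
    # One entry per arm, in the same {'count', 'sum', ...} spirit as the default output.
    results = {arm: {'count': 0, 'sum': 0.0} for arm in arms}
    for decision, reward, prediction in zip(decisions, rewards, predictions):
        # Use the historic reward on a match, otherwise the arm's training statistic.
        estimate = reward if prediction == decision else arm_to_stats[prediction][stat]
        results[prediction]['count'] += 1
        results[prediction]['sum'] += estimate
    return results

# Hypothetical usage: Simulator(bandits, decisions, rewards, evaluator=simple_replay_evaluator)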

utils

This module provides a number of constants and helper functions.

class mabwiser.utils.Arm

Arm type is defined as integer, float, or string.

alias of Union[int, float, str]

class mabwiser.utils.Constants

Bases: NamedTuple

Constant values used by the modules.

default_seed = 123456

The default random seed.

distance_metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean']

The distance metrics supported by neighborhood policies.

mabwiser.utils.Num

Num type is defined as integer or float.

alias of Union[int, float]

mabwiser.utils.argmax(dictionary: Dict[Arm, int | float]) Arm

Returns the first key with the maximum value.

mabwiser.utils.argmin(dictionary: Dict) Arm

Returns the first key that has the minimum value.

mabwiser.utils.check_false(expression: bool, exception: Exception) None

Checks that given expression is false, otherwise raises the given exception.

mabwiser.utils.check_true(expression: bool, exception: Exception) None

Checks that given expression is true, otherwise raises the given exception.

mabwiser.utils.create_rng(seed: int) _BaseRNG

Returns an rng object

Parameters:

seed (int) – the seed of the rng

Returns:

out – An rng object that implements the base rng class

Return type:

_BaseRNG

mabwiser.utils.reset(dictionary: Dict, value) None

Maps every key to the given value.
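
Brief usage sketches of these helpers, with illustrative values:

from mabwiser.utils import argmax, argmin, check_false, check_true, create_rng, reset

expectations = {'Arm1': 0.2, 'Arm2': 0.9, 'Arm3': 0.9}

best = argmax(expectations)    # 'Arm2' -- first key with the maximum value
worst = argmin(expectations)   # 'Arm1'

check_true(0 <= 0.25 <= 1, ValueError("epsilon must be between 0 and 1"))  # passes silently
check_false(best is None, ValueError("no arm selected"))                   # passes silently

rng = create_rng(seed=123456)  # an object implementing the _BaseRNG interface

reset(expectations, 0)         # every arm now maps to 0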