MABWiser Public API
base_mab
This module defines the abstract base class for contextual multi-armed bandit algorithms.
- class mabwiser.base_mab.BaseMAB(rng: _BaseRNG, arms: List[Arm], n_jobs: int, backend: str | None = None)
Bases:
object
Abstract base class for multi-armed bandits.
This module is not intended to be used directly; instead, it declares the basic skeleton of multi-armed bandits together with a set of parameters that are common to every bandit algorithm.
It declares abstract methods that sub-classes can override to implement specific bandit policies using:
- __init__: constructor to initialize the bandit
- add_arm: method to add a new arm
- fit: method for training
- partial_fit: method for online learning
- predict_expectations: method to retrieve the expectation of each arm
- predict: method for testing to retrieve the best arm based on the policy
- remove_arm: method for removing an arm
- warm_start: method for warm starting untrained (cold) arms
- rng
The random number generator.
- Type:
_BaseRNG
- arms
The list of all arms.
- Type:
List
- n_jobs
This is used to specify how many concurrent processes/threads should be used for parallelized routines. Default value is set to 1. If set to -1, all CPUs are used. If set to -2, all CPUs but one are used, and so on.
- Type:
int
- backend
Specify a parallelization backend implementation supported in the joblib library. Supported options are:
- “loky” used by default, can induce some communication and memory overhead when exchanging input and output.
- “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.
- “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.
Default value is None. In this case the default backend selected by joblib will be used.
- Type:
str, optional
- arm_to_expectation
The dictionary of arms (keys) to their expected rewards (values).
- Type:
Dict[Arm, float]
- arm_to_status
The dictionary of arms (keys) to their status (values), where the status consists of:
- is_trained, which indicates whether an arm was fit or partial_fit;
- is_warm, which indicates whether an arm was warm started, and therefore has a trained model associated;
- warm_started_by, which indicates the arm that originally warm started this arm.
Arms that were initially warm-started and then updated with partial_fit will retain is_warm as True with the relevant warm_started_by arm for tracking purposes.
- Type:
Dict[Arm, dict]
- add_arm(arm: Arm, binarizer: Callable | None = None) None
Introduces a new arm to the bandit.
Adds the new arm with zero expectations and calls the _uptake_new_arm() function of the sub-class.
- abstract fit(decisions: ndarray, rewards: ndarray, contexts: ndarray | None = None) None
Abstract method.
Fits the multi-armed bandit to the given decision and reward history and corresponding contexts if any.
- abstract partial_fit(decisions: ndarray, rewards: ndarray, contexts: ndarray | None = None) None
Abstract method.
Updates the multi-armed bandit with the given decision and reward history and corresponding contexts if any.
- abstract predict(contexts: ndarray | None = None) Arm | List[Arm]
Abstract method.
Returns the predicted arm.
- abstract predict_expectations(contexts: ndarray | None = None) Dict[Arm, int | float] | List[Dict[Arm, int | float]]
Abstract method.
Returns a dictionary from arms (keys) to their expected rewards (values).
- property trained_arms: List[Arm]
List of trained arms.
Arms for which at least one decision has been observed are deemed trained.
- abstract warm_start(arm_to_features: Dict[Arm, List[int | float]], distance_quantile: float) None
Abstract method.
Warm starts cold arms using similar warm arms based on distances between arm features. Only implemented for Learning Policies that make use of the _warm_start method to copy arm information.
mab
This module defines the public interface of the MABWiser Library providing access to the following modules:
MAB
LearningPolicy
NeighborhoodPolicy
- class mabwiser.mab.LearningPolicy
Bases:
NamedTuple
- class EpsilonGreedy(epsilon: int | float = 0.1)
Bases:
NamedTuple
Epsilon Greedy Learning Policy.
This policy selects the arm with the highest expected reward with probability 1 - \(\epsilon\), and with probability \(\epsilon\) it selects an arm at random for exploration.
- epsilon
The probability of selecting a random arm for exploration. Integer or float. Must be between 0 and 1. Default value is 0.1.
- Type:
Num
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
- epsilon: int | float
Alias for field number 0
- class LinGreedy(epsilon: int | float = 0.1, l2_lambda: int | float = 1.0, scale: bool = False)
Bases:
NamedTuple
LinGreedy Learning Policy.
This policy trains a ridge regression for each arm. Then, given a context, it predicts a regression value. This policy selects the arm with the highest regression value with probability 1 - \(\epsilon\), and with probability \(\epsilon\) it selects an arm at random for exploration.
- epsilon
The probability of selecting a random arm for exploration. Integer or float. Must be between 0 and 1. Default value is 0.1.
- Type:
Num
- l2_lambda
The regularization strength. Integer or float. Cannot be negative. Default value is 1.0.
- Type:
Num
- scale
Whether to scale features to have zero mean and unit variance. Uses StandardScaler in sklearn.preprocessing. Default value is False.
- Type:
bool
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinGreedy(epsilon=0.5))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
- epsilon: int | float
Alias for field number 0
- l2_lambda: int | float
Alias for field number 1
- scale: bool
Alias for field number 2
- class LinTS(alpha: int | float = 1.0, l2_lambda: int | float = 1.0, scale: bool = False)
Bases:
NamedTuple
LinTS Learning Policy.
For each arm LinTS trains a ridge regression and creates a multivariate normal distribution for the coefficients using the calculated coefficients as the mean and the covariance as:
\[\alpha^{2} (x_i^{T}x_i + \lambda * I_d)^{-1}\]
The normal distribution is randomly sampled to obtain expected coefficients for the ridge regression for each prediction.
\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
The multivariate normal distribution uses Cholesky decomposition to guarantee deterministic behavior. This method requires that the covariance is a positive definite matrix. To ensure this is the case, alpha and l2_lambda are required to be greater than zero.
- alpha
The multiplier to determine the degree of exploration. Integer or float. Must be greater than zero. Default value is 1.0.
- Type:
Num
- l2_lambda
The regularization strength. Integer or float. Must be greater than zero. Default value is 1.0.
- Type:
Num
- scale
Whether to scale features to have zero mean and unit variance. Uses StandardScaler in sklearn.preprocessing. Default value is False.
- Type:
bool
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinTS(alpha=0.25))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
- alpha: int | float
Alias for field number 0
- l2_lambda: int | float
Alias for field number 1
- scale: bool
Alias for field number 2
- class LinUCB(alpha: int | float = 1.0, l2_lambda: int | float = 1.0, scale: bool = False)
Bases:
NamedTuple
LinUCB Learning Policy.
This policy trains a ridge regression for each arm. Then, given a context, it predicts a regression value and calculates the upper confidence bound of that prediction. The arm with the highest upper bound is selected.
The UCB for each arm is calculated as:
\[UCB = x_i \beta + \alpha \sqrt{(x_i^{T}x_i + \lambda * I_d)^{-1}x_i}\]
where \(\beta\) denotes the ridge regression coefficients, \(\lambda\) is the regularization strength, and \(I_d\) is a \(d \times d\) identity matrix, where \(d\) is the number of features in the context data.
\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
- alpha
The parameter to control the exploration. Integer or float. Cannot be negative. Default value is 1.0.
- Type:
Num
- l2_lambda
The regularization strength. Integer or float. Cannot be negative. Default value is 1.0.
- Type:
Num
- scale
Whether to scale features to have zero mean and unit variance. Uses StandardScaler in sklearn.preprocessing. Default value is False.
- Type:
bool
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.LinUCB(alpha=1.25))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
- alpha: int | float
Alias for field number 0
- l2_lambda: int | float
Alias for field number 1
- scale: bool
Alias for field number 2
- class Popularity
Bases:
NamedTuple
Randomized Popularity Learning Policy.
Returns a randomized popular arm for each prediction. The probability of selection for each arm is weighted by their mean reward. It assumes that the rewards are non-negative.
The probability of selection is calculated as:
\[P(arm) = \frac{ \mu_i }{ \Sigma{ \mu } }\]
where \(\mu_i\) is the mean reward for that arm.
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Popularity())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
- class Random
Bases:
NamedTuple
Random Learning Policy.
Returns a random arm for each prediction. Each arm is selected uniformly at random.
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Random())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
- class Softmax(tau: int | float = 1)
Bases:
NamedTuple
Softmax Learning Policy.
This policy selects each arm with a probability proportionate to its average reward. The average reward is calculated as a logistic function with each probability as:
\[P(arm) = \frac{ e^{\frac{\mu_i - \max{\mu}}{\tau}} }{ \Sigma{e^{\frac{\mu - \max{\mu}}{\tau}}} }\]
where \(\mu_i\) is the mean reward for that arm and \(\tau\) is the “temperature” to determine the degree of exploration.
- tau
The temperature to control the exploration. Integer or float. Must be greater than zero. Default value is 1.
- Type:
Num
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.Softmax(tau=1))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
- tau: int | float
Alias for field number 0
- class ThompsonSampling(binarizer: Callable | None = None)
Bases:
NamedTuple
Thompson Sampling Learning Policy.
This policy creates a beta distribution for each arm and then randomly samples from these distributions. The arm with the highest sample value is selected.
Notice that rewards must be binary to create beta distributions. If rewards are not binary, see the binarizer function.
- binarizer
If rewards are not binary, a binarizer function is required. Given an arm decision and its corresponding reward, the binarizer function returns True/False or 0/1 to denote whether the decision counts as a success, i.e., True/1 based on the reward or False/0 otherwise.
The function signature of the binarizer is:
binarize(arm: Arm, reward: Num) -> True/False or 0/1
- Type:
Callable
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [1, 1, 1, 0]
>>> mab = MAB(list_of_arms, LearningPolicy.ThompsonSampling())
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> arm_to_threshold = {'Arm1':10, 'Arm2':10}
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [10, 20, 15, 7]
>>> def binarize(arm, reward): return reward > arm_to_threshold[arm]
>>> mab = MAB(list_of_arms, LearningPolicy.ThompsonSampling(binarizer=binarize))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
- binarizer: Callable
Alias for field number 0
- class UCB1(alpha: int | float = 1)
Bases:
NamedTuple
Upper Confidence Bound1 Learning Policy.
This policy calculates an upper confidence bound for the mean reward of each arm. It greedily selects the arm with the highest upper confidence bound.
The UCB for each arm is calculated as:
\[UCB = \mu_i + \alpha \times \sqrt{\frac{2 \times \log(N)}{n_i}}\]
where \(\mu_i\) is the mean for that arm, \(N\) is the total number of trials, and \(n_i\) is the number of times the arm has been selected.
\(\alpha\) is a factor used to adjust how conservative the estimate is. Higher \(\alpha\) values promote more exploration.
- alpha
The parameter to control the exploration. Integer or float. Cannot be negative. Default value is 1.
- Type:
Num
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(list_of_arms, LearningPolicy.UCB1(alpha=1.25))
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm2'
- alpha: int | float
Alias for field number 0
- class mabwiser.mab.MAB(arms: List[Arm], learning_policy: LearningPolicyType, neighborhood_policy: NeighborhoodPolicyType | None = None, seed: int = 123456, n_jobs: int = 1, backend: str | None = None)
Bases:
object
MABWiser: Contextual Multi-Armed Bandit Library
MABWiser is a research library for fast prototyping of multi-armed bandit algorithms. It supports context-free, parametric and non-parametric contextual bandit models.
- arms
The list of all the arms available for decisions. Arms can be integers, strings, etc.
- Type:
list
- learning_policy
The learning policy.
- Type:
LearningPolicyType
- neighborhood_policy
The neighborhood policy.
- Type:
NeighborhoodPolicyType
- is_contextual
True if a contextual policy is given, False otherwise. This is a read-only data field.
- Type:
bool
- seed
The random seed to initialize the internal random number generator. This is a read-only data field.
- Type:
numbers.Rational
- n_jobs
This is used to specify how many concurrent processes/threads should be used for parallelized routines. Default value is set to 1. If set to -1, all CPUs are used. If set to -2, all CPUs but one are used, and so on.
- Type:
int
- backend
Specify a parallelization backend implementation supported in the joblib library. Supported options are:
- “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.
- “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.
- “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.
Default value is None. In this case the default backend selected by joblib will be used.
- Type:
str, optional
Examples
>>> from mabwiser.mab import MAB, LearningPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(decisions, rewards)
>>> mab.predict()
'Arm1'
>>> mab.add_arm('Arm3')
>>> mab.partial_fit(['Arm3'], [30])
>>> mab.predict()
'Arm3'
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1', 'Arm2']
>>> rewards = [20, 17, 25, 9, 11]
>>> contexts = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 1]]
>>> contextual_mab = MAB(arms, LearningPolicy.EpsilonGreedy(), NeighborhoodPolicy.KNearest(k=3))
>>> contextual_mab.fit(decisions, rewards, contexts)
>>> contextual_mab.predict([[1, 1, 0], [1, 1, 1], [0, 1, 0]])
['Arm2', 'Arm2', 'Arm2']
>>> contextual_mab.add_arm('Arm3')
>>> contextual_mab.partial_fit(['Arm3'], [30], [[1, 1, 1]])
>>> contextual_mab.predict([[1, 1, 1]])
'Arm3'
- add_arm(arm: Arm, binarizer: Callable | None = None) None
Adds an arm to the list of arms.
Incorporates the arm into the learning and neighborhood policies with no training data.
- Parameters:
arm (Arm) – The new arm to be added.
binarizer (Callable) – The new binarizer function for Thompson Sampling.
- Return type:
No return.
- Raises:
TypeError – For ThompsonSampling, binarizer must be a callable function.
ValueError – A binarizer function was provided but the learning policy is not Thompson Sampling.
ValueError – The arm already exists.
ValueError – The arm is None.
ValueError – The arm is NaN.
ValueError – The arm is Infinity.
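For instance, a new arm can be introduced to a Thompson Sampling bandit together with its own binarizer. The sketch below only uses the API documented above; the threshold of 10 and the arm names are illustrative.
>>> from mabwiser.mab import MAB, LearningPolicy
>>> def binarize(arm, reward): return reward > 10
>>> mab = MAB(['Arm1', 'Arm2'], LearningPolicy.ThompsonSampling(binarizer=binarize))
>>> mab.fit(['Arm1', 'Arm1', 'Arm2'], [20, 5, 15])
>>> mab.add_arm('Arm3', binarizer=binarize)
>>> mab.partial_fit(['Arm3'], [30])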
- fit(decisions: List[Arm] | ndarray | Series, rewards: List[int | float] | ndarray | Series, contexts: None | List[List[int | float]] | ndarray | Series | DataFrame = None) None
Fits the multi-armed bandit to the given decisions, their corresponding rewards and contexts, if any.
Validates arguments and raises exceptions in case there are violations.
- This function makes the following assumptions:
- each decision corresponds to an arm of the bandit.
- there are no None, NaN, or Infinity values in the contexts.
- Parameters:
decisions (Union[List[Arm], np.ndarray, pd.Series]) – The decisions that are made.
rewards (Union[List[Num], np.ndarray, pd.Series]) – The rewards that are received corresponding to the decisions.
contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None, i.e., no contexts.
- Return type:
No return.
- Raises:
TypeError – Decisions and rewards are not given as list, numpy array or pandas series.
TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.
ValueError – Length mismatch between decisions, rewards, and contexts.
ValueError – Fitting contexts data when there is no contextual policy.
ValueError – Contextual policy when fitting no contexts data.
ValueError – Rewards contain None, NaN, or Infinity.
- property learning_policy
Creates named tuple of the learning policy based on the implementor.
- Return type:
The learning policy.
- Raises:
NotImplementedError – MAB learning_policy property not implemented for this learning policy.
- property neighborhood_policy
Creates named tuple of the neighborhood policy based on the implementor.
- Return type:
The neighborhood policy.
- partial_fit(decisions: List[Arm] | ndarray | Series, rewards: List[int | float] | ndarray | Series, contexts: None | List[List[int | float]] | ndarray | Series | DataFrame = None) None
Updates the multi-armed bandit with the given decisions, their corresponding rewards and contexts, if any.
Validates arguments and raises exceptions in case there are violations.
- This function makes the following assumptions:
- each decision corresponds to an arm of the bandit.
- there are no None, NaN, or Infinity values in the contexts.
- Parameters:
decisions (Union[List[Arm], np.ndarray, pd.Series]) – The decisions that are made.
rewards (Union[List[Num], np.ndarray, pd.Series]) – The rewards that are received corresponding to the decisions.
contexts (Union[None, List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context under which each decision is made. Default value is None, i.e., no contexts.
- Return type:
No return.
- Raises:
TypeError – Decisions and rewards are not given as list, numpy array or pandas series.
TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.
ValueError – Length mismatch between decisions, rewards, and contexts.
ValueError – Fitting contexts data when there is no contextual policy.
ValueError – Contextual policy when fitting no contexts data.
ValueError – Rewards contain None, NaN, or Infinity.
- predict(contexts: None | List[int | float] | List[List[int | float]] | ndarray | Series | DataFrame = None) Arm | List[Arm]
Returns the “best” arm (or arms list if multiple contexts are given) based on the expected reward.
The definition of the best depends on the specified learning policy. Contextual learning policies and neighborhood policies require contexts data in training. In testing, they return the best arm given new context(s).
- Parameters:
contexts (Union[None, List[Num], List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context for the expected rewards. Default value is None. If contexts is not None for context-free bandits, the predictions returned will be a list of the same length as contexts.
- Return type:
The recommended arm or recommended arms list.
- Raises:
TypeError – Contexts is not given as None, list, numpy array, pandas series or data frames.
ValueError – Prediction with context policy requires context data.
- predict_expectations(contexts: None | List[int | float] | List[List[int | float]] | ndarray | Series | DataFrame = None) Dict[Arm, int | float] | List[Dict[Arm, int | float]]
Returns a dictionary of arms (key) to their expected rewards (value).
Contextual learning policies and neighborhood policies require contexts data for expected rewards.
- Parameters:
contexts (Union[None, List[Num], List[List[Num]], np.ndarray, pd.Series, pd.DataFrame]) – The context for the expected rewards. Default value is None. If contexts is not None for context-free bandits, the predicted expectations returned will be a list of the same length as contexts.
- Return type:
The dictionary of arms (key) to their expected rewards (value), or a list of such dictionaries.
- Raises:
TypeError – Contexts is not given as None, list, numpy array or pandas data frames.
ValueError – Prediction with context policy requires context data.
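As a brief illustration of the return format for a context-free bandit (the expectation values themselves depend on the policy and are not shown):
>>> from mabwiser.mab import MAB, LearningPolicy
>>> mab = MAB(['Arm1', 'Arm2'], LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(['Arm1', 'Arm1', 'Arm2', 'Arm1'], [20, 17, 25, 9])
>>> expectations = mab.predict_expectations()
>>> sorted(expectations.keys())
['Arm1', 'Arm2']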
- remove_arm(arm: Arm) None
Removes an arm from the list of arms.
- Parameters:
arm (Arm) – The existing arm to be removed.
- Return type:
No return.
- Raises:
ValueError – The arm does not exist.
ValueError – The arm is None.
ValueError – The arm is NaN.
ValueError – The arm is Infinity.
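A minimal sketch of removing an untrained arm (the arm names and rewards are illustrative):
>>> from mabwiser.mab import MAB, LearningPolicy
>>> mab = MAB(['Arm1', 'Arm2', 'Arm3'], LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(['Arm1', 'Arm1', 'Arm2'], [20, 17, 25])
>>> mab.remove_arm('Arm3')
>>> 'Arm3' in mab.arms
False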
- warm_start(arm_to_features: Dict[Arm, List[int | float]], distance_quantile: float) None
Warm-start untrained (cold) arms of the multi-armed bandit.
Validates arguments and raises exceptions in case there are violations.
The warm-start procedure depends on the learning and neighborhood policy. Note that for certain neighborhood policies (e.g., LSHNearest, KNearest, Radius) warm start can only be performed after the nearest neighbors have been determined in the “predict” step. Accordingly, warm start has to be executed for each context being predicted, which is computationally expensive.
- Parameters:
arm_to_features (Dict[Arm, List[Num]]) – Numeric representation for each arm.
distance_quantile (float) – Value between 0 and 1 used to determine if an item can be warm started or not using the closest item. All cold items will be warm started if 1 and none will be warm started if 0.
- Return type:
No return.
- Raises:
TypeError – Arm features are not given as a dictionary.
TypeError – Distance quantile is not given as a float.
ValueError – Distance quantile is not between 0 and 1.
ValueError – The arms in arm_to_features do not match arms.
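A minimal sketch of warm starting a cold arm with a context-free bandit; the feature vectors and the quantile value below are purely illustrative.
>>> from mabwiser.mab import MAB, LearningPolicy
>>> mab = MAB(['Arm1', 'Arm2', 'Arm3'], LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab.fit(['Arm1', 'Arm1', 'Arm2'], [20, 17, 25])
>>> arm_to_features = {'Arm1': [0.1, 0.2], 'Arm2': [0.9, 0.8], 'Arm3': [0.1, 0.3]}
>>> mab.warm_start(arm_to_features, distance_quantile=0.5)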
- class mabwiser.mab.NeighborhoodPolicy
Bases:
NamedTuple
- class Clusters(n_clusters: int | float = 2, is_minibatch: bool = False)
Bases:
NamedTuple
Clusters Neighborhood Policy.
Clusters is a k-means clustering approach that uses the observations from the closest cluster with a learning policy. Supports KMeans and MiniBatchKMeans.
- n_clusters
The number of clusters. Integer. Must be at least 2. Default value is 2.
- Type:
Num
- is_minibatch
Boolean flag to use MiniBatchKMeans or not. Default value is False.
- Type:
bool
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.Clusters(3))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
- is_minibatch: bool
Alias for field number 1
- n_clusters: int | float
Alias for field number 0
- class KNearest(k: int = 1, metric: str = 'euclidean')
Bases:
NamedTuple
KNearest Neighborhood Policy.
KNearest is a nearest neighbors approach that selects the k-nearest observations to be used with a learning policy.
- k
The number of neighbors to select. Integer value. Must be greater than zero. Default value is 1.
- Type:
int
- metric
The metric used to calculate distance. Accepts any of the metrics supported by scipy.spatial.distance.cdist. Default value is Euclidean distance.
- Type:
str
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.KNearest(2, "euclidean"))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[1, 1]
- k: int
Alias for field number 0
- metric: str
Alias for field number 1
- class LSHNearest(n_dimensions: int = 5, n_tables: int = 3, no_nhood_prob_of_arm: List | None = None)
Bases:
NamedTuple
Locality-Sensitive Hashing Approximate Nearest Neighbors Policy.
LSHNearest is a nearest neighbors approach that uses locality sensitive hashing with a simhash to select observations to be used with a learning policy.
For the simhash, contexts are projected onto a hyperplane of n_context_cols x n_dimensions and each column of the hyperplane is evaluated for its sign, giving an ordered array of binary values. This is converted to a base 10 integer used as the hash code to assign the context to a hash table. This process is repeated for a specified number of hash tables, where each has a unique, randomly-generated hyperplane. To select the neighbors for a context, the hash code is calculated for each hash table and any contexts with the same hashes are selected as the neighbors.
As with the radius or k value for other nearest neighbors algorithms, selecting the best number of dimensions and tables requires tuning. For the dimensions, a good starting point is to use the log (base 2) of the square root of the number of rows in the training data. This will give you sqrt(n_rows) possible hash codes.
The number of dimensions and number of tables have inverse effects from each other on the number of empty neighborhoods and average neighborhood size. Increasing the dimensionality decreases the number of collisions, which increases the precision of the approximate neighborhood but also potentially increases the number of empty neighborhoods. Increasing the number of hash tables increases the likelihood of capturing neighbors the other random hyperplanes miss and increases the average neighborhood size. It should be noted that the fit operation is O(2**n_dimensions).
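To make the hashing step concrete, the sketch below illustrates the simhash idea described above with plain NumPy. It is a simplified, stand-alone illustration, not the library's internal implementation; the hyperplane and contexts are arbitrary.
import numpy as np

rng = np.random.default_rng(123456)
n_context_cols, n_dimensions = 5, 3

# One randomly-generated hyperplane per hash table (a single table is shown here).
hyperplane = rng.standard_normal((n_context_cols, n_dimensions))

def simhash(context):
    # Project the context, take the sign of each column, and pack the bits into a base-10 hash code.
    signs = (context @ hyperplane) >= 0
    return int("".join("1" if s else "0" for s in signs), 2)

# Contexts that receive the same hash code fall into the same bucket and become neighbor candidates.
print(simhash(np.array([0, 1, 2, 3, 5])), simhash(np.array([0, 2, 2, 3, 5])))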
- n_dimensions
The number of dimensions to use for the hyperplane. Integer value. Must be greater than zero. Default value is 5.
- Type:
int
- n_tables
The number of hash tables. Integer value. Must be greater than zero. Default value is 3.
- Type:
int
- no_nhood_prob_of_arm
The probabilities associated with each arm. Used to select random arm if context has no neighbors. If not given, a uniform random distribution over all arms is assumed. The probabilities should sum up to 1.
- Type:
None or List
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.LSHNearest(5, 3))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
- n_dimensions: int
Alias for field number 0
- n_tables: int
Alias for field number 1
- no_nhood_prob_of_arm: List | None
Alias for field number 2
- class Radius(radius: int | float = 0.05, metric: str = 'euclidean', no_nhood_prob_of_arm: List | None = None)
Bases:
NamedTuple
Radius Neighborhood Policy.
Radius is a nearest neighborhood approach that selects the observations within a given radius to be used with a learning policy.
- radius
The maximum distance within which to select observations. Integer or Float. Must be greater than zero. Default value is 0.05.
- Type:
Num
- metric
The metric used to calculate distance. Accepts any of the metrics supported by scipy.spatial.distance.cdist. Default value is Euclidean distance.
- Type:
str
- no_nhood_prob_of_arm
The probabilities associated with each arm. Used to select random arm if context has no neighbors. If not given, a uniform random distribution over all arms is assumed. The probabilities should sum up to 1.
- Type:
None or List
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = [1, 2, 3, 4]
>>> decisions = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
>>> rewards = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
>>> contexts = [[0, 1, 2, 3, 5], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 2, 2, 3, 5], [1, 3, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 4, 3, 5], [0, 1, 2, 4, 5], [1, 2, 1, 1, 3], [0, 2, 1, 0, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.Radius(2, "euclidean"))
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[0, 1, 2, 3, 5], [1, 1, 1, 1, 1]])
[3, 1]
- metric: str
Alias for field number 1
- no_nhood_prob_of_arm: List | None
Alias for field number 2
- radius: int | float
Alias for field number 0
- class TreeBandit(tree_parameters: Dict = {})
Bases:
NamedTuple
TreeBandit Neighborhood Policy.
This policy fits a decision tree for each arm using context history. It uses the leaves of these trees to partition the context space into regions and keeps a list of rewards for each leaf. To predict, it receives a context vector and goes to the corresponding leaf at each arm’s tree and applies the given context-free MAB learning policy to predict expectations and choose an arm.
The TreeBandit neighborhood policy is compatible with the following context-free learning policies only: EpsilonGreedy, ThompsonSampling and UCB1.
The TreeBandit neighborhood policy is a modified version of the TreeHeuristic algorithm presented in: Adam N. Elmachtoub, Ryan McNellis, Sechan Oh, and Marek Petrik, “A Practical Method for Solving Contextual Bandit Problems Using Decision Trees”, UAI 2017.
- tree_parameters
Parameters of the decision tree. The keys must match the parameters of sklearn.tree.DecisionTreeRegressor. When a parameter is not given, the default parameters from sklearn.tree.DecisionTreeRegressor will be chosen. Default value is an empty dictionary.
- Type:
Dict, **kwarg
Example
>>> from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
>>> list_of_arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> contexts = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 1, 0], [3, 2, 1, 0]]
>>> mab = MAB(list_of_arms, LearningPolicy.EpsilonGreedy(epsilon=0), NeighborhoodPolicy.TreeBandit())
>>> mab.fit(decisions, rewards, contexts)
>>> mab.predict([[3, 2, 0, 1]])
'Arm2'
- tree_parameters: Dict
Alias for field number 0
simulator
This module provides a simulation utility for comparing algorithms and hyper-parameter tuning.
- class mabwiser.simulator.Simulator(bandits: List[tuple], decisions: List[Arm] | ndarray | Series, rewards: List[int | float] | ndarray | Series, contexts: None | List[List[int | float]] | ndarray | Series | DataFrame = None, scaler: callable | None = None, test_size: float = 0.3, is_ordered: bool = False, batch_size: int = 0, evaluator: callable = <function default_evaluator>, seed: int = 123456, is_quick: bool = False, log_file: str | None = None, log_format: str = '%(asctime)s %(levelname)s %(message)s')
Bases:
object
Multi-Armed Bandit Simulator.
This utility runs a simulation using historic data and a collection of multi-armed bandits from the MABWiser library, or bandits that extend the BaseMAB class in MABWiser.
It can be used to run a simple simulation with a single bandit or to compare multiple bandits for policy selection, hyper-parameter tuning, etc.
Nearest Neighbor bandits that use the default Radius and KNearest implementations from MABWiser are converted to custom versions that share distance calculations to speed up the simulation. These custom versions also track statistics about the neighborhoods that can be used in evaluation.
The results can be accessed through the arm_to_stats, bandit_to_predictions, bandit_to_confusion_matrices, and bandit_to_arm_to_stats attributes described below.
When using partial fitting, an additional confusion matrix is calculated for all predictions after all of the batches are processed.
A log of the simulation tracks the experiment progress.
- bandits
A list of tuples of the name of each bandit and the bandit object.
- Type:
list[(str, bandit)]
- decisions
The complete decision history to be used in train and test.
- Type:
array
- rewards
The complete reward history to be used in train and test.
- Type:
array
- contexts
The complete context history to be used in train and test.
- Type:
array
- scaler
A scaler object from sklearn.preprocessing.
- Type:
scaler
- test_size
The size of the test set.
- Type:
float
- is_ordered
Whether to use a chronological division for the train-test split. If false, uses sklearn’s train_test_split.
- Type:
bool
- batch_size
The size of each batch for online learning.
- Type:
int
- evaluator
The function for evaluating the bandits. Values are stored in bandit_to_arm_to_stats_avg. Must have the same signature as the default_evaluator function described below.
- Type:
callable
- is_quick
Flag to skip neighborhood statistics.
- Type:
bool
- logger
The logger object.
- Type:
Logger
- arms
The list of arms used by the bandits.
- Type:
list
- arm_to_stats_total
Descriptive statistics for the complete data set.
- Type:
dict
- arm_to_stats_train
Descriptive statistics for the training data.
- Type:
dict
- arm_to_stats_test
Descriptive statistics for the test data.
- Type:
dict
- bandit_to_arm_to_stats_avg
Descriptive statistics for the predictions made by each bandit based on means from training data.
- Type:
dict
- bandit_to_arm_to_stats_min
Descriptive statistics for the predictions made by each bandit based on minimums from training data.
- Type:
dict
- bandit_to_arm_to_stats_max
Descriptive statistics for the predictions made by each bandit based on maximums from training data.
- Type:
dict
- bandit_to_confusion_matrices
The confusion matrices for each bandit.
- Type:
dict
- bandit_to_predictions
The prediction for each item in the test set for each bandit.
- Type:
dict
- bandit_to_expectations
The arm_to_expectations for each item in the test set for each bandit. For context-free bandits, there is a single dictionary for each batch.
- Type:
dict
- bandit_to_neighborhood_size
The number of neighbors in each neighborhood for each row in the test set. Calculated when using a Radius neighborhood policy, or a custom class that inherits from it. Not calculated when is_quick is True.
- Type:
dict
- bandit_to_arm_to_stats_neighborhoods
The arm_to_stats for each neighborhood for each row in the test set. Calculated when using Radius or KNearest, or a custom class that inherits from one of them. Not calculated when is_quick is True.
- Type:
dict
- test_indices
The indices of the rows in the test set. If input was not zero-indexed, these will reflect their position in the input rather than actual index.
- Type:
list
Example
>>> from mabwiser.mab import MAB, LearningPolicy
>>> from mabwiser.simulator import Simulator
>>> arms = ['Arm1', 'Arm2']
>>> decisions = ['Arm1', 'Arm1', 'Arm2', 'Arm1']
>>> rewards = [20, 17, 25, 9]
>>> mab1 = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=123456)
>>> mab2 = MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.30), seed=123456)
>>> bandits = [('EG 25%', mab1), ('EG 30%', mab2)]
>>> offline_sim = Simulator(bandits, decisions, rewards, test_size=0.5, batch_size=0)
>>> offline_sim.run()
>>> offline_sim.bandit_to_arm_to_stats_avg['EG 30%']['Arm1']
{'count': 1, 'sum': 9, 'min': 9, 'max': 9, 'mean': 9.0, 'std': 0.0}
- get_arm_stats(decisions: ndarray, rewards: ndarray) dict
Calculates descriptive statistics for each arm in the provided data set.
- Parameters:
decisions (np.ndarray) – The decisions to filter the rewards.
rewards (np.ndarray) – The rewards to get statistics about.
- Returns:
Arm_to_stats dictionary.
Dictionary has the format {arm: {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}}
- static get_stats(rewards: ndarray) dict
Calculates descriptive statistics for the given array of rewards.
- Parameters:
rewards (np.ndarray) – Array of rewards for a single arm.
- Returns:
A dictionary of descriptive statistics.
Dictionary has the format {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}
- plot(metric: str = 'avg', is_per_arm: bool = False) None
Generates a plot of the cumulative sum of the rewards for each bandit. Simulation must be run before calling this method.
- Parameters:
metric (str) – The bandit_to_arm_to_stats to use to generate the plot. Must be ‘avg’, ‘min’, or ‘max’.
is_per_arm (bool) – Whether to plot each arm separately or use an aggregate statistic.
- Raises:
AssertionError – Descriptive statistics for predictions are missing.
TypeError – Metric must be a string.
TypeError – The per_arm flag must be a boolean.
ValueError – The metric must be one of avg, min or max.
- Return type:
None
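A brief usage sketch, assuming a simulator (e.g., offline_sim from the example above) has already been run:
>>> offline_sim.plot(metric='avg', is_per_arm=False)
>>> offline_sim.plot(metric='max', is_per_arm=True)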
- run() None
Runs the simulator.
Runs a simulation concurrently for all bandits in the bandits list.
- Return type:
None
- mabwiser.simulator.default_evaluator(arms: List[Arm], decisions: ndarray, rewards: ndarray, predictions: List[Arm], arm_to_stats: dict, stat: str, start_index: int, nn: bool = False) dict
Default evaluation function.
Calculates predicted rewards for the test batch based on predicted arms. When the predicted arm is the same as the historic decision, the historic reward is used. When the predicted arm is different, the mean, min or max reward from the training data is used. If using Radius or KNearest neighborhood policy, the statistics from the neighborhood are used instead of the entire training set.
The simulator supports custom evaluation functions, but they must have this signature to work with the simulation pipeline.
- Parameters:
arms (list) – The list of arms.
decisions (np.ndarray) – The historic decisions for the batch being evaluated.
rewards (np.ndarray) – The historic rewards for the batch being evaluated.
predictions (list) – The predictions for the batch being evaluated.
arm_to_stats (dict) – The dictionary of descriptive statistics for each arm to use in evaluation.
stat (str) – Which metric from arm_to_stats to use. Takes the values ‘min’, ‘max’, ‘mean’.
start_index (int) – The index of the first row in the batch. For offline simulations it is 0. For online simulations it is batch size * batch number. Used to select the correct index from arm_to_stats if there are separate entries for each row in the test set.
nn (bool) – Whether the results are from one of the simulator custom nearest neighbors implementations.
- Returns:
An arm_to_stats dictionary for the predictions in the batch.
Dictionary has the format {arm: {‘count’, ‘sum’, ‘min’, ‘max’, ‘mean’, ‘std’}}
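Since the simulator accepts custom evaluation functions with this same signature, the toy function below sketches what one might look like; it follows the fallback-to-training-statistics idea described above in simplified form (the nn and start_index arguments are accepted but ignored) and is not the library's implementation.
import numpy as np

def simple_evaluator(arms, decisions, rewards, predictions, arm_to_stats, stat, start_index, nn=False):
    # Use the historic reward when the predicted arm matches the historic decision;
    # otherwise fall back to the arm's training statistic (e.g., its mean reward).
    arm_to_rewards = {arm: [] for arm in arms}
    for decision, reward, prediction in zip(decisions, rewards, predictions):
        if prediction == decision:
            arm_to_rewards[prediction].append(reward)
        else:
            arm_to_rewards[prediction].append(arm_to_stats[prediction][stat])

    # Summarize the predicted rewards per arm in the usual arm_to_stats format.
    return {arm: {'count': len(vals),
                  'sum': float(np.sum(vals)),
                  'min': float(np.min(vals)) if vals else 0.0,
                  'max': float(np.max(vals)) if vals else 0.0,
                  'mean': float(np.mean(vals)) if vals else 0.0,
                  'std': float(np.std(vals)) if vals else 0.0}
            for arm, vals in arm_to_rewards.items()}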
utils
This module provides a number of constants and helper functions.
- class mabwiser.utils.Arm
Arm type is defined as integer, float, or string.
alias of Union[int, float, str]
- class mabwiser.utils.Constants
Bases:
NamedTuple
Constant values used by the modules.
- default_seed = 123456
The default random seed.
- distance_metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean']
The distance metrics supported by neighborhood policies.
- mabwiser.utils.Num
Num type is defined as integer or float.
alias of Union[int, float]
- mabwiser.utils.argmax(dictionary: Dict[Arm, int | float]) Arm
Returns the first key with the maximum value.
- mabwiser.utils.check_false(expression: bool, exception: Exception) None
Checks that given expression is false, otherwise raises the given exception.
- mabwiser.utils.check_true(expression: bool, exception: Exception) None
Checks that given expression is true, otherwise raises the given exception.
- mabwiser.utils.create_rng(seed: int) _BaseRNG
Returns an rng object.
- Parameters:
seed (int) – the seed of the rng
- Returns:
out – An rng object that implements the base rng class
- Return type:
_BaseRNG
- mabwiser.utils.reset(dictionary: Dict, value) None
Maps every key to the given value.
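A short illustration of how these helpers can be combined; the dictionary values are arbitrary.
>>> from mabwiser.utils import argmax, check_true, create_rng, reset
>>> expectations = {'Arm1': 0.2, 'Arm2': 0.7}
>>> argmax(expectations)
'Arm2'
>>> reset(expectations, 0)
>>> expectations
{'Arm1': 0, 'Arm2': 0}
>>> check_true(0 <= 0.5 <= 1, ValueError("epsilon must be between 0 and 1"))
>>> rng = create_rng(seed=123456)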