Usage Examples

The examples below show how to use different bandit policies to make decisions among multiple arms based on their expected rewards.

Conceptually, given a set of historical decisions and their corresponding rewards, MABWiser trains a model using the fit() method and then predicts the next best decision using the predict() method.

The expected reward of each arm can be retrieved using the predict_expectations() method, and online training is available using the partial_fit() method. New arms can be added to the bandit using the add_arm() method. Decisions and rewards data support lists, 1D numpy arrays, and pandas series. Contexts data supports 2D lists, 2D numpy arrays, pandas series, and pandas data frames.
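For instance, here is a minimal sketch of that workflow on toy data (the arm names and rewards below are made up for illustration; any of the supported data formats would work in place of the numpy arrays):

import numpy as np

from mabwiser.mab import MAB, LearningPolicy

# Toy historical data given as 1D numpy arrays (lists or pandas series work too)
decisions = np.array(['left', 'right', 'left', 'right', 'left'])
rewards = np.array([10, 4, 7, 2, 8])

# Train a context-free bandit and query it
mab = MAB(arms=['left', 'right'], learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1))
mab.fit(decisions=decisions, rewards=rewards)

print(mab.predict())               # next best arm
print(mab.predict_expectations())  # expected reward of each arm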

See also

Additional examples are available in the examples folder in the repo.

Context-Free MAB

from mabwiser.mab import MAB, LearningPolicy

######################################################################################
#
# MABWiser
# Scenario: A/B Testing for Website Layout Design
#
# An e-commerce website experiments with 2 different layout options for its homepage
# Each layout decision leads to generating different revenues
#
# What should the choice of layout be based on historical data?
#
######################################################################################

# Arms
options = [1, 2]

# Historical data of layout decisions and corresponding rewards
layouts = [1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 1]
revenues = [10, 17, 22, 9, 4, 0, 7, 8, 20, 9, 50, 5, 7, 12, 10]

# Arm features for warm start
arm_to_features = {1: [0, 0, 1], 2: [1, 1, 0], 3: [1, 1, 0]}

###################################
# Epsilon Greedy Learning Policy
###################################

# Epsilon Greedy learning policy with random exploration set to 15%
greedy = MAB(arms=options,
             learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.15),
             seed=123456)

# Learn from previous layout decisions and the revenues generated
greedy.fit(decisions=layouts, rewards=revenues)

# Predict the next best layout decision
prediction = greedy.predict()

# Expected revenue of each layout learned from historical data based on the epsilon greedy policy
expectations = greedy.predict_expectations()

# Results
print("Epsilon Greedy: ", prediction, " ", expectations)
assert(prediction == 2)

# Additional historical data becomes available which allows online learning
additional_layouts = [1, 2, 1, 2]
additional_revenues = [0, 12, 7, 19]

# Online updating of the model
greedy.partial_fit(additional_layouts, additional_revenues)

# Adding a new layout option
greedy.add_arm(3)

# Warm start new arm
greedy.warm_start(arm_to_features, distance_quantile=0.5)

Parametric Contextual MAB

import pandas as pd
from sklearn.preprocessing import StandardScaler

from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

######################################################################################
#
# MABWiser
# Scenario: Advertisement Optimization
#
# An e-commerce website needs to solve the problem of which ad to display to online users
# Each advertisement decision leads to generating different revenues
#
# What should the choice of advertisement be given the context of an online user
# based on customer data such as age, click rate, subscriber?
#
######################################################################################

# Arms
ads = [1, 2, 3, 4, 5]

# Historical data of ad decisions with corresponding revenues and context information
train_df = pd.DataFrame({'ad': [1, 1, 1, 2, 4, 5, 3, 3, 2, 1, 4, 5, 3, 2, 5],
                         'revenues': [10, 17, 22, 9, 4, 20, 7, 8, 20, 9, 50, 5, 7, 12, 10],
                         'age': [22, 27, 39, 48, 21, 20, 19, 37, 52, 26, 18, 42, 55, 57, 38],
                         'click_rate': [0.2, 0.6, 0.99, 0.68, 0.15, 0.23, 0.75, 0.17,
                                        0.33, 0.65, 0.56, 0.22, 0.19, 0.11, 0.83],
                         'subscriber': [1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]})

# Arm features for warm start
arm_to_features = {1: [0, 1, 1], 2: [0, 0.5, 0.5], 3: [1, 1, 0.5], 4: [0.2, 1, 0], 5: [0, 1, 0.1], 6: [0, 0.5, 0.5]}

# Test data for a new prediction
test_df = pd.DataFrame({'age': [37, 52], 'click_rate': [0.5, 0.6], 'subscriber': [0, 1]})
test_df_revenue = pd.Series([7, 13])

# Scale the training and test data
scaler = StandardScaler()
train = scaler.fit_transform(train_df[['age', 'click_rate', 'subscriber']])
test = scaler.transform(test_df)

##################################################
# Linear Upper Confidence Bound Learning Policy
##################################################

# LinUCB learning policy with alpha 1.25 and l2_lambda 1
linucb = MAB(arms=ads,
             learning_policy=LearningPolicy.LinUCB(alpha=1.25, l2_lambda=1))

# Learn from previous ads shown and revenues generated
linucb.fit(decisions=train_df['ad'], rewards=train_df['revenues'], contexts=train)

# Predict the next best ad to show
prediction = linucb.predict(test)

# Expectation of each ad based on learning from past ad revenues
expectations = linucb.predict_expectations(test)

# Results
print("LinUCB: ", prediction, " ", expectations)
assert(prediction == [5, 2])

# Online update of model
linucb.partial_fit(decisions=prediction, rewards=test_df_revenue, contexts=test)

# Update the model with a new arm
linucb.add_arm(6)

# Warm start new arm
linucb.warm_start(arm_to_features, distance_quantile=0.75)

Non-Parametric Contextual MAB

import pandas as pd
from sklearn.preprocessing import StandardScaler

from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

######################################################################################
#
# MABWiser
# Scenario: Advertisement Optimization
#
# An e-commerce website needs to solve the problem of which ad to display to online users
# Each advertisement decision leads to generating different revenues
#
# What should the choice of advertisement be given the context of an online user
# based on customer data such as age, click rate, subscriber?
#
######################################################################################

# Arms
ads = [1, 2, 3, 4, 5]

# Historical data of ad decisions with corresponding revenues and context information
train_df = pd.DataFrame({'ad': [1, 1, 1, 2, 4, 5, 3, 3, 2, 1, 4, 5, 3, 2, 5],
                         'revenues': [10, 17, 22, 9, 4, 20, 7, 8, 20, 9, 50, 5, 7, 12, 10],
                         'age': [22, 27, 39, 48, 21, 20, 19, 37, 52, 26, 18, 42, 55, 57, 38],
                         'click_rate': [0.2, 0.6, 0.99, 0.68, 0.15, 0.23, 0.75, 0.17,
                                        0.33, 0.65, 0.56, 0.22, 0.19, 0.11, 0.83],
                         'subscriber': [1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]})

# Arm features for warm start
arm_to_features = {1: [0, 1, 1], 2: [0, 0.5, 0.5], 3: [1, 1, 0.5], 4: [0.2, 1, 0], 5: [0, 1, 0.1], 6: [0, 0.5, 0.5]}

# Test data for a new prediction
test_df = pd.DataFrame({'age': [37, 52], 'click_rate': [0.5, 0.6], 'subscriber': [0, 1]})
test_df_revenue = pd.Series([7, 13])

# Scale the training and test data
scaler = StandardScaler()
train = scaler.fit_transform(train_df[['age', 'click_rate', 'subscriber']])
test = scaler.transform(test_df)

########################################################
# Radius Neighborhood Policy with UCB1 Learning Policy
########################################################

# Radius neighborhood policy with a radius of 5 and UCB1 learning with alpha 1.25
radius = MAB(arms=ads,
             learning_policy=LearningPolicy.UCB1(alpha=1.25),
             neighborhood_policy=NeighborhoodPolicy.Radius(radius=5))

# Learn from previous ads shown and revenues generated
radius.fit(decisions=train_df['ad'], rewards=train_df['revenues'], contexts=train)

# Predict the next best ad to show
prediction = radius.predict(test)

# Expectation of each ad based on learning from past ad revenues
expectations = radius.predict_expectations(test)

# Results
print("Radius: ", prediction, " ", expectations)
assert(prediction == [4, 4])

# Online update of model
radius.partial_fit(decisions=prediction, rewards=test_df_revenue, contexts=test)

# Update the model with a new arm
radius.add_arm(6)

# Warm start new arm
radius.warm_start(arm_to_features, distance_quantile=0.75)

Parallel MAB

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

######################################################################################
#
# MABWiser
# Scenario: Playlist recommendation for music streaming service
#
# An online music streaming service wants to recommend a playlist to a user
# based on a user's listening history and user features. There is a large amount
# of data available to train this recommender model, which means the parallel
# functionality in MABWiser can be useful.
#
#
######################################################################################

# Seed
seed = 111

# Arms
arms = list(np.arange(100))

# Historical data on user contexts and rewards (i.e. whether a user clicked
# on the recommended playlist or not)
contexts, rewards = make_classification(n_samples=100000, n_features=200,
                                        n_informative=20, weights=[0.01], scale=None)

# Independently simulate the recommended playlist for each event
decisions = np.random.choice(arms, size=100000)

# Split data into train and test data sets
contexts_train, contexts_test = train_test_split(contexts, test_size=0.3, random_state=seed)
rewards_train, rewards_test = train_test_split(rewards, test_size=0.3, random_state=seed)
decisions_train, decisions_test = train_test_split(decisions, test_size=0.3, random_state=seed)

#############################################################################
# Parallel Radius Neighborhood Policy with UCB1 Learning Policy using 8 Cores
#############################################################################

# Radius neighborhood policy with a radius of 5 and UCB1 learning with alpha 1.25
radius = MAB(arms=arms,
             learning_policy=LearningPolicy.UCB1(alpha=1.25),
             neighborhood_policy=NeighborhoodPolicy.Radius(radius=5),
             n_jobs=8)

# Parallel Training
# Learn from playlists shown and observed click rewards for each arm
# In practice, the contexts could be scaled as in the previous examples; this step is skipped in this toy example
radius.fit(decisions=decisions_train, rewards=rewards_train, contexts=contexts_train)

# Parallel Testing
# Predict the next best playlist to recommend
prediction = radius.predict(contexts_test)

# Results
print("radius: ", prediction[:10])

Simulator

import random
from sklearn.preprocessing import StandardScaler
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
from mabwiser.simulator import Simulator

######################################################################################
#
# MABWiser
# Scenario: Hyper-Parameter Tuning using the built-in Simulator capability
#
######################################################################################

# Data
size = 1000
decisions = [random.randint(0, 2) for _ in range(size)]
rewards = [random.randint(0, 1000) for _ in range(size)]
contexts = [[random.random() for _ in range(50)] for _ in range(size)]

# Bandits to simulate
n_jobs = 2
hyper_parameter_tuning = []
for radius in range(6, 10):
    hyper_parameter_tuning.append(('Radius'+str(radius),
                                   MAB([0, 1, 2], LearningPolicy.UCB1(1),
                                       NeighborhoodPolicy.Radius(radius),
                                       n_jobs=n_jobs)))

# Simulator with the given bandits and data
# The contexts are scaled with a standard scaler
# Test split size set to 0.5
# The split is not order dependent, i.e., random split
# Online training with batch size 10, i.e., bandits will re-train at each batch
# Offline training can be run with batch_size 0, i.e., no re-training during test phase
sim = Simulator(hyper_parameter_tuning, decisions, rewards, contexts,
                scaler=StandardScaler(), test_size=0.5, is_ordered=False, batch_size=10, seed=123456)

# Run the simulator
sim.run()

# Save the results with a prefix
sim.save_results("my_results_")

# You can probe the fields of the simulator for other statistics
for mab_name, mab in sim.bandits:
    print(mab_name + "\n")

    # Since the simulation is online, print the 'total' stats
    print('Worst Case Scenario:', sim.bandit_to_arm_to_stats_min[mab_name]['total'])
    print('Average Case Scenario:', sim.bandit_to_arm_to_stats_avg[mab_name]['total'])
    print('Best Case Scenario:', sim.bandit_to_arm_to_stats_max[mab_name]['total'], "\n\n")

# Plot the average case results per arm for each bandit
sim.plot(metric='avg', is_per_arm=True)
