.. _examples:

Usage Examples
==============

The examples below show how to use different bandit policies to make decisions among multiple arms based on their expected rewards. Conceptually, given a set of historical decisions and their corresponding rewards, the high-level idea behind MABWiser is to train a model using the ``fit()`` method and then to make predictions about the next best decision using the ``predict()`` method. It is possible to retrieve the expected reward of each arm using the ``predict_expectations()`` method, and online training is available using the ``partial_fit()`` method. New arms can be added to the bandits using the ``add_arm()`` method. Decisions and rewards data support lists, 1D numpy arrays, and pandas series. Contexts data supports 2D lists, 2D numpy arrays, pandas series, and data frames.

.. seealso:: Additional examples are available in the `examples folder`_ in the repo.

Context-Free MAB
----------------

.. code-block:: python

    from mabwiser.mab import MAB, LearningPolicy

    ######################################################################################
    #
    # MABWiser
    # Scenario: A/B Testing for Website Layout Design
    #
    # An e-commerce website experiments with 2 different layout options for its homepage.
    # Each layout decision leads to generating different revenues.
    #
    # What should the choice of layout be based on historical data?
    #
    ######################################################################################

    # Arms
    options = [1, 2]

    # Historical data of layout decisions and corresponding rewards
    layouts = [1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 1]
    revenues = [10, 17, 22, 9, 4, 0, 7, 8, 20, 9, 50, 5, 7, 12, 10]

    # Arm features for warm start
    arm_to_features = {1: [0, 0, 1], 2: [1, 1, 0], 3: [1, 1, 0]}

    ###################################
    # Epsilon Greedy Learning Policy
    ###################################

    # Epsilon Greedy learning policy with random exploration set to 15%
    greedy = MAB(arms=options,
                 learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.15),
                 seed=123456)

    # Learn from previous layout decisions and the revenues they generated
    greedy.fit(decisions=layouts, rewards=revenues)

    # Predict the next best layout decision
    prediction = greedy.predict()

    # Expected revenue of each layout learned from historical data under the epsilon greedy policy
    expectations = greedy.predict_expectations()

    # Results
    print("Epsilon Greedy: ", prediction, " ", expectations)
    assert prediction == 2

    # Additional historical data becomes available, which allows online learning
    additional_layouts = [1, 2, 1, 2]
    additional_revenues = [0, 12, 7, 19]

    # Online update of the model
    greedy.partial_fit(additional_layouts, additional_revenues)

    # Add a new layout option
    greedy.add_arm(3)

    # Warm start the new arm
    greedy.warm_start(arm_to_features, distance_quantile=0.5)
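The learning policy is a drop-in argument to the ``MAB`` constructor, so the same scenario can be rerun with a different context-free policy. Below is a minimal sketch using Thompson Sampling instead of Epsilon Greedy. It assumes the ``options``, ``layouts``, and ``revenues`` variables from the example above; the ``binarize`` helper and its threshold of 10 are illustrative choices, since Thompson Sampling needs rewards mapped to binary outcomes.

.. code-block:: python

    from mabwiser.mab import MAB, LearningPolicy

    # Map each observed revenue to a 0/1 success signal
    # The threshold of 10 is an arbitrary choice for illustration
    def binarize(decision, reward):
        return 1 if reward > 10 else 0

    # Thompson Sampling learning policy over the same arms
    ts = MAB(arms=options,
             learning_policy=LearningPolicy.ThompsonSampling(binarizer=binarize),
             seed=123456)

    # Learn from the same historical layout decisions and revenues
    ts.fit(decisions=layouts, rewards=revenues)

    # Predict the next best layout decision under Thompson Sampling
    print("Thompson Sampling: ", ts.predict())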
Parametric Contextual MAB
-------------------------

.. code-block:: python

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    from mabwiser.mab import MAB, LearningPolicy

    ######################################################################################
    #
    # MABWiser
    # Scenario: Advertisement Optimization
    #
    # An e-commerce website needs to solve the problem of which ad to display to online users.
    # Each advertisement decision leads to generating different revenues.
    #
    # What should the choice of advertisement be given the context of an online user
    # based on customer data such as age, click rate, and subscriber status?
    #
    ######################################################################################

    # Arms
    ads = [1, 2, 3, 4, 5]

    # Historical data of ad decisions with corresponding revenues and context information
    train_df = pd.DataFrame({'ad': [1, 1, 1, 2, 4, 5, 3, 3, 2, 1, 4, 5, 3, 2, 5],
                             'revenues': [10, 17, 22, 9, 4, 20, 7, 8, 20, 9, 50, 5, 7, 12, 10],
                             'age': [22, 27, 39, 48, 21, 20, 19, 37, 52, 26, 18, 42, 55, 57, 38],
                             'click_rate': [0.2, 0.6, 0.99, 0.68, 0.15, 0.23, 0.75, 0.17, 0.33,
                                            0.65, 0.56, 0.22, 0.19, 0.11, 0.83],
                             'subscriber': [1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]})

    # Arm features for warm start
    arm_to_features = {1: [0, 1, 1], 2: [0, 0.5, 0.5], 3: [1, 1, 0.5],
                       4: [0.2, 1, 0], 5: [0, 1, 0.1], 6: [0, 0.5, 0.5]}

    # Test data for new predictions
    test_df = pd.DataFrame({'age': [37, 52], 'click_rate': [0.5, 0.6], 'subscriber': [0, 1]})
    test_df_revenue = pd.Series([7, 13])

    # Scale the training and test data
    scaler = StandardScaler()
    train = scaler.fit_transform(train_df[['age', 'click_rate', 'subscriber']])
    test = scaler.transform(test_df)

    ##################################################
    # Linear Upper Confidence Bound Learning Policy
    ##################################################

    # LinUCB learning policy with alpha 1.25 and l2_lambda 1
    linucb = MAB(arms=ads,
                 learning_policy=LearningPolicy.LinUCB(alpha=1.25, l2_lambda=1))

    # Learn from previous ads shown and the revenues they generated
    linucb.fit(decisions=train_df['ad'], rewards=train_df['revenues'], contexts=train)

    # Predict the next best ad to show
    prediction = linucb.predict(test)

    # Expectation of each ad based on learning from past ad revenues
    expectations = linucb.predict_expectations(test)

    # Results
    print("LinUCB: ", prediction, " ", expectations)
    assert prediction == [5, 2]

    # Online update of the model
    linucb.partial_fit(decisions=prediction, rewards=test_df_revenue, contexts=test)

    # Update the model with the new arm
    linucb.add_arm(6)

    # Warm start the new arm
    linucb.warm_start(arm_to_features, distance_quantile=0.75)
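Other parametric policies follow the same pattern, with only the ``learning_policy`` argument changing. As a minimal sketch, assuming the ``ads``, ``train_df``, ``train``, and ``test`` variables from the example above, LinUCB can be swapped for Linear Thompson Sampling (the alpha value here is an arbitrary illustration):

.. code-block:: python

    from mabwiser.mab import MAB, LearningPolicy

    # Linear Thompson Sampling over the same ads and scaled contexts
    lints = MAB(arms=ads, learning_policy=LearningPolicy.LinTS(alpha=1.5))

    # Learn from the same historical ad decisions, revenues, and contexts
    lints.fit(decisions=train_df['ad'], rewards=train_df['revenues'], contexts=train)

    # Predict the next best ad for the two test users
    print("LinTS: ", lints.predict(test))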
Non-Parametric Contextual MAB
-----------------------------

.. code-block:: python

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

    ######################################################################################
    #
    # MABWiser
    # Scenario: Advertisement Optimization
    #
    # An e-commerce website needs to solve the problem of which ad to display to online users.
    # Each advertisement decision leads to generating different revenues.
    #
    # What should the choice of advertisement be given the context of an online user
    # based on customer data such as age, click rate, and subscriber status?
    #
    ######################################################################################

    # Arms
    ads = [1, 2, 3, 4, 5]

    # Historical data of ad decisions with corresponding revenues and context information
    train_df = pd.DataFrame({'ad': [1, 1, 1, 2, 4, 5, 3, 3, 2, 1, 4, 5, 3, 2, 5],
                             'revenues': [10, 17, 22, 9, 4, 20, 7, 8, 20, 9, 50, 5, 7, 12, 10],
                             'age': [22, 27, 39, 48, 21, 20, 19, 37, 52, 26, 18, 42, 55, 57, 38],
                             'click_rate': [0.2, 0.6, 0.99, 0.68, 0.15, 0.23, 0.75, 0.17, 0.33,
                                            0.65, 0.56, 0.22, 0.19, 0.11, 0.83],
                             'subscriber': [1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]})

    # Arm features for warm start
    arm_to_features = {1: [0, 1, 1], 2: [0, 0.5, 0.5], 3: [1, 1, 0.5],
                       4: [0.2, 1, 0], 5: [0, 1, 0.1], 6: [0, 0.5, 0.5]}

    # Test data for new predictions
    test_df = pd.DataFrame({'age': [37, 52], 'click_rate': [0.5, 0.6], 'subscriber': [0, 1]})
    test_df_revenue = pd.Series([7, 13])

    # Scale the training and test data
    scaler = StandardScaler()
    train = scaler.fit_transform(train_df[['age', 'click_rate', 'subscriber']])
    test = scaler.transform(test_df)

    ########################################################
    # Radius Neighborhood Policy with UCB1 Learning Policy
    ########################################################

    # Radius contextual policy with a radius of 5 and UCB1 learning with alpha 1.25
    radius = MAB(arms=ads,
                 learning_policy=LearningPolicy.UCB1(alpha=1.25),
                 neighborhood_policy=NeighborhoodPolicy.Radius(radius=5))

    # Learn from previous ads shown and the revenues they generated
    radius.fit(decisions=train_df['ad'], rewards=train_df['revenues'], contexts=train)

    # Predict the next best ad to show
    prediction = radius.predict(test)

    # Expectation of each ad based on learning from past ad revenues
    expectations = radius.predict_expectations(test)

    # Results
    print("Radius: ", prediction, " ", expectations)
    assert prediction == [4, 4]

    # Online update of the model
    radius.partial_fit(decisions=prediction, rewards=test_df_revenue, contexts=test)

    # Update the model with the new arm
    radius.add_arm(6)

    # Warm start the new arm
    radius.warm_start(arm_to_features, distance_quantile=0.75)
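Neighborhood policies are interchangeable in the same way. Below is a minimal sketch, assuming the ``ads``, ``train_df``, ``train``, and ``test`` variables from the example above, that replaces the fixed radius with a k-nearest neighborhood (the choice of ``k=4`` is an arbitrary illustration):

.. code-block:: python

    from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

    # KNearest contextual policy with 4 neighbors and UCB1 learning within each neighborhood
    knearest = MAB(arms=ads,
                   learning_policy=LearningPolicy.UCB1(alpha=1.25),
                   neighborhood_policy=NeighborhoodPolicy.KNearest(k=4))

    # Learn from the same historical ad decisions, revenues, and contexts
    knearest.fit(decisions=train_df['ad'], rewards=train_df['revenues'], contexts=train)

    # Predict the next best ad for the two test users
    print("KNearest: ", knearest.predict(test))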
Parallel MAB
------------

.. code-block:: python

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

    ######################################################################################
    #
    # MABWiser
    # Scenario: Playlist recommendation for music streaming service
    #
    # An online music streaming service wants to recommend a playlist to a user
    # based on the user's listening history and user features. There is a large amount
    # of data available to train this recommender model, which means the parallel
    # functionality in MABWiser can be useful.
    #
    ######################################################################################

    # Seed
    seed = 111

    # Arms
    arms = list(np.arange(100))

    # Historical data on user contexts and rewards (i.e., whether a user clicked
    # on the recommended playlist or not)
    contexts, rewards = make_classification(n_samples=100000, n_features=200,
                                            n_informative=20, weights=[0.01], scale=None)

    # Independently simulate the recommended playlist for each event
    decisions = np.random.choice(arms, size=100000)

    # Split data into train and test data sets
    contexts_train, contexts_test = train_test_split(contexts, test_size=0.3, random_state=seed)
    rewards_train, rewards_test = train_test_split(rewards, test_size=0.3, random_state=seed)
    decisions_train, decisions_test = train_test_split(decisions, test_size=0.3, random_state=seed)

    ##############################################################################
    # Parallel Radius Neighborhood Policy with UCB1 Learning Policy using 8 Cores
    ##############################################################################

    # Radius contextual policy with a radius of 5 and UCB1 learning with alpha 1.25
    radius = MAB(arms=arms,
                 learning_policy=LearningPolicy.UCB1(alpha=1.25),
                 neighborhood_policy=NeighborhoodPolicy.Radius(radius=5),
                 n_jobs=8)

    # Parallel Training
    # Learn from playlists shown and the observed click rewards for each arm
    # In reality, we would also scale the data -- this step is skipped in this toy example
    radius.fit(decisions=decisions_train, rewards=rewards_train, contexts=contexts_train)

    # Parallel Testing
    # Predict the next best playlist to recommend
    prediction = radius.predict(contexts_test)

    # Results
    print("radius: ", prediction[:10])

Simulator
---------

.. code-block:: python

    import random
    from sklearn.preprocessing import StandardScaler

    from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
    from mabwiser.simulator import Simulator

    ######################################################################################
    #
    # MABWiser
    # Scenario: Hyper-Parameter Tuning using the built-in Simulator capability
    #
    ######################################################################################

    # Data
    size = 1000
    decisions = [random.randint(0, 2) for _ in range(size)]
    rewards = [random.randint(0, 1000) for _ in range(size)]
    contexts = [[random.random() for _ in range(50)] for _ in range(size)]

    # Bandits to simulate
    n_jobs = 2
    hyper_parameter_tuning = []
    for radius in range(6, 10):
        hyper_parameter_tuning.append(('Radius' + str(radius),
                                       MAB([0, 1, 2],
                                           LearningPolicy.UCB1(1),
                                           NeighborhoodPolicy.Radius(radius),
                                           n_jobs=n_jobs)))

    # Simulator with the given bandits and data
    # The simulator uses a standard scaler
    # Test split size is set to 0.5
    # The split is not order dependent, i.e., random split
    # Online training with batch size 10, i.e., bandits re-train at each batch
    # Offline training can be run with batch_size 0, i.e., no re-training during the test phase
    sim = Simulator(hyper_parameter_tuning, decisions, rewards, contexts,
                    scaler=StandardScaler(), test_size=0.5, is_ordered=False,
                    batch_size=10, seed=123456)

    # Run the simulator
    sim.run()

    # Save the results with a prefix
    sim.save_results("my_results_")

    # You can probe the fields of the simulator for other statistics
    for mab_name, mab in sim.bandits:
        print(mab_name + "\n")

        # Since the simulation is online, print the 'total' stats
        print('Worst Case Scenario:', sim.bandit_to_arm_to_stats_min[mab_name]['total'])
        print('Average Case Scenario:', sim.bandit_to_arm_to_stats_avg[mab_name]['total'])
        print('Best Case Scenario:', sim.bandit_to_arm_to_stats_max[mab_name]['total'], "\n\n")

    # Plot the average case results for every arm of each bandit
    sim.plot(metric='avg', is_per_arm=True)
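Since the scenario is hyper-parameter tuning, the same statistics can drive an automated selection of the winning configuration. Below is a minimal sketch, assuming the simulation above has been run and that each ``'total'`` entry is a dictionary exposing a ``mean`` field; both the field name and the ``best_name`` variable are illustrative assumptions rather than guarantees about the stats layout.

.. code-block:: python

    # Pick the bandit whose simulated average total reward is highest
    # Assumes each 'total' entry is a dict with a 'mean' field (an assumption)
    best_name = max(sim.bandit_to_arm_to_stats_avg,
                    key=lambda name: sim.bandit_to_arm_to_stats_avg[name]['total']['mean'])
    print("Best configuration:", best_name)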
.. seealso:: Additional examples are available in the `examples folder`_ in the repo.

.. _examples folder: https://github.com/fidelity/mabwiser/tree/master/examples