The Multi-Armed Bandit Problem

Description

The Multi-Armed Bandit Problem (MAB, or often called K or N-armed bandit problems) is a problem where a fixed set of limied resources must be allocated between competing choices in a way that maximizes their expected gain, when the underlying rewards is not known at the start of learning. This is a classic reinforcement learning problem that exemplifies the exploration-exploitation tradeoff dilema. The crucial tradeoff the algorithm faces at each trial is between “exploitation” of the arm that has the highest expected payoff and “exploration” to get more information about the expected payoffs of the other arms. The trade-off between exploration and exploitation is also faced in machine learning.

Dynamics

State Space

The state space is represented as \(X = [K]^T\) where \(K\) is the number of arms and \(T\) is the number of timesteps. Each component represents the number of times the arm has been pulled up to the current iteration.

Action space

The action space is \([K]\) representing the index of the arm selected at that time instant.

Reward

The reward is calculated via \(r(x,a)\) taken as a random sample from a specified distribution \(\mu(a)\).

Transitions

From state \(x\) having taking action \(a\) the agent transitions to a new state \(x'\) where \(x'[a]\) is incremented by one to denote the increment that the arm \(a\) has been selected an extra time.

Environment

Line

reset

Returns the environment to its original state.

step(action)

Takes an action from the agent and returns the state of the system after the next arrival. * action: the index of the selected arm

Returns:

  • state: The number of times each arm has been selected so far

  • reward: The reward drawn from the distribution specified by the given action.

  • pContinue:

  • info: Empty

render

Currently unimplemented

close

Currently unimplemented

Heuristic Agents

We currently have no heuristic algorithms implemented for this environment.