Discrete MB Agent

class or_suite.agents.rl.discrete_mb.DiscreteMB(action_space, state_space, epLen, scaling, alpha, flag)[source]

Uniform model-based algorithm implemented for MultiDiscrete environments and actions, using the metric induced by the l_inf norm

epLen

(int) number of steps per episode

scaling

(float) scaling parameter for confidence intervals

action_space

(MultiDiscrete) the action space

state_space

(MultiDiscrete) the state space

action_size

(list) representing the size of the action space

state_size

(list) representing the size of the state space

alpha

(float) parameter for prior on transition kernel

flag

(bool) for whether to do full step updates or not

matrix_dim

(tuple) a concatenation of epLen, state_size, and action_size used to create the estimate arrays of the appropriate size

qVals

(list) The Q-value estimates for each step, state, action tuple

num_visits

(list) The number of times that each step, state, action tuple has been visited

vVals

(list) The value function values for every step, state pair

rEst

(list) Estimates of the reward for a step, state, action tuple

pEst

(list) Estimates of the number of times that each step, state, action, new_state tuple has been observed
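
To make the attribute descriptions above concrete, here is a minimal sketch of how the array shapes could be derived from epLen, state_size, and action_size, and of one common way a prior parameter such as alpha is used (additive pseudo-counts on transitions). The helper names estimate_shapes and smoothed_transition_probs are hypothetical, and the shapes given for vVals and pEst are assumptions inferred from the descriptions, not taken from the library source.

```python
from math import prod


def estimate_shapes(ep_len, state_size, action_size):
    """Return assumed shapes for the estimate arrays (hypothetical helper)."""
    # matrix_dim concatenates the horizon with the state and action sizes.
    matrix_dim = (ep_len, *state_size, *action_size)
    return {
        "matrix_dim": matrix_dim,
        "qVals": matrix_dim,
        "num_visits": matrix_dim,
        "rEst": matrix_dim,
        # Assumption: one value per (step, state) pair.
        "vVals": (ep_len, *state_size),
        # Assumption: one count per (step, state, action, new_state) tuple.
        "pEst": (*matrix_dim, *state_size),
    }


def smoothed_transition_probs(counts, alpha):
    """Turn next-state counts into probabilities using alpha pseudo-counts.

    This is one standard reading of "a prior on the transition kernel";
    the library may apply alpha differently.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


shapes = estimate_shapes(ep_len=5, state_size=[4, 3], action_size=[2])
num_q_entries = prod(shapes["qVals"])
probs = smoothed_transition_probs([3, 1, 0], alpha=1.0)
```

With a positive alpha, next states that have never been observed still receive a small non-zero probability, which keeps the normalisation well defined before any data arrives.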

__init__(action_space, state_space, epLen, scaling, alpha, flag)[source]

Initialize self. See help(type(self)) for accurate signature.

pick_action(state, step)[source]

Select action according to a greedy policy

Parameters
  • state – (int) current state

  • step – (int) timestep within episode

Returns

action

Return type

list
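
A greedy policy over a MultiDiscrete action space can be sketched as follows. This is an illustration only: the function greedy_action and the dictionary layout of q_vals are hypothetical, the state is treated as a sequence of component indices, and ties are broken by taking the first maximiser, which may differ from the library's behaviour.

```python
from itertools import product


def greedy_action(q_vals, state, step, action_size):
    """Pick the action with the largest Q-value at (step, state).

    q_vals is assumed to map (step, *state, *action) tuples to floats.
    """
    # Enumerate every joint action in the MultiDiscrete action space.
    candidates = product(*(range(n) for n in action_size))
    # max() returns the first maximiser, so ties go to the earliest action.
    best = max(candidates, key=lambda a: q_vals[(step, *state, *a)])
    return list(best)


# Toy example: one state component with value 1, two action components.
action_size = [2, 3]
q_vals = {(0, 1, a0, a1): 0.0 for a0 in range(2) for a1 in range(3)}
q_vals[(0, 1, 1, 2)] = 5.0
chosen = greedy_action(q_vals, state=(1,), step=0, action_size=action_size)
```

Returning a list matches the documented return type, with one entry per component of the action space.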

reset()[source]

Resets the agent by overwriting all of the estimates back to initial values

update_obs(obs, action, reward, newObs, timestep, info)[source]

Add observation to records

Parameters
  • obs – (list) The current state

  • action – (list) The action taken

  • reward – (int) The calculated reward

  • newObs – (list) The next observed state

  • timestep – (int) The current timestep
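
As a rough illustration of what "adding an observation to records" can involve, the sketch below increments a visit count, updates a running mean of the reward, and increments a transition count. The Records class is hypothetical and uses dictionaries for brevity; the library's actual storage and update rule may differ.

```python
from collections import defaultdict


class Records:
    """Hypothetical stand-in for the agent's num_visits, rEst, and pEst."""

    def __init__(self):
        self.num_visits = defaultdict(int)
        self.r_est = defaultdict(float)
        self.p_est = defaultdict(int)

    def update_obs(self, obs, action, reward, new_obs, timestep):
        key = (timestep, *obs, *action)
        self.num_visits[key] += 1
        t = self.num_visits[key]
        # Incremental mean: equivalent to ((t - 1) * old + reward) / t.
        self.r_est[key] += (reward - self.r_est[key]) / t
        # Count the observed transition into new_obs.
        self.p_est[(*key, *new_obs)] += 1


records = Records()
records.update_obs(obs=[0], action=[1], reward=1, new_obs=[2], timestep=0)
records.update_obs(obs=[0], action=[1], reward=3, new_obs=[2], timestep=0)
```

After these two calls the reward estimate for the visited tuple is the average of the two rewards, and the transition into state 2 has been counted twice.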

update_parameters(param)[source]

Update the scaling parameter.

Parameters
  • param – (int) The new scaling value to use

update_policy(k)[source]

Update internal policy based upon records
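
For readers unfamiliar with model-based updates, one typical way to rebuild a policy from records is optimistic backward induction over the estimated model. The sketch below is a generic version of that technique, not the library's code: the function name, the nested-list layout, the bonus of scaling / sqrt(visits), and the clipping at epLen (which assumes per-step rewards in [0, 1]) are all assumptions, and states and actions are single integers for brevity.

```python
import math


def backward_induction(r_est, p_counts, num_visits, ep_len, scaling):
    """Optimistic value iteration on a small tabular model (hypothetical).

    r_est[h][s][a]        -> estimated mean reward
    p_counts[h][s][a][s2] -> transition pseudo-counts with a positive sum
    num_visits[h][s][a]   -> visit counts
    """
    n_states = len(r_est[0])
    n_actions = len(r_est[0][0])
    # v_vals has one extra row of zeros for the step after the horizon.
    v_vals = [[0.0] * n_states for _ in range(ep_len + 1)]
    q_vals = [[[0.0] * n_actions for _ in range(n_states)]
              for _ in range(ep_len)]
    for h in reversed(range(ep_len)):
        for s in range(n_states):
            for a in range(n_actions):
                counts = p_counts[h][s][a]
                # Expected next-step value under the normalised counts.
                next_value = sum(
                    c * v for c, v in zip(counts, v_vals[h + 1])
                ) / sum(counts)
                # Exploration bonus shrinks as the tuple is visited more.
                bonus = scaling / math.sqrt(max(1, num_visits[h][s][a]))
                q_vals[h][s][a] = min(
                    float(ep_len), r_est[h][s][a] + next_value + bonus
                )
            v_vals[h][s] = max(q_vals[h][s])
    return q_vals, v_vals


# Toy model: one state, two actions, horizon 2, no exploration bonus.
r_est = [[[0.0, 1.0]], [[0.0, 1.0]]]
p_counts = [[[[1.0], [1.0]]], [[[1.0], [1.0]]]]
num_visits = [[[1, 1]], [[1, 1]]]
q_vals, v_vals = backward_induction(r_est, p_counts, num_visits, 2, 0.0)
```

A greedy policy with respect to the resulting Q-values then prefers the second action at every step of this toy model.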