Discrete MB Agent

class or_suite.agents.rl.discrete_mb.DiscreteMB(action_space, state_space, epLen, scaling, alpha, flag)[source]

Uniform model-based algorithm implemented for MultiDiscrete environments and actions, using the metric induced by the l_inf norm

epLen: (int) number of steps per episode

scaling: (float) scaling parameter for confidence intervals

action_space: (MultiDiscrete) the action space

state_space: (MultiDiscrete) the state space

action_size: (list) representing the size of the action sapce

state_size: (list) representing the size of the state sapce

alpha: (float) parameter for prior on transition kernel

flag: (bool) for whether to do full step updates or not

matrix_dim: (tuple) a concatenation of epLen, state_size, and action_size used to create the estimate arrays of the appropriate size

qVals: (list) The Q-value estimates for each episode, state, action tuple

num_visits: (list) The number of times that each episode, state, action tuple has been visited

vVals: (list) The value function values for every step, state pair

rEst: (list) Estimates of the reward for a step, state, action tuple

pEst: (list) Estimates of the number of times that each step, state, action, new_state tuple is considered

__init__(action_space, state_space, epLen, scaling, alpha, flag)[source]: Initialize the agent with the parameters described above.

pick_action(state, step)[source]

Select action according to a greedy policy

Parameters

state – (int) The current state
step – (int) The timestep within the episode

Returns

action

Return type

list
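
A greedy choice over a Q-value array can be sketched as follows. The helper name greedy_action is hypothetical, the state is assumed to be passed as a sequence of indices, and ties are broken by taking the first maximizer, which may not match the library's tie-breaking.

```python
import numpy as np

def greedy_action(qVals, state, step):
    """Return the action (a list of indices) with the largest Q-value."""
    # Index by step and state; the remaining axes are the action axes.
    q_slice = qVals[(step, *state)]
    # argmax works on the flattened slice; unravel_index maps the flat
    # position back to one index per action axis.
    flat_best = np.argmax(q_slice)
    return [int(i) for i in np.unravel_index(flat_best, q_slice.shape)]

# Hypothetical table: epLen = 2, one state axis of size 3, two action axes of size 2.
qVals = np.zeros((2, 3, 2, 2))
qVals[1, 2, 1, 0] = 5.0
action = greedy_action(qVals, state=[2], step=1)
```

Here action is the list [1, 0], the index of the only nonzero entry for that step and state.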

reset()[source]: Resets the agent by overwriting all of the estimates back to initial values

update_obs(obs, action, reward, newObs, timestep, info)[source]

Add observation to records

Parameters

obs – (list) The current state
action – (list) The action taken
reward – (int) The calculated reward
newObs – (list) The next observed state
timestep – (int) The current timestep
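
One plausible way to record an observation in count-based estimates is sketched below. The function name record, the array sizes, and the running-mean reward update are assumptions for illustration rather than the library's confirmed behavior.

```python
import numpy as np

# Hypothetical sizes: 2 steps, 3 states, 2 actions.
epLen, n_states, n_actions = 2, 3, 2
num_visits = np.zeros((epLen, n_states, n_actions))
rEst = np.zeros((epLen, n_states, n_actions))
pEst = np.zeros((epLen, n_states, n_actions, n_states))

def record(obs, action, reward, newObs, timestep):
    """Fold one transition into the visit counts and estimates."""
    idx = (timestep, *obs, *action)
    num_visits[idx] += 1
    t = num_visits[idx]
    # Incremental mean of the rewards seen at this (step, state, action).
    rEst[idx] = (t - 1) / t * rEst[idx] + reward / t
    # Count the observed next state.
    pEst[idx + tuple(newObs)] += 1

record([0], [1], 1.0, [2], 0)
record([0], [1], 0.0, [1], 0)
```

After these two calls, num_visits[0, 0, 1] is 2, rEst[0, 0, 1] is 0.5, and pEst[0, 0, 1] holds one count each for next states 1 and 2.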

update_parameters(param)[source]: Update the scaling parameter.

Parameters

param – (int) The new scaling value to use

update_policy(k)[source]: Update internal policy based upon records
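
A model-based policy update is commonly done by backward induction over the estimated rewards and transitions. The sketch below is a generic version for a single state axis and a single action axis. The function name, the bonus form scaling / sqrt(n), the way alpha smooths the transition counts, and the clipping at epLen are all assumptions, not a description of the library's exact update.

```python
import numpy as np

def backward_induction(rEst, pEst, num_visits, epLen, scaling, alpha):
    """Recompute Q-values and values from the last step back to the first."""
    n_states, n_actions = rEst.shape[1], rEst.shape[2]
    qVals = np.zeros((epLen, n_states, n_actions))
    vVals = np.zeros((epLen + 1, n_states))  # extra zero row past the final step
    for h in reversed(range(epLen)):
        for s in range(n_states):
            for a in range(n_actions):
                n = num_visits[h, s, a]
                # Assumed smoothing: alpha acts as a pseudo-count per next state.
                p_hat = (pEst[h, s, a] + alpha) / (n + alpha * n_states)
                # Assumed exploration bonus that shrinks with the visit count.
                bonus = scaling / np.sqrt(max(n, 1))
                q = rEst[h, s, a] + bonus + p_hat @ vVals[h + 1]
                qVals[h, s, a] = min(q, epLen)
            vVals[h, s] = qVals[h, s].max()
    return qVals, vVals[:epLen]

# Hypothetical inputs: no visits yet, so transitions fall back to the uniform prior.
epLen, nS, nA = 2, 2, 2
rEst = np.zeros((epLen, nS, nA))
rEst[1, 0, 1] = 1.0
rEst[0, 1, 0] = 0.5
pEst = np.zeros((epLen, nS, nA, nS))
num_visits = np.zeros((epLen, nS, nA))
qVals, vVals = backward_induction(rEst, pEst, num_visits, epLen, scaling=0.0, alpha=1.0)
```

In this example the last-step values are [1, 0], so every first-step Q-value gains 0.5 from the uniform transition estimate, giving first-step values of [0.5, 1.0].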