The Oil Discovery Problem
Description
This problem, adapted from here, is a continuous variant of the “Grid World” environment. It consists of an agent surveying a d-dimensional map in search of hidden “oil deposits”. The world is endowed with an unknown survey function which encodes the probability of observing oil at each location. To move to a new location, the agent pays a cost proportional to the distance moved, and surveying the land produces noisy estimates of the true value of that location. In addition, due to varying terrain, the location the agent actually reaches is perturbed as a function of the state and action.
oil_problem.py is a \(d\)-dimensional reinforcement learning
environment in the space \(X = [0, 1]^d\). The action space is
\(A = [0,1]^d\), corresponding to the ability to attempt to move to
any desired location within the state space. A reward function
\(f_h(x,a)\) gives the reward for moving the agent to that location.
Moving also incurs an additional cost \(\alpha d(x,a)\) that scales
with the distance moved.
Dynamics
State Space
The state space is \(S = X^d\) where \(X = [0, 1]\) and there are \(d\) dimensions.
Action space
The agent chooses a location to move to, and so the action space is also \(A = X^d\) where \(X = [0,1]\) and there are \(d\) dimensions.
Reward
The reward is \(\text{oil prob}(s, a, h) - \alpha \sum_i |s_i - a_i|\) where \(s\) is the previous state of the system, \(a\) is the action chosen by the user, \(\text{oil prob}\) is a user-specified reward function, and \(\alpha\) dictates the cost tradeoff for movement. When \(\alpha = 0\), movement is free, so the optimal policy is simply to take the action that maximizes the resulting oil probability function.
More generally, the \(\alpha\) parameter allows the user to control how much to penalize the agent for moving.
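The reward formula above can be sketched in a few lines of NumPy. This is only an illustration, not the code in oil_problem.py: the helper name `reward` and the example function `example_oil_prob` are hypothetical, and the example oil probability is assumed to depend only on the target location.

```python
import numpy as np

def reward(s, a, h, oil_prob, alpha):
    """Oil probability minus alpha times the L1 distance moved."""
    s = np.asarray(s, dtype=float)
    a = np.asarray(a, dtype=float)
    movement_cost = alpha * np.sum(np.abs(s - a))
    return oil_prob(s, a, h) - movement_cost

# Hypothetical oil_prob for illustration: peaks at the center of the map
# and, by assumption, ignores the previous state s and the step h.
def example_oil_prob(s, a, h):
    return float(np.exp(-np.sum((np.asarray(a) - 0.5) ** 2)))

# Moving from (0.2, 0.2) to the peak at (0.5, 0.5) with alpha = 0.5:
# the oil probability is 1 and the movement cost is 0.5 * 0.6.
r = reward([0.2, 0.2], [0.5, 0.5], 0, example_oil_prob, alpha=0.5)

# With alpha = 0 the movement cost vanishes and only oil_prob matters.
r_free = reward([0.2, 0.2], [0.5, 0.5], 0, example_oil_prob, alpha=0.0)
```

Raising `alpha` makes distant targets less attractive, so an agent may prefer a nearby location with a slightly lower oil probability.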
Transitions
Given the state \(s\) at the start of the iteration and an action \(a\) chosen by the user, the next state will be \(s_{new} = a + \text{Normal}(0, \sigma(s,a,h))\) where \(\sigma(s,a,h)\) is a user-specified function corresponding to the variance in movement.
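A minimal sketch of this transition, under two stated assumptions: \(\sigma(s,a,h)\) is read as a variance (so the standard deviation passed to the sampler is its square root), and the noise is drawn independently in each dimension. The function name `next_state` is hypothetical, and the sketch does not project the result back into \([0,1]^d\), since the formula above does not say whether that happens.

```python
import numpy as np

def next_state(s, a, h, noise_variance, rng):
    """Return a + Normal(0, sigma(s, a, h)), sampled per dimension."""
    a = np.asarray(a, dtype=float)
    variance = noise_variance(s, a, h)
    # Assumption: sigma is a variance, so the sampler's scale is sqrt(variance).
    std = np.sqrt(variance)
    noise = rng.normal(0.0, std, size=a.shape)
    return a + noise

rng = np.random.default_rng(0)

# With zero variance the agent lands exactly on the requested location.
exact = next_state([0.1, 0.1], [0.4, 0.9], 0, lambda s, a, h: 0.0, rng)

# With positive variance the landing point is perturbed around the action.
noisy = next_state([0.1, 0.1], [0.4, 0.9], 0, lambda s, a, h: 0.01, rng)
```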
Environment
Metric
reset
Returns the environment to its original state.
step(action)
Takes an action from the agent and returns the state of the system.
Returns:
state: A list containing the new location of the agent
reward: The reward associated with the most recent action and event
pContinue:
info: empty
render
Currently unimplemented
close
Currently unimplemented
Init parameters for the oil discovery environment, passed in using a dictionary named CONFIG
epLen: the length of each episode
dim: the dimension of the problem
alpha: a float \(\in [0,1]\) that controls how heavily the cost of moving is weighted against the oil reward
oil_prob: a function corresponding to the reward for moving to a new location
noise_variance: a function corresponding to the variance for movement
starting_state: an element in \([0,1]^{dim}\)
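To make the parameters concrete, here is a hedged sketch of a CONFIG dictionary together with a small stand-in class that consumes it. `ToyOilEnv` is hypothetical and is not the class defined in oil_problem.py. It only mirrors the reset and step interface described above. Two things are assumptions rather than documented behavior: pContinue is treated as a flag that stays true until epLen steps have been taken, and noise_variance is treated as a variance whose square root is used as the standard deviation.

```python
import numpy as np

# Example values only; the keys follow the parameter list above.
CONFIG = {
    "epLen": 5,
    "dim": 1,
    "alpha": 0.5,
    # Hypothetical oil probability with a single peak at 0.7.
    "oil_prob": lambda s, a, h: float(np.exp(-np.sum(np.abs(np.asarray(a) - 0.7)))),
    # Zero variance keeps this example deterministic.
    "noise_variance": lambda s, a, h: 0.0,
    "starting_state": np.zeros(1),
}

class ToyOilEnv:
    """Stand-in environment for illustration, not the real implementation."""

    def __init__(self, config, seed=0):
        self.config = config
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Return the environment to its original state.
        self.state = np.asarray(self.config["starting_state"], dtype=float)
        self.timestep = 0
        return self.state

    def step(self, action):
        c = self.config
        a = np.asarray(action, dtype=float)
        h = self.timestep
        # Reward uses the previous state s and the chosen action a.
        reward = c["oil_prob"](self.state, a, h) - c["alpha"] * np.sum(np.abs(self.state - a))
        # Assumption: noise_variance is a variance, so take its square root.
        std = np.sqrt(c["noise_variance"](self.state, a, h))
        self.state = a + self.rng.normal(0.0, std, size=a.shape)
        self.timestep += 1
        # Assumption: pContinue is True while the episode is still running.
        p_continue = self.timestep < c["epLen"]
        return self.state, reward, p_continue, {}

env = ToyOilEnv(CONFIG)
state, reward, p_continue, info = env.step([0.7])
```

In this run the agent moves from 0 to the peak at 0.7, so the reward is the peak oil probability of 1 minus a movement cost of 0.5 times 0.7.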
Heuristic Agents
There are no currently implemented heuristic agents for this environment.