The Oil Discovery Problem

Description

This problem, adapted from here, is a continuous variant of the “Grid World” environment. It consists of an agent surveying a d-dimensional map in search of hidden “oil deposits”. The world is endowed with an unknown survey function that encodes the probability of observing oil at each location. To move to a new location, the agent pays a cost proportional to the distance moved, and surveying the land produces noisy estimates of the true value of that location. In addition, due to varying terrain, the location the agent actually reaches is perturbed as a function of the state and action.

oil_problem.py is a \(d\)-dimensional reinforcement learning environment in the space \(X = [0, 1]^d\). The action space \(A = [0,1]^d\) corresponds to the ability to attempt to move to any desired location within the state space. There is also a reward function \(f_h(x,a)\) giving the reward for moving the agent to that location. Moving also incurs an additional cost \(\alpha d(x,a)\) that scales with the distance moved.

Dynamics

State Space

The state space for this environment is \(S = X^d\) where \(X = [0, 1]\) and there are \(d\) dimensions.

Action space

The agent chooses a location to move to, and so the action space is also \(A = X^d\) where \(X = [0,1]\) and there are \(d\) dimensions.

Reward

The reward is \(\text{oil prob}(s, a, h) - \alpha \sum_i |s_i - a_i|\) where \(s\) is the previous state of the system, \(a\) is the action chosen by the user, \(\text{oil prob}\) is a user-specified reward function, and \(\alpha\) dictates the cost tradeoff for movement. The movement cost is therefore the \(\ell_1\) distance between \(s\) and \(a\), scaled by \(\alpha\). When \(\alpha = 0\), movement is free, so the optimal policy is simply to take the action that maximizes the oil probability function.

More generally, the \(\alpha\) parameter lets the user control how heavily the agent is penalized for moving.
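The reward formula above can be sketched in a few lines of plain Python. This is an illustration only, not the code in oil_problem.py: the names `oil_reward` and `example_oil_prob` are hypothetical, and the toy survey function (peaking at 0.75 in every coordinate) is an assumption chosen to make the arithmetic easy to follow.

```python
from typing import Callable, Sequence


def oil_reward(
    state: Sequence[float],
    action: Sequence[float],
    timestep: int,
    oil_prob: Callable[[Sequence[float], Sequence[float], int], float],
    alpha: float,
) -> float:
    """Return oil_prob(s, a, h) - alpha * sum_i |s_i - a_i|."""
    # Movement cost: l1 distance between the previous state and the target.
    move_cost = sum(abs(s_i - a_i) for s_i, a_i in zip(state, action))
    return oil_prob(state, action, timestep) - alpha * move_cost


def example_oil_prob(state, action, timestep):
    """Hypothetical survey function: highest when every coordinate is 0.75."""
    return 1.0 - max(abs(a_i - 0.75) for a_i in action)


# With alpha = 0 only the survey value matters; with alpha > 0 the same
# move is worth less because of the distance travelled.
free_move = oil_reward([0.0, 0.0], [0.75, 0.75], 0, example_oil_prob, alpha=0.0)
costly_move = oil_reward([0.0, 0.0], [0.75, 0.75], 0, example_oil_prob, alpha=0.5)
```

In this toy setup the agent travels an \(\ell_1\) distance of 1.5, so raising \(\alpha\) from 0 to 0.5 lowers the reward for the same action by 0.75.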

Transitions

Given the state \(s\) at the start of the iteration and an action \(a\) chosen by the user, the next state is
\[ s_{new} = a + \text{Normal}(0, \sigma(s,a,h)) \]
where \(\sigma(s,a,h)\) is a user-specified function corresponding to the variance in movement.
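A minimal sketch of this transition, using only the standard library, might look as follows. Two points are assumptions rather than facts from the text: the value returned by the noise function is passed directly as the scale argument of `random.gauss` (which expects a standard deviation), and the result is clipped back into \([0,1]^d\) so that it remains a valid state. The helper name `next_state` is hypothetical.

```python
import random


def next_state(state, action, timestep, noise_variance, rng=random):
    """Sample s_new = a + Normal(0, sigma(s, a, h)), coordinate by coordinate."""
    sigma = noise_variance(state, action, timestep)
    # Assumption: sigma is used as the scale (standard deviation) of the noise.
    noisy = [a_i + rng.gauss(0.0, sigma) for a_i in action]
    # Assumption: clip to [0, 1] so the new state stays inside the state space.
    return [min(1.0, max(0.0, x)) for x in noisy]


# With zero noise the agent lands exactly where it aimed.
exact = next_state([0.2, 0.2], [0.5, 0.9], 0, lambda s, a, h: 0.0)

# With noise, a seeded generator makes the perturbed move reproducible.
perturbed = next_state([0.2], [0.95], 0, lambda s, a, h: 0.5, random.Random(0))
```

Passing a `random.Random` instance as `rng` keeps experiments repeatable without touching the global random state.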

Environment

Metric

reset

Returns the environment to its original state.

step(action)

Takes an action from the agent and returns the state of the system.

Returns:

state: A list containing the new location of the agent
reward: The reward associated with the most recent action and event
pContinue:
info: empty

render

Currently unimplemented

close

Currently unimplemented

Init parameters for the oil discovery environment, passed in using a dictionary named CONFIG

epLen: the length of each episode
dim: the dimension of the problem
alpha: a float \(\in [0,1]\) that scales the cost of moving relative to the oil reward
oil_prob: a function corresponding to the reward for moving to a new location
noise_variance: a function corresponding to the variance for movement
starting_state: an element in \([0,1]^{dim}\)
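
As an illustration of the parameters listed above, the following sketch builds a one-dimensional CONFIG dictionary and checks it against the stated constraints. The values, the lambda functions, and the `check_config` helper are all hypothetical; they are not part of oil_problem.py, and the argument order `(x, a, h)` for the two functions is assumed from the notation used in the Reward and Transitions sections.

```python
# Hypothetical example values for the six parameters listed above.
CONFIG = {
    "epLen": 5,
    "dim": 1,
    "alpha": 0.5,
    "oil_prob": lambda x, a, h: 1.0 - abs(a[0] - 0.75),  # toy survey function
    "noise_variance": lambda x, a, h: 0.0,               # no movement noise
    "starting_state": [0.0],
}

REQUIRED_KEYS = {
    "epLen", "dim", "alpha", "oil_prob", "noise_variance", "starting_state",
}


def check_config(config):
    """Hypothetical helper: raise ValueError if config contradicts the list above."""
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not 0.0 <= config["alpha"] <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    start = config["starting_state"]
    if len(start) != config["dim"]:
        raise ValueError("starting_state must have exactly dim coordinates")
    if any(not 0.0 <= x <= 1.0 for x in start):
        raise ValueError("starting_state must lie in [0, 1]^dim")
    for name in ("oil_prob", "noise_variance"):
        if not callable(config[name]):
            raise ValueError(f"{name} must be a function")
    return config


validated = check_config(CONFIG)
```

Checking the dictionary before constructing the environment surfaces mistakes such as a starting state of the wrong dimension early, rather than partway through an episode.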

Heuristic Agents

There are no currently implemented heuristic agents for this environment.