POMDPs

At (discrete) time step $t$, the environment is assumed to be in some state $X_t$. The agent then performs an action (control) $A_t$, whereupon the environment (stochastically) changes to a new state $X_{t+1}$. The agent doesn't see the environment state, but instead receives an observation $Y_t$, which is some (stochastic) function of $X_t$. (If $Y_t = X_t$, the POMDP reduces to a fully observed MDP.) In addition, the agent receives a special observation signal called the reward, $R_t$. The POMDP is characterized by the state transition function $P(X_{t+1} \mid X_t, A_t)$, the observation function $P(Y_t \mid X_t, A_{t-1})$, and the reward function $E(R_t \mid X_t, A_{t-1})$. The goal of the agent is to learn a policy $\pi$ which maps the observation history (trajectory) into an action $A_t$ to maximize $\pi$'s quality or value.
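The three functions that characterize a POMDP can be made concrete with a small tabular sketch. The example below is only an illustration under stated assumptions: the two-state, two-action, two-observation model and every number in `T`, `O`, and `R` are hypothetical, as are the helper names `sample`, `step`, and `belief_update`. It simulates the environment and maintains a belief (a posterior over the hidden state), which is a standard way to summarize the observation history.

```python
import random

# Hypothetical 2-state, 2-action, 2-observation POMDP; all numbers are illustrative.
# T[a][x][x2] = P(X_{t+1} = x2 | X_t = x, A_t = a)
T = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.5, 0.5]],
]
# O[a][x][y] = P(Y_t = y | X_t = x, A_{t-1} = a)
O = [
    [[0.85, 0.15], [0.15, 0.85]],
    [[0.5, 0.5], [0.5, 0.5]],
]
# R[a][x] = E(R_t | X_t = x, A_{t-1} = a)
R = [
    [1.0, -1.0],
    [0.0, 0.0],
]


def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def step(x, a, rng):
    """Simulate one step: returns (next state, observation, expected reward)."""
    x_next = sample(T[a][x], rng)   # environment moves to a new hidden state
    y = sample(O[a][x_next], rng)   # agent sees only a noisy observation of it
    r = R[a][x_next]                # expected reward in the new state
    return x_next, y, r


def belief_update(b, a, y):
    """Bayes filter: b'(x') is proportional to P(y | x', a) * sum_x P(x' | x, a) * b(x)."""
    n = len(b)
    predicted = [sum(T[a][x][x2] * b[x] for x in range(n)) for x2 in range(n)]
    unnorm = [O[a][x2][y] * predicted[x2] for x2 in range(n)]
    z = sum(unnorm)
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / z for u in unnorm]


rng = random.Random(0)
x, b = 0, [0.5, 0.5]  # true hidden state and the agent's initial belief
for t in range(5):
    a = 0  # fixed action for illustration; a real policy would choose it from the history
    x, y, r = step(x, a, rng)
    b = belief_update(b, a, y)
```

The agent code never reads `x` directly; it only uses `y` and `r`, which mirrors the partial observability described above. Replacing the fixed action with a function of `b` would give a belief-based policy.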


