RL Fundamentals

An introduction to core RL concepts

Markov Chains and Markov Decision Processes

The Markov Chain is a common model for many different problems, and it is closely related to the Markov Decision Process, the decision model widely used in reinforcement learning. The main difference between a Markov Chain (MC) and a Markov Decision Process (MDP) is whether an action is included as a cause of state transitions. In an MC, each node is a state variable s (either discrete or continuous), and there is a transition operator T that defines the probability of transitioning from each state to any other state: p(s_{t+1} | s_t).

It's important to assume the Markov property in this world model: the probability of the next state depends only on the current state, and is conditionally independent of all earlier (and future) states. This allows us, given a current state distribution (defined by the probability of each state s_i in the state space) and the operator T, to find the state distribution at the next time-step t+1.
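As a concrete illustration (a minimal sketch, not from the lecture), here is what this looks like for a hypothetical 3-state discrete chain in Python/NumPy: the transition operator is a matrix whose columns are conditional distributions, and propagating the state distribution one step is just a matrix-vector product.

```python
import numpy as np

# Hypothetical 3-state Markov chain; T[i, j] = p(s_{t+1} = i | s_t = j),
# so each column sums to 1.
T = np.array([
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.7],
])

# Current state distribution mu_t (a probability vector over the 3 states).
mu_t = np.array([1.0, 0.0, 0.0])

# One application of the transition operator gives the distribution at t+1:
# mu_{t+1} = T mu_t.
mu_next = T @ mu_t
print(mu_next)  # [0.9, 0.1, 0.0]
```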

MDPs are different because we now attach an action to each state and assume each next state is a result of both the current state AND the current action: s_{t+1} is conditioned on the pair (s_t, a_t), i.e. transitions follow p(s_{t+1} | s_t, a_t). Additionally, we add a reward function that maps each state-action pair (s, a) to a scalar value r.
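A minimal way to represent a tabular MDP, with hypothetical sizes and names chosen only for illustration: the dynamics become a 3-D array indexed by the action as well, and the reward is a table over state-action pairs.

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Dynamics P[a, s_next, s] = p(s_{t+1} = s_next | s_t = s, a_t = a);
# for each (a, s), the probabilities over s_next must sum to 1.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

# Reward table R[s, a]: maps each state-action pair to a scalar.
R = rng.normal(size=(n_states, n_actions))

def step_distribution(mu_t, a):
    """Given a state distribution mu_t and a fixed action a,
    return the state distribution at the next time-step."""
    return P[a] @ mu_t
```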

One more consideration: sometimes the states of the environment are not fully observable, and the agent instead only receives an observation, which reflects partial information about the current state.
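For instance (a made-up example, not from the lecture): the true state might contain the agent's exact position and velocity, while the agent only observes a noisy reading of the position.

```python
import numpy as np

def observe(state, noise_std=0.1):
    """The agent never sees `state` directly; it only gets a noisy, partial
    observation (here: position corrupted by Gaussian noise, velocity hidden)."""
    position, velocity = state
    return position + np.random.normal(0.0, noise_std)

obs = observe((1.0, -0.5))  # the agent acts on this, not on the full state
```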

Now we are ready to define the reinforcement learning problem. In our decision model for navigating the world, we don't control the transition from the current state and action to the next state, but we do choose which action a to take at each time-step, and we can either define or derive a reward function that maps each state-action pair to a scalar r indicating how desirable it is as part of a trajectory towards some goal state. Our learning objective, then, is to find the best action-choosing strategy, called a policy and denoted \pi, that maximizes the total reward collected along a trajectory, i.e. a series of state-action pairs experienced under that policy.
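Written out in the standard notation (assuming a policy parameterized by \theta, which the text above describes in words), the objective is to find the parameters that maximize the expected sum of rewards over the trajectories that the policy induces:

```latex
\theta^{\star} = \arg\max_{\theta} \;
\mathbb{E}_{\tau \sim p_{\theta}(\tau)}
\left[ \sum_{t} r(s_t, a_t) \right],
\qquad
\tau = (s_1, a_1, s_2, a_2, \dots)
```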

Note how often we use probability in RL: a policy is defined as a distribution over actions conditioned on a given state. This is more general than deterministically assigning exactly one action to each state: it is easier to differentiate and accounts for cases where some randomness in behavior is desired; if we want a deterministic policy, we can simply assign probability 1 to one action and 0 to everything else.
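A quick sketch of what "policy as an action distribution" can look like in code (the linear scoring function and shapes are placeholders, not from the lecture): a softmax over per-action scores gives a stochastic policy, and a deterministic policy is the special case where all probability mass sits on one action.

```python
import numpy as np

n_actions, state_dim = 4, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(n_actions, state_dim))  # placeholder policy parameters

def policy(state):
    """Return pi(a | s): a probability distribution over actions."""
    logits = W @ state                       # one score per action
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    return probs / probs.sum()

state = rng.normal(size=state_dim)
probs = policy(state)
action = rng.choice(n_actions, p=probs)      # sample an action

# A deterministic policy is the special case with probability 1 on one action:
greedy_action = int(np.argmax(probs))
```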

Now, the state-action nodes still form a Markov Chain, but as shown above, each transition probability p((s_{t+1}, a_{t+1}) | (s_t, a_t)) can be factored so that we have control over part of it through the policy.
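Concretely, this is the standard factorization: grouping each state-action pair into one node, the transition between consecutive nodes splits into the environment dynamics (which we do not control) and the policy (which we do):

```latex
p\big((s_{t+1}, a_{t+1}) \,\big|\, (s_t, a_t)\big)
  = p(s_{t+1} \mid s_t, a_t)\; \pi(a_{t+1} \mid s_{t+1})
```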

Anatomy of RL Algorithms

As shown above, every RL algorithm that tries to solve this learning task can be separated into roughly three parts that run iteratively until a desired policy is obtained: generate samples by running the policy in the environment, fit a model / estimate the return from those samples, and use that to improve the policy. This diagram is fairly straightforward, so I'll stop here for now; later, with more concrete examples, we'll see how different algorithms take different approaches to these sub-tasks.
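As a rough skeleton of that loop (the helper names below are placeholders for illustration, not a real library API), most algorithms alternate between the three parts until the policy is good enough:

```python
def train(env, policy, n_iterations=100):
    """Generic anatomy of an RL algorithm; the three parts below are
    instantiated differently by different algorithm families."""
    for _ in range(n_iterations):
        # 1. Generate samples: run the current policy in the environment.
        trajectories = collect_trajectories(env, policy)   # placeholder

        # 2. Fit a model / estimate returns from the collected samples
        #    (e.g. a value function, a Q-function, or a dynamics model).
        estimates = evaluate(trajectories)                  # placeholder

        # 3. Improve the policy using those estimates.
        policy = improve(policy, estimates)                 # placeholder

    return policy
```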

All included screenshots are credited to the lecture 4 slides.
