Special thanks to this OpenAI paper and this YouTube video.

MDP (Markov Decision Process)
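For reference, an MDP is commonly written as a tuple (this is the standard textbook definition, not taken verbatim from the sources above):

$$
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)
$$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P(s' \mid s, a)$ the transition probabilities, $R(s, a)$ the reward function, and $\gamma \in [0, 1)$ the discount factor.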

Goal

<aside>

Now that we have identified the models and the goal of RL, we can march forward to the next step:

</aside>

Finding the optimal policy that maximizes the expected cumulative reward
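Written out (one standard formulation; here $\tau = (s_0, a_0, s_1, a_1, \dots)$ denotes a trajectory obtained by following $\pi$, $r_t$ is the reward at step $t$, and $\gamma$ is the discount factor):

$$
\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right]
$$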

Policy

<aside>

As with every model we train, we have no idea at the outset what the optimal policy is. Finding the optimal policy ($\pi^*$) is exactly the goal of the algorithms we introduce later, all of which are built on the oh-so-scary: Value functions (a quick preview follows this aside).

</aside>
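As a preview (the standard definition; the sections below develop it properly), the value of a state $s$ under a policy $\pi$ is the expected cumulative discounted reward obtained by starting in $s$ and following $\pi$ thereafter:

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s \right]
$$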

Dynamic Programming

Value Functions and the Bellman Equations