Special thanks to this OpenAI paper and this YouTube video.
<aside>
Now that we have identified the models and the goal of RL, we can march forward to the next step:
</aside>
Finding the optimal policy that maximizes the expected cumulative reward
<aside>
As in every model-training setting, we have no idea what the optimal policy is up front, and finding the optimal policy ($\pi^*$) is exactly the goal of the algorithms we will introduce later, built on the oh-so-scary: Value functions.
</aside>
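To make "maximizing the expected cumulative reward" concrete, here is a minimal sketch (not from the original text; the MDP, its transitions, and rewards are made up for illustration) that brute-forces $\pi^*$ on a tiny two-state, two-action MDP with value iteration:

```python
# Toy illustration: a tiny deterministic 2-state MDP where we recover the
# optimal policy pi* via value iteration. All numbers are hypothetical.

GAMMA = 0.9  # discount factor

# P[s][a] = (next_state, reward); deterministic transitions for simplicity.
# States: 0, 1. Actions: 0 ("stay"), 1 ("switch").
P = {
    0: {0: (0, 0.0), 1: (1, 1.0)},
    1: {0: (1, 2.0), 1: (0, 0.0)},
}

def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality update: best one-step reward plus
            # discounted value of the resulting state.
            best = max(r + GAMMA * V[s2] for s2, r in P[s].values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The policy that acts greedily w.r.t. the converged V is optimal.
    pi = {s: max(P[s], key=lambda a: P[s][a][1] + GAMMA * V[P[s][a][0]])
          for s in P}
    return V, pi

V, pi = value_iteration()
print(pi)  # optimal action in each state
print(V)   # optimal state values
```

Here the agent never sees the transition table "solved" in advance; value iteration discovers that switching out of state 0 and then staying in state 1 maximizes the discounted return, which is the same objective the later algorithms pursue with learned, rather than enumerated, value functions.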