
CS 3891/CS 5891: Introduction to Reinforcement Learning

Course Description: This course introduces students to the theory and practice of Reinforcement Learning (RL). RL problems involve learning what to do, i.e., how to map situations to actions so as to maximize a numerical reward signal. The course covers model-based and model-free reinforcement learning methods, especially those based on temporal-difference learning and policy gradient algorithms, including the essentials of RL theory and its applications to real-world sequential decision problems. RL is an essential part of fields ranging from modern robotics to game playing (e.g., poker, Go, and StarCraft), and RL applications are now being extended to the control of complex cyber-physical systems (CPS) that operate in continuous time. The material covered in this class provides an understanding of the core fundamentals of reinforcement learning, preparing students to apply it to problems of their choosing and to understand modern RL research.
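
The agent-environment loop at the heart of this description can be made concrete in a few lines of Python. The sketch below uses the Gymnasium library and its CartPole-v1 task purely as an illustration (neither is a course requirement), and the agent simply acts at random; the course is about replacing that random choice with learned behavior.

```python
# Minimal RL interaction loop: observe a state, choose an action, receive a
# numerical reward. Assumes the third-party `gymnasium` package and the
# CartPole-v1 environment purely for illustration.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random policy; RL learns to do better
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return: {total_reward}")
env.close()
```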

Text: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.

Topics covered in this course will include:

  • What is Reinforcement Learning?
    • How does it differ from supervised and unsupervised learning?
    • Maximizing a reward signal; Exploration versus Exploitation
    • Goal-directed agent operating in an uncertain environment
    • Sequential decision-making
  • Sequential Decision Making (Agent-Environment interactions)
    • Finite Markov Decision Processes
    • Goals and Rewards (Markov Reward Process)
    • Value Functions (Bellman equations) and Policies
    • Optimal Solutions (Bellman Optimality equation)
  • Dynamic Programming (Optimal Policy Computation)
    • Policy Evaluation
    • Policy Improvement
    • Policy Iteration
    • Value Iteration (a minimal sketch follows this topic list)
    • Asynchronous Dynamic Programming
    • Generalized Policy Iteration (GPI)
  • Monte Carlo (MC) methods (when we don’t have a model of how the world works)
    • MC Prediction
    • MC Estimation of Action Values
    • MC Control (On-Policy method): first-visit and every-visit algorithms
    • Off-Policy MC Control using Importance Sampling
  • Temporal-Difference (TD) Learning (a combination of Monte Carlo and dynamic programming methods)
    • TD(0) learning algorithm
    • SARSA: On-policy TD control
    • Q-learning: Off-policy TD control (sketched after this topic list)
    • Extensions to n-step Bootstrapping
  • On-Policy Prediction with Approximation
    • Value Function Approximation (VFA)
      • Linear VFA
      • MC VFA
      • TD(0) VFA
    • Deep RL with VFA
      • Deep NN, CNN, DQN
      • Deep NN representations of value and Q functions, policy, and model
    • Deep Reinforcement Learning with Double Q-Learning
    • Prioritized Experience Replay
    • Dueling Network Architectures for Deep Reinforcement Learning
  • Policy-Gradient Methods
    • Policy Approximations
    • Policy Search methods
    • Gradient-Free Methods
    • Finite Difference Methods
    • Likelihood Ratio Policy Gradient
    • REINFORCE: MC Policy Gradient
    • Actor-Critic Methods: A3C
    • Updating Parameters Given the Gradient: TRPO
  • Fast Learning
    • Introduction to Multi-armed Bandits
    • Multi-armed Bandit Greedy Algorithm (see the sketch following this topic list)
    • Bayesian Bandits
    • Bayesian Regret and Probability Matching
    • Fast RL in MDPs
    • Fast RL in Bayesian MDPs
  • Batch RL
    • Introduction and Batch RL setting
    • Offline batch evaluation using models
    • Offline batch evaluation using Q functions
    • Offline batch evaluation using importance sampling
  • Monte Carlo Tree Search (MCTS)
    • Simulation-based Search
    • Application – Game of Go
  • Applications in Game Playing, Planning, and Control
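
To give a flavor of the dynamic programming topics above, here is a minimal value iteration sketch on a two-state tabular MDP; the transition probabilities and rewards are made up purely for illustration.

```python
# Value iteration on a tiny, made-up tabular MDP (illustrative only).
# P[s][a] is a list of (probability, next_state, reward) transitions.
import numpy as np

P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions, gamma, theta = 2, 2, 0.9, 1e-8

def backup(s, a, V):
    # Expected one-step return: sum_{s'} p(s'|s,a) * (r + gamma * V(s'))
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

V = np.zeros(n_states)
while True:
    delta = 0.0
    for s in range(n_states):
        best = max(backup(s, a, V) for a in range(n_actions))  # Bellman optimality backup
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

policy = [int(np.argmax([backup(s, a, V) for a in range(n_actions)])) for s in range(n_states)]
print("V* =", V, "greedy policy =", policy)
```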
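
Similarly, tabular Q-learning from the temporal-difference block can be sketched in a few lines; the Gymnasium FrozenLake-v1 environment and the step-size and exploration parameters below are illustrative assumptions, not course specifications.

```python
# Tabular Q-learning (off-policy TD control) on Gymnasium's FrozenLake-v1;
# hyperparameters are illustrative guesses.
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy behavior policy
        a = env.action_space.sample() if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Q-learning update: bootstrap from the greedy action in the next state
        target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print("Greedy policy (4x4 grid):")
print(np.argmax(Q, axis=1).reshape(4, 4))
```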
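
Finally, the greedy/epsilon-greedy bandit algorithm from the fast-learning block reduces to an incremental sample-average update; the ten Gaussian arms below are invented for illustration.

```python
# Epsilon-greedy action selection for a 10-armed bandit with made-up
# Gaussian reward distributions (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.normal(0.0, 1.0, size=10)  # hidden value of each arm
Q = np.zeros(10)                            # estimated action values
N = np.zeros(10)                            # pull counts
epsilon = 0.1

for t in range(2000):
    # explore with probability epsilon, otherwise exploit the current estimates
    a = int(rng.integers(10)) if rng.random() < epsilon else int(np.argmax(Q))
    r = rng.normal(true_means[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]               # incremental sample-average update

print("true best arm:", int(np.argmax(true_means)), "| estimated best arm:", int(np.argmax(Q)))
```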