csc490

CSC 490 Winter 2026

ML Engineering Capstone

This hands-on course combines ML engineering theory with hands-on building of ML and AI systems. Topics include AI product management, cloud infrastructure, model training, model inference, and ML systems design.

Throughout the semester, guest lectures will provide industry insight on topics covered in class. The class will culminate in a final presentation on a research- or product-based project.

More details can be found in the syllabus and on Piazza.

Announcements:

A5 released!
A4 released!
A3 released!
Project guidelines released!
A3 deadline extended to Mar 6th
A2 deadline extended to Feb 20th
Virtual lecture on Jan 26th (snow day)!
Lectures begin on Jan 5th!

Instructors:

Instructor	Denys Linkov
Email	csc490-2026-01@cs.toronto.edu
Office hours	By Appointment

Teaching Assistants:

Wen Xu
Alice Chua
Angela Sha

Time & Location:

Section	Lecture	Tutorial
CSC490H1-W-LEC5101	M 6-9pm @ DSCIL	N/A
CSC490H1-W-LEC5201	W 6-9pm @ DSCIL	N/A

Lectures and timeline

Week	Lectures	Suggested reading	Tutorials	Guest Lectures
1	The AI Landscape and AI product management	Strategyzer — The Value Proposition Canvas; Brown et al. — Language Models Are Few-Shot Learners (2020); Dettmers — bitsandbytes: 8-bit optimizers / quantization (2023); Radford et al. — Learning Transferable Visual Models from Natural Language Supervision (2021); Ambrosio — Achieving Human Level Competitive Robot Table Tennis (2024); Xia — Pubsub latency comparison article (2021); Doshi — Good Product Managers, Great (2020); Rachitsky — Product Management: Startup vs Big Company (2020); First Round Review — How to craft your product team at every stage		Ivan Zhang - Cohere
2	Intro to Docker, Kubernetes, cloud, Terraform and architecture diagrams	Tokenizer Example; Sebastian Raschka - From GPT-2 to gpt-oss: Analyzing the Architectural Advances (2025); Kubernetes Architecture; Chen et al. - Extending Context Window of Large Language Models via Position Interpolation (2023); DZone Feature Flags (2015); DHH - Why we left the cloud (2023)	Kubernetes Tutorial	Jeff Wang - Windsurf/Cognition; Kashaf Salaheen - HashiCorp; Paul Richardson - Eng Leader
3	Evaluating ML products			Michael Jodha - RBC
4	Prompting and constrained decoding; CSC412 MoE Lecture; CSC412 Website	Sutskever et al. — Sequence to Sequence Learning with Neural Networks (2014); McCann et al. — The Natural Language Decathlon: Multitask Learning as Question Answering (2018); Radford et al. — Language Models are Unsupervised Multitask Learners (2019); Raffel et al. — Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019); Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020); He et al. — Defeating Nondeterminism in LLM Inference (2025); Beurer-Kellner et al. — Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation (2024); Snell et al. — Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (2024); Wallace et al. — The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (2024)	Constrained decoding Colab	Debangshu Banerjee - UIUC
5	Model serving deep dive 1 - LoRA and Model Serving	Flash Attention - (Dao, 2022); Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference - (Recasens, 2025); NVIDIA Hopper Architecture In-Depth - (NVIDIA, 2022); Optimizing BERT Inference; BERT inference on G4 instances using Apache MXNet and GluonNLP: 1 million requests for 20 cents (AWS, 2020); Set fit batch sizes (Hugging Face); Parallelism Forms (NVIDIA, 2023); LoRA: Low-Rank Adaptation of Large Language Models (Hu, 2021); Host concurrent LLMs with LoRAX - (AWS, 2025)	vLLM Serving tutorial
6	Feature stores and Evaluation metrics	What is a feature store - Tecton; Just Use Postgres for Everything - (Stephan Schmidt, 2025); Offline to Online: Feature Storage for Real-time Recommendation Systems with NVIDIA Merlin - (Partee, 2023); Optimal Feature Discovery: Better, Leaner Machine Learning Models Through Information Theory - (Uber, 2021); From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix - (Netflix, 2025); Google ML Glossary	Eval tutorial	Susan Shu Chang - Elastic
7	Reading week (No Class/Tutorial)
8	Model training - Pre-training, SFT	Hoffmann et al. — Training Compute-Optimal Large Language Models (2022); Grattafiori et al. — The Llama 3 Herd of Models (2024); Raffel et al. — Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019); Goyal et al. — Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (2018); Chowdhery et al. — PaLM: Scaling Language Modeling with Pathways (2022); Micikevicius et al. — Mixed Precision Training (2018); NVIDIA — Pretraining Large Language Models with NVFP4 (2025); Wang et al. — Text Embeddings by Weakly-Supervised Contrastive Pre-training (2024); Vera et al. — EmbeddingGemma: Powerful and Lightweight Text Representations (2025); Schroff et al. — FaceNet: A unified embedding for face recognition and clustering (2015); Mattson et al. — MLPerf Training Benchmark (2020); Allal et al. — SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model (2025)	Nanochat Tutorial	Carlos Arguelles - Amazon; Cameron R. Wolfe - Netflix
9	Model training - Reinforcement learning and Prompting, PPO, GRPO, RLHF	Schulman et al. — Proximal Policy Optimization Algorithms (2017); Ziegler et al. — Fine-Tuning Language Models from Human Preferences (2019); Ouyang et al. — Training language models to follow instructions with human feedback (InstructGPT) (2022); Rafailov et al. — Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023); Guo et al. — DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024); Guo et al. — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025); Fu et al. — AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning (2025); Kimi Team — Kimi K2: Open Agentic Intelligence (2025); Piché et al. — PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation (2025); Khatri et al. — The Art of Scaling Reinforcement Learning Compute for LLMs (2025); Ling Team — Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model (2025); Alibaba — Qwen 3.5: Towards Native Multimodal Agents (2026); Abouzaid et al. — First Proof (2026); Sebastian Raschka — LLM Research Papers 2025 List (2025)		Will Brown - Prime Intellect
10	Scalability of ML Systems	AWS — The Difference Between ETL and ELT; Crunchy Data — Parquet and Postgres in the Data Lake; Bengio et al. — Curriculum Learning (2009); Sachdeva et al. — Data-Juicer: A One-Stop Data Processing System for Large Language Models (2024); Liu et al. — RegMix: Mixture-of-Experts for Data-Efficient Language Model Pre-training (2024); Allal et al. — SmolLM2: Scorey, Sassy, and Smol (2024); Penedo et al. — The FineWeb Datasets: Decanting the Web for the Finest 15T Tokens (2024); Cormack et al. — Reciprocal Rank Fusion out-performs Condorcet and individual Rank Learning Methods (2009); Thakur et al. — BEIR: A Heterogeneous Benchmark for Information Retrieval (2021); Muennighoff et al. — MTEB: Massive Text Embedding Benchmark (2023); Enevoldsen et al. — MTEB 2: The Next Generation of Text Embedding Evaluation (2025)	Tutorial on MoE and Expert Parallelism	Marcel Kornacker - Pixeltable; Tyler Han - Voiceflow
11	Model serving deep dive 2 - Speculative decoding & KV caching, Quantization and CUDA	Sebastian Raschka — Coding the KV Cache in LLMs (2024); Xiao et al. — SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022); NVIDIA — Mastering LLM Techniques: Inference Optimization (2023); Kwon et al. — Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) (2023); Zhong et al. — DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving (2024); Hu et al. — Inference without Interference: Disaggregate LLM Serving for Higher Throughput and Lower Latency (2024); Leviathan et al. — Fast Inference from Transformers via Speculative Decoding (2023); Cai et al. — Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (2024); Bachmann et al. — Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment (2025); Li et al. — EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test; OpenAI — Introducing gpt-oss: Open-weight Reasoning Models (2025); Dettmers et al. — LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022); Maarten Grootendorst — A Visual Guide to Quantization (2024); NVIDIA — Optimizing LLMs for Performance and Accuracy with Post-Training Quantization (2023); NVIDIA — Pretraining Large Language Models with NVFP4 (2025); Liu et al. — LLM-QAT: Data-Free Quantization Aware Training for Large Language Models (2023); Dettmers et al. — QLoRA: Efficient Finetuning of Quantized LLMs (2023); NVIDIA — Blackwell InferenceMax Benchmark Results (2025); NVIDIA — CUDA C Programming Guide; Modal — GPU Glossary; NVIDIA — NVIDIA Hopper Architecture Whitepaper (2022); NASA — Basics on NVIDIA GPU Hardware Architecture	Speculative Decoding Tutorial	Adil Asif - NVIDIA; Chris Smith - AMD
12	Search and Recommender systems			Devansh Tandon - Meta
13	Final Presentations

Assignments

Assignment #	Out	Due
Assignment 1 — 10%	Jan 5	Jan 23 - 10:59pm
Assignment 2 — 10%	Jan 24	Feb 20 - 10:59pm
Assignment 3 — 10%	Feb 14	Mar 6 - 10:59pm
Assignment 4 — 10%	Feb 22	Mar 13 - 10:59pm
Assignment 5 — 10%	Mar 14	Mar 27 - 10:59pm
Project	Feb 22	Mar 30, Apr 1 - 6pm