CSC 490 Winter 2026

ML Engineering Capstone

This hands-on course combines ML engineering theory with the practice of building ML and AI systems. Topics include AI product management, cloud infrastructure, model training, model inference, and ML systems design.

Throughout the semester, guest lectures will provide industry insight on topics covered in class. The class will culminate in a final presentation on a research-based or product-based project.

More details can be found in the syllabus and on Piazza.


Announcements:


Instructors:

Instructor Denys Linkov
Email csc490-2026-01@cs.toronto.edu
Office hours By Appointment

Teaching Assistants:


Time & Location:

Section Lecture Tutorial
CSC490H1-W-LEC5101 M 6-9pm @ DSCIL N/A
CSC490H1-W-LEC5201 W 6-9pm @ DSCIL N/A

Suggested Reading

There are no required textbooks. Suggested reading will be posted after each lecture (see "Lectures and timeline" below).


Lectures and timeline

Week Lectures Suggested reading Tutorials Guest Lectures  
1 The AI Landscape and AI product management Strategyzer — The Value Proposition Canvas

Brown et al. — Language Models Are Few‑Shot Learners (2020)

Dettmers — bitsandbytes: 8‑bit optimizers / quantization (2023)

Radford et al. — Learning Transferable Visual Models from Natural Language Supervision (2021)

D'Ambrosio et al. — Achieving Human Level Competitive Robot Table Tennis (2024)

Xia — Pubsub latency comparison article (2021)

Doshi — Good Product Managers, Great Product Managers (2020)

Rachitsky — Product Management: Startup vs Big Company (2020)

First Round Review — How to craft your product team at every stage
  Ivan Zhang - Cohere  
2 Intro to Docker, Kubernetes, cloud, Terraform and architecture diagrams Tokenizer Example

Sebastian Raschka - From GPT-2 to gpt-oss: Analyzing the Architectural Advances (2025)

Kubernetes Architecture

Chen et al. - Extending Context Window of Large Language Models via Position Interpolation (2023)

DZone Feature Flags (2015)

DHH - Why we left the cloud (2023)
Kubernetes Tutorial Jeff Wang - Windsurf/Cognition

Kashaf Salaheen - HashiCorp

Paul Richardson - Eng Leader
 
3 Evaluating ML products     Michael Jodha - RBC  
4 Prompting and Constrained decoding

CSC412 MoE Lecture

CSC412 Website
Sutskever et al. — Sequence to Sequence Learning with Neural Networks (2014)

McCann et al. — The Natural Language Decathlon: Multitask Learning as Question Answering (2018)

Radford et al. — Language Models are Unsupervised Multitask Learners (2019)

Raffel et al. — Exploring the Limits of Transfer Learning with a Unified Text‑to‑Text Transformer (2019)

Lewis et al. — Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks (2020)

He et al. — Defeating Nondeterminism in LLM Inference (2025)

Beurer-Kellner et al. — Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation (2024)

Snell et al. — Scaling LLM Test‑Time Compute Optimally can be More Effective than Scaling Model Parameters (2024)

Wallace et al. — The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (2024)
Constrained decoding Colab Debangshu Banerjee - UIUC  
5 Model serving deep dive 1 - LoRA and Model Serving Flash Attention - (Dao, 2022)

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference - (Recasens, 2025)

NVIDIA Hopper Architecture In-Depth - (Nvidia, 2022)

Optimizing BERT Inference

BERT inference on G4 instances using Apache MXNet and GluonNLP: 1 million requests for 20 cents (AWS, 2020)

SetFit batch sizes (Hugging Face)

Parallelism Forms (Nvidia, 2023)

LoRA: Low-Rank Adaptation of Large Language Models - (Hu, 2021)

Host concurrent LLMs with LoRAX - (AWS, 2025)
vLLM Serving tutorial    
6 Feature stores and Evaluation metrics What is a feature store - Tecton

Just Use Postgres for Everything - (Stephan Schmidt, 2025)

Offline to Online: Feature Storage for Real-time Recommendation Systems with NVIDIA Merlin - (Partee, 2023)

Optimal Feature Discovery: Better, Leaner Machine Learning Models Through Information Theory - (Uber, 2021)

From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix - (Netflix, 2025)

Google ML Glossary
Eval tutorial Susan Shu Chang - Elastic  
7 Reading week
(No Class/Tutorial)
       
8 Model training - Pre-training, SFT Hoffmann et al. — Training Compute-Optimal Large Language Models (2022)

Grattafiori et al. — The Llama 3 Herd of Models (2024)

Raffel et al. — Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019)

Goyal et al. — Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (2018)

Chowdhery et al. — PaLM: Scaling Language Modeling with Pathways (2022)

Micikevicius et al. — Mixed Precision Training (2018)

NVIDIA — Pretraining Large Language Models with NVFP4 (2025)

Wang et al. — Text Embeddings by Weakly-Supervised Contrastive Pre-training (2024)

Vera et al. — EmbeddingGemma: Powerful and Lightweight Text Representations (2025)

Schroff et al. — FaceNet: A unified embedding for face recognition and clustering (2015)

Mattson et al. — MLPerf Training Benchmark (2020)

Allal et al. — SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model (2025)
Nanochat Tutorial Carlos Arguelles - Amazon

Cameron R. Wolfe - Netflix
 
9 Model training - Reinforcement learning and Prompting, PPO, GRPO, RLHF Schulman et al. — Proximal Policy Optimization Algorithms (2017)

Ziegler et al. — Fine-Tuning Language Models from Human Preferences (2019)

Ouyang et al. — Training language models to follow instructions with human feedback (InstructGPT) (2022)

Rafailov et al. — Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023)

Shao et al. — DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024)

Guo et al. — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025)

Fu et al. — AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning (2025)

Kimi Team — Kimi K2: Open Agentic Intelligence (2025)

Piché et al. — PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation (2025)

Khatri et al. — The Art of Scaling Reinforcement Learning Compute for LLMs (2025)

Ling Team — Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model (2025)

Alibaba — Qwen 3.5: Towards Native Multimodal Agents (2026)

Abouzaid et al. — First Proof (2026)

Sebastian Raschka — LLM Research Papers 2025 List (2025)
  Will Brown - Prime Intellect  
10 Scalability of ML Systems AWS — The Difference Between ETL and ELT

Crunchy Data — Parquet and Postgres in the Data Lake

Bengio et al. — Curriculum Learning (2009)

Chen et al. — Data-Juicer: A One-Stop Data Processing System for Large Language Models (2024)

Liu et al. — RegMix: Data Mixture as Regression for Language Model Pre-training (2024)

Allal et al. — SmolLM2: Scorey, Sassy, and Smol (2024)

Penedo et al. — The FineWeb Datasets: Decanting the Web for the Finest 15T Tokens (2024)

Cormack et al. — Reciprocal Rank Fusion out-performs Condorcet and individual Rank Learning Methods (2009)

Thakur et al. — BEIR: A Heterogeneous Benchmark for Information Retrieval (2021)

Muennighoff et al. — MTEB: Massive Text Embedding Benchmark (2023)

Enevoldsen et al. — MTEB 2: The Next Generation of Text Embedding Evaluation (2025)
  Tutorial on MoE and Expert Parallelism Marcel Kornacker - Pixeltable

Tyler Han - Voiceflow
11 Model serving deep dive 2 - Speculative decoding & KV caching, Quantization and CUDA Sebastian Raschka — Coding the KV Cache in LLMs (2024)

Xiao et al. — SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022)

NVIDIA — Mastering LLM Techniques: Inference Optimization (2023)

Kwon et al. — Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) (2023)

Zhong et al. — DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving (2024)

Hu et al. — Inference without Interference: Disaggregate LLM Serving for Higher Throughput and Lower Latency (2024)

Leviathan et al. — Fast Inference from Transformers via Speculative Decoding (2023)

Cai et al. — Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (2024)

Bachmann et al. — Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment (2025)

Li et al. — EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (2025)

OpenAI — Introducing gpt-oss: Open-weight Reasoning Models (2025)

Dettmers et al. — LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022)

Maarten Grootendorst — A Visual Guide to Quantization (2024)

NVIDIA — Optimizing LLMs for Performance and Accuracy with Post-Training Quantization (2023)

NVIDIA — Pretraining Large Language Models with NVFP4 (2025)

Liu et al. — LLM-QAT: Data-Free Quantization Aware Training for Large Language Models (2023)

Dettmers et al. — QLoRA: Efficient Finetuning of Quantized LLMs (2023)

NVIDIA — Blackwell InferenceMax Benchmark Results (2025)

NVIDIA — CUDA C Programming Guide

Modal — GPU Glossary

NVIDIA — NVIDIA Hopper Architecture Whitepaper (2022)

NASA — Basics on NVIDIA GPU Hardware Architecture
Speculative Decoding Tutorial Adil Asif - Nvidia

Chris Smith - AMD
 
12 Search and Recommender systems     Devansh Tandon - Meta  
13 Final Presentations        

Assignments

Assignment # Out Due
Assignment 1 — 10% Jan 5 Jan 23 - 10:59pm
Assignment 2 — 10% Jan 24 Feb 20 - 10:59pm
Assignment 3 — 10% Feb 14 Mar 6 - 10:59pm
Assignment 4 — 10% Feb 22 Mar 13 - 10:59pm
Assignment 5 — 10% Mar 14 Mar 27 - 10:59pm
Project Feb 22 Mar 30, Apr 1 - 6pm