Lecture 1: Logistics and introduction

Definition:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Taxonomy of Machine Learning (A Simplistic View)

1. Supervised Learning

2. Unsupervised Learning

3. Semi-supervised Learning

4. Reinforcement Learning

Learning Modes (When do we collect data?)

Mathematical Tools

Mathematics for AI

Mathematics for AI (Chinese)

Parameter Estimation: Maximum Likelihood Estimation (MLE)

A foundational method for estimating the parameters of a statistical model from data.
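As an illustrative sketch (not from the lecture), here is MLE for a univariate Gaussian: maximizing the log-likelihood over the mean and variance yields the sample mean and the biased sample variance in closed form.

```python
import numpy as np

# Minimal MLE sketch for a univariate Gaussian N(mu, sigma^2).
# For i.i.d. samples x_1..x_n, maximizing sum_i log N(x_i | mu, sigma^2)
# gives the closed-form estimates below: the sample mean and the biased
# sample variance (dividing by n rather than n - 1).

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic data (illustrative)

mu_mle = x.mean()                         # argmax over mu
sigma2_mle = ((x - mu_mle) ** 2).mean()   # argmax over sigma^2

print(f"MLE mean: {mu_mle:.3f}, MLE variance: {sigma2_mle:.3f}")
```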

Lecture 3: Supervised Learning: Regression and Classification II

Ridge Regression: Study Notes

1. Core Idea & Philosophy

Ridge Regression is an enhancement of Ordinary Least Squares (OLS). Its core idea is to address the problem of overfitting by sacrificing a small amount of bias to achieve a significant reduction in model variance.

Analogy:


2. Motivation: Why is Ridge Regression Necessary?

Ridge Regression is primarily designed to address critical failures of OLS that arise in a specific but common scenario.

The Problem Scenario: High-Dimensional, Low-Sample Data (d >> m)

Problems for OLS in this Scenario:

A. Conceptual Level: Severe Overfitting

B. Mathematical Level: The Matrix XᵀX is Non-Invertible (Singular)
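A tiny numpy demonstration of the rank argument (toy sizes assumed): when m < d, the m × d matrix X has rank at most m, so the d × d matrix XᵀX has rank at most m < d and cannot be inverted.

```python
import numpy as np

# Assumed toy sizes: m = 5 samples, d = 20 features (d >> m).
m, d = 5, 20
rng = np.random.default_rng(0)
X = rng.normal(size=(m, d))

XtX = X.T @ X                        # d x d matrix
print(np.linalg.matrix_rank(XtX))    # at most m = 5, far below d = 20
# Inverting XtX is ill-defined here: numerically, np.linalg.inv(XtX)
# either raises LinAlgError or returns a meaningless result.
```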


3. The Solution: How Ridge Regression Works

Ridge Regression introduces a penalty term into the objective function to constrain the size of the model's weights.

Step 1: Modify the Objective Function
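The notes do not write out the modified objective at this step; a standard formulation, consistent with the comparison table below and with the solution derived in Step 2 (note the penalty is the squared L2-norm of w), is:

```latex
\min_{w}\; J(w) \;=\; \underbrace{\lVert y - Xw \rVert_2^2}_{\text{SSE}} \;+\; \lambda \lVert w \rVert_2^2, \qquad \lambda > 0
```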

Step 2: Derive the New Analytical Solution

By taking the derivative of the new objective function with respect to w and setting it to zero, we get the analytical solution for Ridge Regression:

w_ridge = (XᵀX + λI)⁻¹Xᵀy
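A minimal numpy sketch of this closed form (the data sizes and λ below are assumptions for illustration):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """w_ridge = (X^T X + lam * I)^(-1) X^T y  -- the Ridge analytical solution."""
    d = X.shape[1]
    # Solving the linear system is preferred over forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Assumed toy data: m = 5 samples, d = 20 features, so plain OLS would fail.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)

w = ridge_closed_form(X, y, lam=0.1)
print(w.shape)   # (20,) -- a unique, stable solution despite d >> m
```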


4. The Core Insight: Why +λI is the "Special Medicine"

This term works by altering the eigenvalues of the matrix to guarantee its invertibility.

  1. Invertibility and Eigenvalues: A matrix is singular (non-invertible) if and only if it has at least one eigenvalue that is 0.

  2. Eigenvalues of XᵀX: XᵀX is a positive semi-definite matrix, which means all its eigenvalues μ are non-negative (μ ≥ 0). When it's singular, it has at least one eigenvalue μ = 0.

  3. The Effect of +λI: When we add λI to XᵀX, the eigenvalues of the new matrix (XᵀX + λI) become (μ + λ).

  4. The Result:

    • We choose λ to be a small positive number (λ>0).
    • The original eigenvalues satisfied μ ≥ 0.
    • The new eigenvalues are μ+λ.
    • Therefore, all new eigenvalues are strictly positive (>0).
    • A matrix whose eigenvalues are all greater than zero is guaranteed to be invertible.

Conclusion: The λI term acts as a "stabilizer" by shifting all eigenvalues of XᵀX up by a positive amount λ, ensuring that none are zero and thus making the matrix (XᵀX + λI) invertible.
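A quick numpy check of this argument (toy sizes assumed): when d > m the smallest eigenvalue of XᵀX is 0 up to floating-point error, while every eigenvalue of XᵀX + λI is at least λ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))     # m = 5 < d = 20, so X^T X is singular
lam = 0.1

eig_XtX = np.linalg.eigvalsh(X.T @ X)                          # eigenvalues mu >= 0, some ~0
eig_shifted = np.linalg.eigvalsh(X.T @ X + lam * np.eye(20))   # eigenvalues mu + lam

print(eig_XtX.min())      # ~0  (singular)
print(eig_shifted.min())  # >= lam = 0.1  (invertible)
```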


5. Summary & Comparison

| Aspect | Ordinary Least Squares (OLS) | Ridge Regression |
| --- | --- | --- |
| Objective Function | Minimize SSE | Minimize SSE + λ · (squared L2-norm of the weights) |
| Model Complexity | Unconstrained; weights can become very large | Constrained by the L2 penalty, forcing weights to be smaller |
| Handling d > m | XᵀX is singular; no stable solution exists | XᵀX + λI is always invertible; provides a stable solution |
| Analytical Solution | (XᵀX)⁻¹Xᵀy | (XᵀX + λI)⁻¹Xᵀy |
| Key Property | Unbiased estimate, but can have very high variance | Biased estimate, but with significantly lower variance |
| Best Use Case | Low-dimensional data, no multicollinearity | High-dimensional data (esp. d > m) or when multicollinearity is present |

6. How to Choose λ?

λ is a critical hyperparameter that controls the trade-off between bias and variance. It is not learned from the training data. The optimal value of λ is typically found using Cross-Validation, a technique that evaluates the model's performance on unseen data for different λ values and selects the one that generalizes best.
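A sketch of this selection using scikit-learn's RidgeCV (the candidate λ values and the 5-fold setting are illustrative assumptions; note scikit-learn calls the regularization strength alpha):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Assumed synthetic data just to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=100)

# Try several candidate lambdas (alphas) and keep the one that
# generalizes best under 5-fold cross-validation.
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5).fit(X, y)
print("chosen lambda:", model.alpha_)
```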

Lecture 5: Supervised Learning: Regression and Classification IV

Gradient Descent (GD): Motivation

Gradient Descent (GD): Fundamentals

Adagrad (Adaptive Gradient Algorithm)

Momentum-based GD

Nesterov Accelerated Gradient (NAG)
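The notes list these optimizers only by name; as a rough sketch of the textbook update rules (the learning rate, momentum coefficient, and ε below are arbitrary illustrative choices):

```python
import numpy as np

def gd_step(w, grad, lr=0.1):
    # Vanilla gradient descent: w <- w - lr * grad
    return w - lr * grad

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    # Momentum: accumulate a velocity vector, then move along it.
    v = beta * v - lr * grad
    return w + v, v

def adagrad_step(w, g2, grad, lr=0.1, eps=1e-8):
    # Adagrad: per-coordinate step sizes shrink as squared gradients accumulate.
    g2 = g2 + grad ** 2
    return w - lr * grad / (np.sqrt(g2) + eps), g2
```

NAG differs from plain momentum only in that the gradient is evaluated at the look-ahead point w + βv rather than at w.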


Margin of a Classifier

Linearly Separable

γ-Linearly Separable

Geometric Margin

Maximum-Margin Linear Classifier

Primal-SVM: Proof of Uniqueness of the Solution

Soft Margin SVM