Lecture 1: Logistics and introduction

Definition:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. For example, in spam filtering: T is classifying emails as spam or not spam, P is classification accuracy, and E is a corpus of emails labeled by users.

Taxonomy of Machine Learning (A Simplistic View)

1. Supervised Learning

2. Unsupervised Learning

3. Semi-supervised Learning

4. Reinforcement Learning

Learning Modes (When do we collect data?)

Mathematical Tools

Mathematics for AI

Mathematics for AI (Chinese)

Parameter Estimation: Maximum Likelihood Estimation (MLE)

A foundational method for estimating the parameters of a statistical model from data.
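As a minimal illustration (using the standard Gaussian example rather than anything specific to this course), the maximum-likelihood estimates of a Gaussian's mean and variance have closed forms:

```python
import numpy as np

# MLE for a Gaussian N(mu, sigma^2): choose the parameters that maximize
# the log-likelihood sum_i log p(x_i; mu, sigma^2) of the observed data.
x = np.array([2.1, 1.9, 2.4, 2.0, 1.6])  # toy sample (illustrative values)

mu_hat = x.mean()                      # MLE of the mean: the sample mean
var_hat = ((x - mu_hat) ** 2).mean()   # MLE of the variance: divides by n (biased)
print(mu_hat, var_hat)
```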

Lecture 3: Supervised Learning: Regression and Classification II

Ridge Regression: Study Notes

1. Core Idea & Philosophy

Ridge Regression is an enhancement of Ordinary Least Squares (OLS). Its core idea is to address the problem of overfitting by sacrificing a small amount of bias to achieve a significant reduction in model variance.

Analogy:


2. Motivation: Why is Ridge Regression Necessary?

Ridge Regression is primarily designed to solve critical failures of OLS that occur in a specific, common scenario.

The Problem Scenario: High-Dimension, Low-Sample Data (d >> m, where d is the number of features and m the number of training samples)

Problems for OLS in this Scenario:

A. Conceptual Level: Severe Overfitting
With more parameters than training samples, OLS has enough freedom to fit the training data perfectly (zero training error) while generalizing poorly to new data.

B. Mathematical Level: The Matrix XᵀX is Non-Invertible (Singular)
When d > m, rank(XᵀX) = rank(X) ≤ m < d, so the d×d matrix XᵀX cannot have full rank; the OLS solution (XᵀX)⁻¹Xᵀy therefore does not exist.


3. The Solution: How Ridge Regression Works

Ridge Regression introduces a penalty term into the objective function to constrain the size of the model's weights.

Step 1: Modify the Objective Function
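Concretely (matching the "SSE + λ · squared L2-norm" form in the summary table below), the ridge objective is:

L(w) = (y − Xw)ᵀ(y − Xw) + λwᵀw,  with λ > 0

The first term is the OLS sum of squared errors (SSE); the second term penalizes large weights.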

Step 2: Derive the New Analytical Solution

By taking the derivative of the new objective function with respect to w and setting it to zero (∇L = −2Xᵀ(y − Xw) + 2λw = 0, i.e., (XᵀX + λI)w = Xᵀy), we get the analytical solution for Ridge Regression:

w*_ridge = (XᵀX + λI)⁻¹Xᵀy
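A minimal numerical sketch of this closed form (the function name and toy data are illustrative; assumes only numpy):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve w = (X^T X + lam*I)^(-1) X^T y without forming an explicit inverse."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)        # adding lam*I makes A invertible for lam > 0
    return np.linalg.solve(A, X.T @ y)   # solve() is more stable than inv(A) @ ...

# A d >> m case where plain OLS would fail (X^T X is singular):
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # m = 5 samples, d = 20 features
y = rng.normal(size=5)
w = ridge_closed_form(X, y, lam=0.1)
```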


4. The Core Insight: Why +λI is the "Special Medicine"

This term works by altering the eigenvalues of the matrix to guarantee its invertibility.

  1. Invertibility and Eigenvalues: A matrix is singular (non-invertible) if and only if it has at least one eigenvalue that is 0.

  2. Eigenvalues of XᵀX: XᵀX is a positive semi-definite matrix, which means all its eigenvalues μᵢ are non-negative (μᵢ ≥ 0). When it is singular, it has at least one eigenvalue μᵢ = 0.

  3. The Effect of +λI: When we add λI to XᵀX, the eigenvalues of the new matrix (XᵀX + λI) become (μᵢ + λ): if XᵀXv = μᵢv, then (XᵀX + λI)v = (μᵢ + λ)v, so the eigenvectors are unchanged and every eigenvalue shifts up by λ.

  4. The Result:

    • We choose λ to be a small positive number (λ > 0).
    • The original eigenvalues were μᵢ ≥ 0.
    • The new eigenvalues are μᵢ + λ.
    • Therefore, all new eigenvalues are strictly positive (> 0).
    • A matrix whose eigenvalues are all greater than zero is guaranteed to be invertible.

Conclusion: The λI term acts as a "stabilizer" by shifting all eigenvalues of XᵀX up by a positive amount λ, ensuring that none are zero and thus making the matrix (XᵀX + λI) invertible.
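This eigenvalue shift is easy to verify numerically (a sketch under the same d >> m setup as above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # m = 5 < d = 20, so X^T X (20x20) has rank <= 5
G = X.T @ X
lam = 0.1

mu = np.linalg.eigvalsh(G)                            # eigenvalues mu_i >= 0
mu_shifted = np.linalg.eigvalsh(G + lam * np.eye(20))

print(mu.min())          # ~0 up to floating-point error: G is singular
print(mu_shifted.min())  # ~lam > 0: G + lam*I is invertible
```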


5. Summary & Comparison

| Aspect | Ordinary Least Squares (OLS) | Ridge Regression |
| --- | --- | --- |
| Objective Function | Minimize SSE | Minimize SSE + λ · (squared L2-norm of weights) |
| Model Complexity | Unconstrained; weights can be very large | Constrained by the L2 penalty, forcing weights to be smaller |
| Handling d > m | XᵀX is singular; no stable solution exists | (XᵀX + λI) is always invertible; provides a stable solution |
| Analytical Solution | (XᵀX)⁻¹Xᵀy | (XᵀX + λI)⁻¹Xᵀy |
| Key Property | Unbiased estimate, but can have very high variance | Biased estimate, but with significantly lower variance |
| Best Use Case | Low-dimensional data with no multicollinearity | High-dimensional data (esp. d > m) or multicollinearity |

6. How to Choose λ?

λ is a critical hyperparameter that controls the trade-off between bias and variance. It is not learned from the training data. The optimal value of λ is typically found using Cross-Validation, a technique that evaluates the model's performance on unseen data for different λ values and selects the one that generalizes best.
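As one concrete way to do this (an illustrative sketch using scikit-learn, where λ is called alpha):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Toy data (illustrative): 50 samples, 10 features, a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

# Search lambda over several orders of magnitude with 5-fold cross-validation
model = RidgeCV(alphas=np.logspace(-4, 4, 17), cv=5).fit(X, y)
print(model.alpha_)   # the lambda value that generalized best across folds
```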