Lecture 1: Logistics and introduction
Definition:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- Experience E (data): games played by the program or human
- Performance measure P: winning rate
- Task T: to win
Taxonomy of Machine Learning (A Simplistic View)
1. Supervised Learning
- Core idea: Learns from labeled data.
- Example Tasks:
- Regression: The prediction result is a continuous variable.
- e.g., price prediction
- (x = Area) → (y = Price)?
- Classification: The prediction result is a discrete variable.
- e.g., type prediction
- (x = Area, y = Price) → (z = Type)?
2. Unsupervised Learning
- Core idea: Learns from unlabeled data.
- Example Tasks:
- Clustering:
- Given a dataset containing n samples:
  x⁽¹⁾, x⁽²⁾, x⁽³⁾, ..., x⁽ⁿ⁾ (no labels)
- Task (vague): find interesting structures in the data.
3. Semi-supervised Learning
- Core idea: Learns from a mix of labeled and unlabeled data.
4. Reinforcement Learning
- Core idea: Learns from environment feedback (rewards/penalties).
- Example Task: Multi-armed bandit problem (MAB)
- It involves a feedback loop between an Agent and an Environment:
- The Agent takes an Action (e.g., pull an arm).
- The Environment returns Feedback (e.g., a corresponding reward).
- The agent learns from this feedback to make better decisions in the future.
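To make the Agent–Environment loop concrete, here is a minimal Python sketch of one simple bandit strategy (ε-greedy, not necessarily the strategy covered in the lecture); the arm probabilities, ε, and number of rounds are illustrative assumptions:

```python
import random

# Hypothetical setup: 3 arms with unknown Bernoulli reward probabilities.
true_probs = [0.2, 0.5, 0.7]   # assumed for the simulation; the agent never sees these
n_arms, epsilon, n_rounds = len(true_probs), 0.1, 10_000

counts = [0] * n_arms          # how many times each arm was pulled
values = [0.0] * n_arms        # running average reward per arm

for _ in range(n_rounds):
    # Action: explore with probability epsilon, otherwise exploit the best estimate.
    if random.random() < epsilon:
        arm = random.randrange(n_arms)
    else:
        arm = max(range(n_arms), key=lambda a: values[a])
    # Feedback: the environment returns a 0/1 reward for the chosen arm.
    reward = 1 if random.random() < true_probs[arm] else 0
    # Learning: update the running average reward estimate for that arm.
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print("estimated arm values:", [round(v, 2) for v in values])
```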
Learning Modes (When do we collect data?)
- Offline learning: The model is trained on a static dataset before deployment.
- Online learning: The model is trained incrementally as new data becomes available.
Mathematical Tools
Parameter Estimation: Maximum Likelihood Estimation (MLE)
A foundational method for estimating the parameters of a statistical model from data.
- Core Principle: Find the parameter values $\theta$ that maximize the likelihood function. In other words, we find the parameters that make the observed data most probable.
- Example: MLE for Bernoulli Distribution
  - Scenario: We have a dataset $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ drawn from a Bernoulli distribution (e.g., coin flips), where each $x^{(i)}$ is 1 (heads) or 0 (tails).
  - Goal: Estimate the probability of heads, $\theta$.
  - Derivation Steps:
    - Likelihood Function: Assuming data points are independent and identically distributed (i.i.d.), the likelihood of observing the entire dataset is the product of individual probabilities:
      $L(\theta) = \prod_{i=1}^{n} p(x^{(i)}; \theta)$
    - Log-Likelihood Function: To simplify calculation (turning products into sums) and for numerical stability, we maximize the log-likelihood, which is equivalent:
      $\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x^{(i)}; \theta)$
    - Substitute Bernoulli PMF: The probability mass function for a single $x^{(i)}$ can be written as $p(x^{(i)}; \theta) = \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}}$. Its logarithm is $x^{(i)} \log \theta + (1 - x^{(i)}) \log(1-\theta)$. The log-likelihood becomes:
      $\ell(\theta) = \sum_{i=1}^{n} \left[ x^{(i)} \log \theta + (1 - x^{(i)}) \log(1-\theta) \right]$
    - Simplify: Let $n_1$ be the total count of 1s (heads). The total count of 0s is $n - n_1$. The expression simplifies to:
      $\ell(\theta) = n_1 \log \theta + (n - n_1) \log(1-\theta)$
    - Result: To find the maximum, we take the derivative of this final expression with respect to $\theta$, set it to 0, and solve. The resulting estimate is highly intuitive:
      $\hat{\theta}_{\mathrm{MLE}} = \frac{n_1}{n}$ (the proportion of 1s in the data)
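A quick numerical check of this result, sketched in Python (the simulated dataset and the "true" θ are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                               # assumed true probability of heads
x = rng.binomial(1, theta_true, size=1000)     # simulated i.i.d. Bernoulli dataset

# MLE for a Bernoulli parameter: the proportion of 1s in the data, theta_hat = n1 / n.
theta_hat = x.mean()
print(f"theta_hat = {theta_hat:.3f}")          # should be close to 0.7
```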
Lecture 3: Supervised Learning: Regression and Classification II
Ridge Regression: Study Notes
1. Core Idea & Philosophy
Ridge Regression is an enhancement of Ordinary Least Squares (OLS). Its core idea is to address the problem of overfitting by sacrificing a small amount of bias to achieve a significant reduction in model variance.
Analogy:
- OLS is like a "novice detective" who tries to create a complex theory that perfectly explains 100% of the current evidence. This theory is often fragile and performs poorly on new cases (high variance).
- Ridge Regression is like a "veteran detective" who knows that evidence contains noise and coincidences. He seeks a simpler, more general theory that might not explain every tiny detail perfectly but is more robust and predictive for new cases (low variance).
2. Motivation: Why is Ridge Regression Necessary?
Ridge Regression is primarily designed to solve critical failures of OLS that occur in a specific, common scenario.
The Problem Scenario: High-Dimension, Low-Sample Data (d >> m)
- The number of features $d$ is much larger than the number of samples $m$.
- This is common in modern domains such as genetics, finance, and text analysis.
Problems for OLS in this Scenario:
A. Conceptual Level: Severe Overfitting
- With more features than samples, the model has enough flexibility to "memorize" the training data, including its noise and random fluctuations.
- This results in learned model weights ($w$) that are absurdly large, assigning huge importance to potentially irrelevant features.
- The model performs perfectly on training data but fails to generalize to unseen test data.
B. Mathematical Level: The Matrix $X^\top X$ Is Not Invertible
- The analytical solution for OLS is: $w = (X^\top X)^{-1} X^\top y$.
- This solution requires the matrix $X^\top X$ to be invertible.
- $X^\top X$ is often non-invertible or ill-conditioned (nearly non-invertible) in two cases:
  - When $m < d$ (Fewer Samples than Features): This is the main reason. By linear algebra, $\mathrm{rank}(X^\top X) = \mathrm{rank}(X) \le \min(m, d) = m$. If $m < d$, the rank of $X^\top X$ is less than its dimension ($d \times d$), meaning it is not full rank and is therefore guaranteed to be singular (non-invertible).
  - Multicollinearity: When features are highly correlated (e.g., including both "house size in sq. meters" and "house size in sq. feet"). This makes the columns of $X$ linearly dependent, which in turn makes $X^\top X$ singular.
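The rank argument above can be checked numerically; a minimal sketch, with the dimensions as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10, 50                           # fewer samples than features (m < d)
X = rng.standard_normal((m, d))

gram = X.T @ X                          # the d x d matrix X^T X
print(np.linalg.matrix_rank(gram))      # 10: rank(X^T X) = rank(X) <= m, far below d
print(np.linalg.eigvalsh(gram).min())   # ~0: zero eigenvalues, so X^T X is singular
```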
3. The Solution: How Ridge Regression Works
Ridge Regression introduces a penalty term into the objective function to constrain the size of the model's weights.
Step 1: Modify the Objective Function
- Ordinary Least Squares (OLS) Objective:
  - Minimize the Sum of Squared Errors (SSE): $\min_w \|y - Xw\|^2$
- Ridge Regression Objective:
  - Minimize SSE + λ · Penalty on Model Complexity: $\min_w \|y - Xw\|^2 + \lambda \|w\|_2^2$
- Dissecting the Penalty Term:
  - $\|w\|_2^2$: This is the squared L2-norm of the weight vector. It is the sum of the squares of all weights. A large value implies a complex model with large weights.
  - $\lambda$ (Lambda): The regularization parameter. This is a hyperparameter that we set to control the strength of the penalty.
    - Large $\lambda$: Strong penalty. The model is forced to shrink the weights towards zero to avoid a large penalty.
    - Small $\lambda$: Weak penalty. The model behaves more like OLS.
    - $\lambda = 0$: No penalty. Ridge Regression becomes identical to OLS.
Step 2: Derive the New Analytical Solution
By taking the derivative of the new objective function with respect to $w$, setting it to zero, and solving, we obtain:
  $w_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$
where $I$ is the Identity Matrix.
- Compared to the OLS solution, the only difference is the addition of the $\lambda I$ term. This single term is what solves the non-invertibility problem.
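A minimal numpy sketch of this closed form under an assumed m < d setup (λ, the dimensions, and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 10, 50, 1.0
X = rng.standard_normal((m, d))
y = rng.standard_normal(m)

# OLS would need (X^T X)^{-1}, which does not exist here because X^T X is singular when m < d.
# Ridge: (X^T X + lambda * I)^{-1} X^T y is well defined for any lambda > 0.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge.shape)             # (50,)
print(np.abs(w_ridge).max())     # weights stay moderate thanks to the penalty
```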
4. The Core Insight: Why is $\lambda I$ the "Special Medicine"?
This term works by altering the eigenvalues of the matrix $X^\top X$ to guarantee its invertibility.
- Invertibility and Eigenvalues: A matrix is singular (non-invertible) if and only if it has at least one eigenvalue that is 0.
- Eigenvalues of $X^\top X$: $X^\top X$ is a positive semi-definite matrix, which means all its eigenvalues are non-negative ($\mu_i \ge 0$). When it's singular, it has at least one eigenvalue $\mu_i = 0$.
- The Effect of $\lambda I$: When we add $\lambda I$ to $X^\top X$, the eigenvalues of the new matrix become $\mu_i + \lambda$.
- The Result:
  - We choose $\lambda$ to be a small positive number ($\lambda > 0$).
  - The original eigenvalues were $\mu_i \ge 0$.
  - The new eigenvalues are $\mu_i + \lambda$.
  - Therefore, all new eigenvalues are strictly positive ($\mu_i + \lambda > 0$).
  - A matrix whose eigenvalues are all greater than zero is guaranteed to be invertible.
Conclusion: The $\lambda I$ term shifts every eigenvalue of $X^\top X$ up by $\lambda$, so $X^\top X + \lambda I$ has no zero eigenvalues and is always invertible.
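A small numerical check of the eigenvalue argument (same assumed dimensions as the earlier sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 10, 50, 1.0
X = rng.standard_normal((m, d))

eig_before = np.linalg.eigvalsh(X.T @ X)                    # many eigenvalues are ~0
eig_after = np.linalg.eigvalsh(X.T @ X + lam * np.eye(d))   # every eigenvalue shifted up by lambda
print(eig_before.min())   # ~0   -> singular
print(eig_after.min())    # ~1.0 -> strictly positive, so invertible
```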
5. Summary & Comparison
Aspect | Ordinary Least Squares (OLS) | Ridge Regression |
---|---|---|
Objective Function | Minimize SSE | Minimize [SSE + λ * squared L2-norm of weights] |
Model Complexity | Unconstrained, weights can be very large | Constrained by L2 penalty, forcing weights to be smaller |
Handling $m < d$ | Fails: $X^\top X$ is singular, no unique solution | Works: $X^\top X + \lambda I$ is always invertible |
Analytical Solution | $w = (X^\top X)^{-1} X^\top y$ | $w = (X^\top X + \lambda I)^{-1} X^\top y$ |
Key Property | Unbiased estimate, but can have very high variance | Biased estimate, but with significantly lower variance |
Best Use Case | Low-dimensional data, no multicollinearity | High-dimensional data (esp. $d \gg m$), multicollinearity |
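For completeness, a hedged usage sketch with scikit-learn's Ridge estimator, where the `alpha` parameter plays the role of λ (the synthetic data and the alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 50))                  # m = 10 samples, d = 50 features
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(10)

model = Ridge(alpha=1.0)                           # alpha is the regularization strength (lambda)
model.fit(X, y)
print(np.abs(model.coef_).max())                   # coefficients stay small despite d >> m
```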