Lecture 1: Logistics and introduction

Definition:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Experience E (data): games played by the program or human
Performance measure P: winning rate
Task T: to win

Taxonomy of Machine Learning (A Simplistic View)

1. Supervised Learning

Core idea: Learns from labeled data.
Example Tasks:
- Regression: The prediction result is a continuous variable.
  - e.g., price prediction
  - (x = Area) → (y = Price)?
- Classification: The prediction result is a discrete variable.
  - e.g., type prediction
  - (x = Area, y = Price) → (z = Type)?

2. Unsupervised Learning

Core idea: Learns from unlabeled data.
Example Tasks:
- Clustering:
  - Given a dataset containing n samples:
    (x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), (x⁽³⁾, y⁽³⁾), ..., (x⁽ⁿ⁾, y⁽ⁿ⁾)
  - Task (vague): find interesting structures in the data.

3. Semi-supervised Learning

Core idea: Learns from a mix of labeled and unlabeled data.

4. Reinforcement Learning

Core idea: Learns from environment feedback (rewards/penalties).
Example Task: Multi-armed bandit problem (MAB)
- it involves a feedback loop between an Agent and an Environment:
  1. The Agent takes an Action (e.g., pull an arm).
  2. The Environment returns Feedback (e.g., a corresponding reward).
  3. The agent learns from this feedback to make better decisions in the future.

Learning Modes (When do we collect data?)

Offline learning: The model is trained on a static dataset before deployment.
Online learning: The model is trained incrementally as new data becomes available.

Mathematical Tools

Mathematics for Ai

Mathematics for Ai (Chinese)

Parameter Estimation: Maximum Likelihood Estimation (MLE)

A foundational method for estimating the parameters of a statistical model from data.

Core Principle: Find the parameter values ( $θ$ ) that maximize the likelihood function. in other words, we find the parameters that make the observed data most probable.
$\hat{θ_{M L}} = \arg max_{θ} P (D a t a | θ)$
Example: MLE for Bernoulli Distribution
- Scenario: We have a dataset $D = X ₁, . . ., X ₘ$ from a Bernoulli distribution (e.g., $m$ coin flips), where $X_{i}$ is 1 (heads) or 0 (tails).
- Goal: Estimate the probability of heads, $θ = P (X = 1)$ .
- Derivation Steps:
  1. Likelihood Function: Assuming data points are independent and identically distributed (i.i.d.), the likelihood of observing the entire dataset is the product of individual probabilities:
    $L (θ) = \prod_{i} p (X_{i}; θ)$
  2. Log-Likelihood Function: To simplify calculation (turning products into sums) and for numerical stability, we maximize the log-likelihood, which is equivalent.
    $\log L (θ) = \sum_{i} \log p (X_{i}; θ)$
  3. Substitute Bernoulli PMF: The probability mass function for a single $X i$ can be written as $p (X_{i}; θ) = θ (X_{i}) * (1 - θ) (1 - X_{i})$ . its logarithm is $\log p (X i; θ) = X i \log (θ) + (1 - X_{i}) \log (1 - θ)$ . The log-likelihood becomes:
    $\log L (θ) = \sum_{i} [X_{i} l o g (θ) + (1 - X_{i}) l o g (1 - θ)]$
  4. Simplify: Let $M ₁ = \sum_{i} X_{i}$ be the total count of 1s (heads). The total count of 0s is $m - M ₁$ . The expression simplifies to:
    $\log L (θ) = M ₁ \log (θ) + (m - M ₁) \log (1 - θ)$
- Result: To find the maximum, we take the derivative of this final expression with respect to $θ$ , set it to 0, and solve. The resulting estimate is highly intuitive:
  $\hat{θ_{M L}} = \frac{M_{1}}{m}$ (The proportion of 1s in the data)

Lecture 3: Supervised Learning: Regression and Classification II

Ridge Regression: Study Notes

1. Core Idea & Philosophy

Ridge Regression is an enhancement of Ordinary Least Squares (OLS). Its core idea is to address the problem of overfitting by sacrificing a small amount of bias to achieve a significant reduction in model variance.

Analogy:

OLS is like a "novice detective" who tries to create a complex theory that perfectly explains 100% of the current evidence. This theory is often fragile and performs poorly on new cases (high variance).
Ridge Regression is like a "veteran detective" who knows that evidence contains noise and coincidences. He seeks a simpler, more general theory that might not explain every tiny detail perfectly but is more robust and predictive for new cases (low variance).

2. Motivation: Why is Ridge Regression Necessary?

Ridge Regression is primarily designed to solve critical failures of OLS that occur in a specific, common scenario.

The Problem Scenario: High-Dimension, Low-Sample Data (d >> m)

The number of features $d$ is much larger than the number of samples $m$ .
This is common in modern datasets like genetics, finance, and text analysis.

Problems for OLS in this Scenario:

A. Conceptual Level: Severe Overfitting

With more features than samples, the model has enough flexibility to "memorize" the training data, including its noise and random fluctuations.
This results in learned model weights ( $w$ ) that are absurdly large, assigning huge importance to potentially irrelevant features.
The model performs perfectly on training data but fails to generalize to unseen test data.

B. Mathematical Level: The Matrix $X ᵀ X$ is Non-Invertible (Singular)

The analytical solution for OLS is: $w * = (X ᵀ X) ⁻ ¹ X ᵀ y$ .
This solution requires the matrix $X ᵀ X$ to be invertible.
$X ᵀ X$ is often non-invertible or ill-conditioned (nearly non-invertible) in two cases:
1. When $m < d + 1$ (Fewer Samples than Features): This is the main reason. By linear algebra, $r a n k (X ᵀ X) = r a n k (X) \leq m i n (m, d + 1)$ . If $m < d + 1$ , the rank of $X ᵀ X$ is less than its dimension ( $d + 1$ ), meaning it is not full rank and is therefore guaranteed to be singular (non-invertible).
2. Multicollinearity: When features are highly correlated (e.g., including both "house size in sq. meters" and "house size in sq. feet"). This makes the columns of $X$ linearly dependent, which in turn makes $X ᵀ X$ singular.

3. The Solution: How Ridge Regression Works

Ridge Regression introduces a penalty term into the objective function to constrain the size of the model's weights.

Step 1: Modify the Objective Function

Ordinary Least Squares (OLS) Objective:
- Minimize the Sum of Squared Errors (SSE)
- $J_{O L S} (w) = | | y - X w | | ₂ ²$
Ridge Regression Objective:
- Minimize SSE + λ * Penalty on Model Complexity
- $J_{R i d g e} (w) = | | y - X w | | ₂ ² + λ | | w | | ₂ ²$
Dissecting the Penalty Term:
- $| | w | | ₂ ² = Σ (w ᵢ) ²$ : This is the squared L2-Norm of the weight vector. It is the sum of the squares of all weights. A large value implies a complex model with large weights.
- $λ$ (Lambda): The regularization parameter. This is a hyperparameter that we set to control the strength of the penalty.
  - Large $λ$ : Strong penalty. The model is forced to shrink the weights towards zero to avoid a large penalty.
  - Small $λ$ : Weak penalty. The model behaves more like OLS.
  - $λ = 0$ : No penalty. Ridge Regression becomes identical to OLS.

Step 2: Derive the New Analytical Solution

By taking the derivative of the new objective function with respect to $w$ and setting it to zero, we get the analytical solution for Ridge Regression:

$w_{r i d g e}^{*} = (X ᵀ X + λ I) ⁻ ¹ X ᵀ y$

$I$ is the Identity Matrix.
Compared to the OLS solution, the only difference is the addition of the $+ λ I$ term. This single term is what solves the non-invertibility problem.

4. The Core Insight: Why $+ λ I$ is the "Special Medicine"

This term works by altering the eigenvalues of the matrix to guarantee its invertibility.

Invertibility and Eigenvalues: A matrix is singular (non-invertible) if and only if it has at least one eigenvalue that is 0.
Eigenvalues of $X ᵀ X$ : $X ᵀ X$ is a positive semi-definite matrix, which means all its eigenvalues $μ ᵢ$ are non-negative ( $μ ᵢ \geq 0$ ). When it's singular, it has at least one eigenvalue $μ ᵢ = 0$ .
The Effect of $+ λ I$ : When we add $λ I$ to $X ᵀ X$ , the eigenvalues of the new matrix $(X ᵀ X + λ I)$ become $(μ ᵢ + λ)$ .
The Result:
- We choose $λ$ to be a small positive number ( $λ > 0$ ).
- The original eigenvalues were $μ ᵢ \geq 0$ .
- The new eigenvalues are $μ ᵢ + λ$ .
- Therefore, all new eigenvalues are strictly positive ( $> 0$ ).
- A matrix whose eigenvalues are all greater than zero is guaranteed to be invertible.

Conclusion: The $λ I$ term acts as a "stabilizer" by shifting all eigenvalues of $X ᵀ X$ up by a positive amount $λ$ , ensuring that none are zero and thus making the matrix $(X ᵀ X + λ I)$ invertible.

5. Summary & Comparison

Aspect	Ordinary Least Squares (OLS)	Ridge Regression
Objective Function	Minimize SSE	Minimize [SSE + λ * L2-Norm of weights]
Model Complexity	Unconstrained, weights can be very large	Constrained by L2 penalty, forcing weights to be smaller
Handling $d > m$	$X ᵀ X$ is singular; no stable solution exists	$(X ᵀ X + λ I)$ is always invertible; provides a stable solution
Analytical Solution	$(X ᵀ X) ⁻ ¹ X ᵀ y$	$(X ᵀ X + λ I) ⁻ ¹ X ᵀ y$
Key Property	Unbiased estimate, but can have very high variance	Biased estimate, but with significantly lower variance
Best Use Case	Low-dimensional data, no multicollinearity	High-dimensional data (esp. $d > m$ ), multicollinearity is present

6. How to Choose $λ$ ?

$λ$ is a critical hyperparameter that controls the trade-off between bias and variance. It is not learned from the training data. The optimal value of $λ$ is typically found using Cross-Validation, a technique that evaluates the model's performance on unseen data for different $λ$ values and selects the one that generalizes best.

Lecture 5: Supervised Learning: Regression and Classification IV

梯度下降 (Gradient Descent - GD) 算法：动机

核心目标: 在机器学习中，我们通常会定义一个成本函数 (Cost Function) $C (w)$ ，它衡量了模型在当前参数 $w$ 下的“糟糕程度”。我们的目标就是找到一组参数 $w$ 来最小化这个函数。这里的 $w$ 是一个包含所有模型参数（如权重）的向量， $w = [w_{1}, w_{2}, \dots, w_{d}]^{⊤}$ 。
梯度的角色:
为了找到最小值，我们需要知道应该朝哪个方向调整参数。梯度 (Gradient) 就是这个问题的答案。函数 $C (w)$ 在某一点的梯度，记作 $\nabla_{w} C (w)$ ，是一个向量，它指向该点函数值上升最快的方向。

这个梯度向量由函数对每个参数的偏导数构成：
$\nabla_{w} C (w) = (\begin{matrix} \frac{\partial C}{\partial w_{1}} \\ \frac{\partial C}{\partial w_{2}} \\ ⋮ \\ \frac{\partial C}{\partial w_{d}} \end{matrix})$
下降最快的方向:
既然梯度指向“上坡”最陡的方向，那么它的反方向，即负梯度 $- \nabla_{w} C (w)$ ，就必然指向“下坡”最陡的方向。因此，如果我们想让函数值 $C (w)$ 减小得最快，就应该沿着负梯度的方向去更新参数 $w$ 。这就是梯度下降算法的核心思想。

梯度下降 (GD) 算法：基础

算法流程:
梯度下降是一个迭代过程。我们从一个随机的初始参数点 $w_{0}$ 开始，然后反复执行一个简单的更新步骤，直到满足某个停止条件。
核心更新公式:
每一步迭代都遵循以下公式：
$w_{k + 1} \leftarrow w_{k} - η \nabla_{w} C (w_{k})$
- $w_{k}$ 是参数在第 $k$ 次迭代时的位置。
- $w_{k + 1}$ 是参数更新后的新位置。
- $\nabla_{w} C (w_{k})$ 是在当前位置 $w_{k}$ 计算出的梯度。
- $η$ (eta) 是学习率 (Learning Rate)，一个正的小常数，它决定了我们每一步“走多远”。
理论保证:
根据微积分原理，只要学习率 $η$ 被设定在一个合理的范围内（不能太大），那么每执行一次更新，成本函数的值都会减小： $C (w_{k + 1}) < C (w_{k})$ 。通过不断重复这个过程，我们就能逐步逼近函数的（局部）最小值。
收敛准则 (Convergence Criteria):
我们不能让算法无限地运行下去，必须定义一个停止条件。常见的停止条件有：
1. 最大迭代次数: 预先设定一个迭代上限，例如1000次，达到后强制停止。
2. 目标函数变化阈值: 当函数值的下降变得非常缓慢，例如 $| C (w_{k + 1}) - C (w_{k}) |$ 小于一个很小的数 $ϵ$ ，就认为已经收敛。
3. 参数变化阈值: 当参数向量本身几乎不再变化，例如 $∥ w_{k + 1} - w_{k} ∥$ 小于一个很小的数 $ϵ$ ，也认为已经收敛。
局限性：局部最小值:
梯度下降算法的一个关键特性是它只会沿着下坡方向走。如果一个函数的图像有多个山谷（多个局部最小值），梯度下降只能保证找到它出发点所在的那个山谷的底部，而不能保证找到全局最低的那个山谷。一旦到达任何一个梯度为零的点（局部最小值或鞍点）， $\nabla_{w} C (w)$ 变为零向量，更新就会停止。
学习率 $η$ 的影响:
- $η$ 太小: 更新步长太小，就像小碎步下山，虽然稳妥但速度极慢。
- $η$ 太大: 更新步长太大，可能会在山谷底部来回“跨越”，导致震荡而无法精确到达最低点，甚至可能一步跨到对面更高的山坡上，导致函数值不降反升，即发散。

Adagrad (Adaptive Gradient Algorithm)

动机:
标准梯度下降对所有参数都使用同一个学习率 $η$ 。但在实际问题中，不同参数的重要性可能不同，有些参数可能需要更快的更新，有些则需要更慢、更稳健的更新。Adagrad 旨在为每一个参数自动地、自适应地调整其学习率。
核心思想:
对于那些在训练过程中梯度一直很大的参数（意味着它们被频繁、大幅度地更新），我们应该对其保持谨慎，减小其学习率；反之，对于梯度一直很小的参数，我们应该给予其更大的学习率以鼓励其更新。
更新公式:
Adagrad 对标准GD公式中的学习率 $η$ 做了一个修正：
$w_{k + 1, i} \leftarrow w_{k, i} - \frac{η}{\sqrt{M_{i i, k} + ϵ}} [\nabla_{w} C (w_{k})]_{i}$
- 这个更新是针对第 $i$ 个参数 $w_{i}$ 的。
- $[\nabla_{w} C (w_{k})]_{i}$ 是梯度向量的第 $i$ 个分量。
- $M_{i i, k}$ 是一个关键项，它累积了第 $i$ 个参数从训练开始到第 $k$ 步所有历史梯度的平方和： $M_{i i, k} = \sum_{s = 0}^{k} ([\nabla_{w} C (w_{s})]_{i})^{2}$
- $ϵ$ 是一个极小的正数（如 $10^{- 8}$ ），用于防止分母为零。
优缺点:
- 优点: 自动调整学习率，无需手动设置。
- 缺点: 由于 $M_{i i, k}$ 是一个持续累加的正数，它会单调递增。这导致分母不断变大，学习率会持续衰减，最终可能变得过小，使得模型在后期无法有效学习。

动量法 (Momentum-based GD)

动机:
在梯度下降的过程中，如果损失函数的等高线是狭长的椭圆形，梯度方向会垂直于等高线，导致更新路径呈“之”字形，收敛缓慢。动量法旨在缓解这个问题并加速收敛。
核心思想:
引入物理学中的“惯性”或“动量”概念。参数的更新方向不仅取决于当前所在位置的梯度，还受到其历史更新方向的影响。
更新公式:
动量法引入一个“速度”向量 $v$ ，它累积了历史梯度的信息：
$v_{k} \leftarrow β v_{k - 1} + (1 - β) \nabla_{w} C (w_{k})$ $w_{k + 1} \leftarrow w_{k} - η v_{k}$
- $v_{k}$ 是第 $k$ 步的速度向量，它是所有历史梯度的指数加权移动平均。
- $β$ 是动量因子，通常取值接近1（如0.9），它决定了历史速度在多大程度上被保留。当梯度方向变化时，动量可以平滑更新，抑制震荡；当梯度方向一致时，动量会累积，使得更新加速。

Nesterov 加速梯度 (NAG)

动机:
动量法虽然有效，但有一个小缺陷：它是先计算当前位置的梯度，再结合历史速度进行“盲目”的更新。NAG 通过引入“预判”机制，使其更加智能。
核心思想:
“先跳再修正”。NAG 不在当前位置 $w_{k}$ 计算梯度，而是先用历史速度 $v_{t - 1}$ “预估”一下下一步大致会跳到哪里（一个临时位置 $w_{k} - β v_{t - 1}$ ），然后在这个未来的位置计算梯度，用这个更具前瞻性的梯度来修正最终的更新方向。
更新公式:
$v_{t} \leftarrow β v_{t - 1} + η \nabla_{w} C (w_{k} - β v_{t - 1})$ $w_{k + 1} \leftarrow w_{k} - v_{k}$
- 关键区别在于梯度计算点 $\nabla_{w} C (\cdot)$ 是在临时更新点 $w_{k} - β v_{t - 1}$ ，而不是当前点 $w_{k}$ 。这使得 NAG 能够更早地感知到函数曲率的变化，从而提前减速，减少震荡，收敛更快。

分类器的间隔 (Margin of Classifier)

正确分类的数学表达:
对于一个线性分类器，其决策边界由 $θ^{⊤} x = 0$ 定义。我们通常使用符号函数 $sign (θ^{⊤} x)$ 来进行预测。如果一个样本 $(x, y)$ 被正确分类，那么预测的符号必须和真实标签 $y$ (取值为+1或-1) 的符号一致。这可以简洁地写成一个统一的表达式：
$y (θ^{⊤} x) > 0$
定义 1.1: 函数间隔 (Functional Margin):
- 定义: 样本 $(x, y)$ 关于分类器 $θ$ 的函数间隔就是表达式 $y (θ^{⊤} x)$ 的值。
- 解释:
  - 符号: 函数间隔的正负号代表了分类是否正确。正数表示正确，负数表示错误。
  - 大小: 函数间隔的绝对值 $| y (θ^{⊤} x) |$ 代表了分类的置信度。值越大，表示数据点离决策边界越远，我们对这个分类结果就越有信心。

线性可分 (Linearly Separable)

定义 1.2:
如果存在一个参数 $θ$ ，使得对于数据集中的所有样本 $(x_{t}, y_{t})$ ，它们的函数间隔都为正，即： $y_{t} (θ^{⊤} x_{t}) > 0, \forall t = 1, \dots, n$ 那么这个数据集就被称为线性可分的。
几何意义: 这意味着，我们可以找到一个超平面，能够像一把刀切蛋糕一样，完美地将两类数据点分到超平面的两侧，没有任何一个点被切错。

γ-线性可分 (γ-linearly separable)

定义 1.3:
这是一个比线性可分更强的条件。如果存在一个参数 $θ$ 和一个正数 $γ > 0$ ，使得对于数据集中的所有样本，它们的函数间隔都不小于 $γ$ ，即： $y_{t} (⟨ θ, x_{t} ⟩) \geq γ, \forall t = 1, \dots, n$
核心思想: 这不仅要求数据能被分开，还要求在两类数据之间必须存在一条“安全地带”或“缓冲区”，而这个缓冲区的最小“函数宽度”就是 $γ$ 。所有的数据点都必须离决策边界至少有这么远（以函数间隔衡量）。

几何间隔 (Geometric Margin)

动机:
函数间隔 $y (θ^{⊤} x)$ 存在一个问题：如果我们把 $θ$ 替换为 $2 θ$ ，决策边界 $θ^{⊤} x = 0$ 并未改变，但所有点的函数间隔都翻了一倍。这意味着函数间隔的大小是相对的，无法作为衡量真实距离的绝对标准。
定义:
几何间隔是数据点到超平面的真实欧几里得距离，它是通过对函数间隔进行归一化得到的：
$γ_{geom} = \frac{y (θ^{⊤} x)}{∥ θ ∥} = \frac{函数间隔}{参数向量的长度}$
意义: 几何间隔是一个绝对的、不受缩放影响的度量。它真实地反映了分类边界的“稳健性”。几何间隔越大，分类器对新数据的泛化能力通常越好。支持向量机(SVM)的核心目标，就是找到那个能使最小几何间隔最大化的决策超平面。

最大间隔线性分类器

SVM 优化问题的推导:
1. 目标: 最大化数据集中所有样本的最小几何间隔： $max_{θ, θ_{0}} (min_{t} \frac{y_{t} (x_{t}^{⊤} θ + θ_{0})}{∥ θ ∥})$
2. 简化: 我们可以通过缩放 $θ$ 和 $θ_{0}$ 来任意设定最小的函数间隔。为了方便，我们进行规范化，令 $min_{t} y_{t} (x_{t}^{⊤} θ + θ_{0}) = 1$ 。
3. 转化: 在这个规范化条件下，最大化 $\frac{1}{∥ θ ∥}$ 就等价于最小化 $∥ θ ∥$ ，进一步等价于最小化 $\frac{1}{2} ∥ θ ∥^{2}$ （这是一个二次函数，更易于优化）。
4. 约束: 规范化条件 $min_{t} y_{t} (x_{t}^{⊤} θ + θ_{0}) = 1$ 意味着对于所有点，都必须满足 $y_{t} (x_{t}^{⊤} θ + θ_{0}) \geq 1$ 。
SVM 的原始形式 (Primal form of SVM):
综合以上推导，我们得到了硬间隔 SVM 的标准优化问题：
$min_{θ, θ_{0}} \frac{1}{2} ∥ θ ∥^{2} subject to y_{t} (x_{t}^{⊤} θ + θ_{0}) \geq 1, \forall t$

Primal-SVM：唯一解证明

结论: 硬间隔SVM的优化问题有且仅有一个全局最优解。
证明方法: 反证法 (Proof by Contradiction)。
证明步骤:
1. 假设: 存在两个不同的最优解 $θ_{1}$ 和 $θ_{2}$ ( $θ_{1} \neq θ_{2}$ )。
2. 性质: 因为它们都是最优解，所以它们的目标函数值必须相等，即 $\frac{1}{2} ∥ θ_{1} ∥^{2} = \frac{1}{2} ∥ θ_{2} ∥^{2} ⟹ ∥ θ_{1} ∥ = ∥ θ_{2} ∥$ 。
3. 构造新解: 考虑它们的中点 $\bar{θ} = \frac{θ_{1} + θ_{2}}{2}$ 。
4. 验证可行性: 我们可以证明 $\bar{θ}$ 仍然满足约束条件 $y_{t} (x_{t}^{⊤} \bar{θ}) \geq 1$ 。因为 $y_{t} (x_{t}^{⊤} θ_{1}) \geq 1$ 和 $y_{t} (x_{t}^{⊤} θ_{2}) \geq 1$ ，两者相加再除以2，不等式依然成立。
5. 导出矛盾: 根据向量的严格三角不等式，只要 $θ_{1}$ 和 $θ_{2}$ 不是同向的（在这里因为它们不同但长度相等，所以必然不同向），那么它们和的长度必然小于它们长度的和。即 $∥ θ_{1} + θ_{2} ∥ < ∥ θ_{1} ∥ + ∥ θ_{2} ∥$ 。因此： $∥ \bar{θ} ∥ = ‖ \frac{θ_{1} + θ_{2}}{2} ‖ < \frac{∥ θ_{1} ∥ + ∥ θ_{2} ∥}{2} = ∥ θ_{1} ∥$
6. 结论: 我们找到了一个新解 $\bar{θ}$ ，它的范数比假设的最优解 $θ_{1}$ 的范数还要小，这与“ $θ_{1}$ 是最优解”的前提相矛盾。因此，最初的假设是错误的，最优解必须是唯一的。

软间隔 SVM (Soft Margin SVM)

动机:
在现实数据中，数据往往不是完全线性可分的，或者存在一些异常点。硬间隔 SVM 要求所有点都必须被严格分开，这使得它在这些情况下无法找到解。
核心思想:
放宽硬性约束，允许一些样本点“犯规”。具体来说，允许某些点：
1. 进入间隔区内（分类正确，但离边界太近）。
2. 被错误地分到另一边。
  但是，我们要为这些“犯规”的行为付出一定的代价。
软间隔 SVM 的原始形式:
我们为每个样本 $(x_{t}, y_{t})$ 引入一个松弛变量 (slack variable) $ξ_{t} \geq 0$ 。
$min_{θ, θ_{0}, ξ} \frac{1}{2} ∥ θ ∥^{2} + C \sum_{t = 1}^{n} ξ_{t}$
约束条件变为：
$y_{t} (x_{t}^{⊤} θ + θ_{0}) \geq 1 - ξ_{t}, \forall t$
公式解释:
- $ξ_{t}$ (xi): 衡量了第 $t$ 个样本“犯规”的程度。
  - 如果 $ξ_{t} = 0$ ，说明该点满足硬间隔要求。
  - 如果 $0 < ξ_{t} \leq 1$ ，说明该点在间隔区内，但分类仍正确。
  - 如果 $ξ_{t} > 1$ ，说明该点已经被错误分类。
- $\sum ξ_{t}$ : 所有样本的总“犯规”量。
- $C$ : 惩罚参数，是一个非常重要的超参数。它控制了我们对“犯规”的容忍程度。
  - $C$ 很大: 我们对犯规的惩罚很重，模型会尽力减小 $\sum ξ_{t}$ ，趋向于找到一个尽可能将所有点都正确分类的解，即使这意味着间隔很窄（容易过拟合）。
  - $C$ 很小: 我们对犯规的容忍度很高，模型更愿意为了获得一个更宽的间隔（减小 $∥ θ ∥^{2}$ ）而容忍一些点的犯规（允许 $\sum ξ_{t}$ 较大），通常泛化能力更好。
Hinge Loss 形式 (Slide 83):
软间隔SVM的优化问题可以被等价地写成一个无约束的优化问题。从约束 $y_{t} (\dots) \geq 1 - ξ_{t}$ 和 $ξ_{t} \geq 0$ 可以推导出，对于每个样本，我们希望其 $ξ_{t}$ 尽可能小，所以 $ξ_{t} = max (0, 1 - y_{t} (x_{t}^{⊤} θ + θ_{0}))$ 。代入目标函数，得到：
$min_{θ, θ_{0}} \underset{正则项/最大化间隔}{\underset{⏟}{\frac{1}{2} ∥ θ ∥^{2}}} + C \sum_{t = 1}^{n} \underset{Hinge Loss / 经验风险}{\underset{⏟}{max (0, 1 - y_{t} (x_{t}^{⊤} θ + θ_{0}))}}$