Lecture 1: Linear Algebra: Systems of Linear Equations, Matrices, Vector Spaces, Linear Independence
Part I: Concept
-
Vector:
Objects that can be added together and scaled (multiplied by scalars). These operations must satisfy certain axioms (e.g., commutativity of addition, distributivity of scalar multiplication over vector addition).
- Examples: Geometric vectors (arrows in 2D/3D space), polynomials, audio signals, tuples in $\mathbb{R}^n$ (e.g., $(x_1, \dots, x_n)$).
-
Closure:
A fundamental property of a set with respect to specific operations (here, vector addition and scalar multiplication). It means that if you take any two elements from the set and perform the operation, the result will always also be an element of that same set. If a non-empty set of vectors satisfies closure under vector addition and scalar multiplication (along with other axioms), it forms a Vector Space.
- Significance: Closure ensures that the algebraic structure (the set and its operations) is self-contained and consistent. It is a cornerstone for defining what a vector space is.
-
Solution of the linear equation system:
An n-tuple $(x_1, \dots, x_n)$ that simultaneously satisfies all equations in a given system of linear equations. Each component represents the value for the corresponding variable.
- Connection to Vectors: Each such n-tuple is itself a vector in $\mathbb{R}^n$. Therefore, finding the solutions to a linear system is equivalent to finding specific vectors that fulfill the given conditions.
-
A system of linear equations can have:
- No solution (inconsistent)
- Exactly one solution (unique)
- Infinitely many solutions (e.g., underdetermined systems)
-
Matrix Notation:
A system of linear equations can be compactly represented using matrix multiplication. A system of $m$ linear equations in $n$ unknowns can be written as $A\mathbf{x} = \mathbf{b}$,
where $A$ is the $m \times n$ coefficient matrix, $\mathbf{x}$ is the variable vector (or unknowns vector), and $\mathbf{b}$ is the constant vector (or right-hand side vector).
-
Augmented Matrix:
When solving a system of linear equations using row operations (like Gaussian elimination), it is convenient to combine the coefficient matrix and the constant vector into a single matrix called the augmented matrix.
- Notation: It is typically written as [A | b], where the vertical line separates the coefficient matrix from the constant vector.
- Structure: If $A$ is an $m \times n$ matrix and $\mathbf{b}$ is an $m \times 1$ column vector, then the augmented matrix [A | b] is an $m \times (n+1)$ matrix.
- Purpose: This notation allows us to perform elementary row operations on the entire system (both coefficients and constants) simultaneously, simplifying the process of finding the solutions. Each row of the augmented matrix directly corresponds to an equation in the linear system.
Part II: Matrix Operation
- Matrix Addition:
  - For $A, B \in \mathbb{R}^{m \times n}$: $$(A+B)_{ij} = a_{ij} + b_{ij}$$
- Matrix Multiplication
  - For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times k}$: $$(AB)_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}$$
  - Multiplication is only defined if the inner dimensions match: $\mathbb{R}^{m \times n} \cdot \mathbb{R}^{n \times k} \to \mathbb{R}^{m \times k}$
  - Elementwise multiplication is called the Hadamard product: $(A \circ B)_{ij} = a_{ij} b_{ij}$
  - Identity matrix: $I_n \in \mathbb{R}^{n \times n}$, with ones on the diagonal and zeros everywhere else
  - Multiplicative identity: $I_m A = A I_n = A$
  - Algebraic properties:
    - Associativity: $(AB)C = A(BC)$
    - Distributivity: $(A+B)C = AC + BC$ and $A(B+C) = AB + AC$
- Matrix Inverse: For $A \in \mathbb{R}^{n \times n}$, a matrix $B \in \mathbb{R}^{n \times n}$ is called the inverse of $A$ if $AB = I_n = BA$. It is denoted $A^{-1}$.
  - Invertibility: $A$ is called regular/invertible/nonsingular if $A^{-1}$ exists. Otherwise, it is singular/noninvertible.
  - Uniqueness: If $A^{-1}$ exists, it is unique.
  - Properties: $(A^{-1})^{-1} = A$ and $(AB)^{-1} = B^{-1}A^{-1}$
- Matrix Transpose
  - Transpose: For $A \in \mathbb{R}^{m \times n}$, the transpose $A^{\top} \in \mathbb{R}^{n \times m}$ is defined by $(A^{\top})_{ij} = a_{ji}$.
  - Properties: $(A^{\top})^{\top} = A$, $(A+B)^{\top} = A^{\top} + B^{\top}$, $(AB)^{\top} = B^{\top}A^{\top}$
  - If $A$ is invertible, $(A^{-1})^{\top} = (A^{\top})^{-1}$.
- Symmetric Matrix:
  $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^{\top}$.
  - Sum: The sum of symmetric matrices is symmetric.
  - Properties:
    - Symmetry under congruence transformation: if $A$ is symmetric, then $B^{\top} A B$ is also symmetric.
    - Diagonalizability of real symmetric matrices: every real symmetric matrix is orthogonally diagonalizable. This means there exists an orthogonal matrix $Q$ (where $Q^{\top}Q = I$) and a diagonal matrix $\Lambda$ such that $A = Q \Lambda Q^{\top}$. The diagonal entries of $\Lambda$ are the eigenvalues of $A$, and the columns of $Q$ are the corresponding orthonormal eigenvectors.
- Scalar Multiplication: For $\lambda \in \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$: $(\lambda A)_{ij} = \lambda\, a_{ij}$
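As a quick numerical companion to these definitions, here is a minimal NumPy sketch; the matrices are arbitrary illustrations, not examples from the lecture.

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])

print(A + B)             # elementwise sum: (A+B)_ij = a_ij + b_ij
print(A @ B)             # matrix product: inner dimensions must match
print(A * B)             # Hadamard (elementwise) product
print(A.T)               # transpose
print(np.linalg.inv(A))  # inverse (A is nonsingular here)

S = A + A.T              # a symmetric matrix by construction
w, Q = np.linalg.eigh(S) # orthogonal diagonalization: S = Q diag(w) Q^T
print(np.allclose(Q @ np.diag(w) @ Q.T, S))
```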
-
Solution:
- Consider the system:
- Two equations, four unknowns: the system is underdetermined, so we expect infinitely many solutions.
- The first two columns form an identity matrix. This means $x_1$ and $x_2$ are pivot variables (or basic variables), and $x_3$ and $x_4$ are free variables.
- To find a particular solution, we can set the free variables to zero.
  - Setting $x_3 = 0$ and $x_4 = 0$ gives:
    - From the first row: the value of $x_1$
    - From the second row: the value of $x_2$
  - Thus, $x_p = (x_1, x_2, 0, 0)^{\top}$ is a particular solution (also called a special solution).
- To find the general solution for the non-homogeneous system Ax = b (which describes all infinitely many solutions), we need to understand the solutions to the associated homogeneous system Ax = 0.
- Consider the homogeneous system:
  - Again, $x_1$ and $x_2$ are pivot variables, and $x_3$ and $x_4$ are free variables. We express the pivot variables in terms of the free variables:
    - From the first row: $x_1$ in terms of $x_3$ and $x_4$
    - From the second row: $x_2$ in terms of $x_3$ and $x_4$
  - Let the free variables be parameters: $x_3 = \lambda_1$ and $x_4 = \lambda_2$, where $\lambda_1, \lambda_2 \in \mathbb{R}$.
  - The homogeneous solution ($x_h$) can be written in vector form as $x_h = \lambda_1 v_1 + \lambda_2 v_2$.
- The vectors $v_1$ and $v_2$ form a basis for the null space of matrix $A$, denoted $\mathrm{Nul}(A)$. These are also sometimes called special solutions to $Ax = 0$.
- The General Solution for Ax = b:
  The complete set of solutions for a consistent linear system Ax = b is the sum of any particular solution $x_p$ and the entire null space $\mathrm{Nul}(A)$: $x = x_p + \lambda_1 v_1 + \lambda_2 v_2$ for $\lambda_1, \lambda_2 \in \mathbb{R}$.
  This formula describes all the infinitely many solutions to the original system Ax = b.
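The recipe above can be reproduced numerically. Because the concrete 2×4 system was lost from these notes, the matrix and right-hand side below are hypothetical stand-ins with the same structure (RREF, first two columns forming an identity block):

```python
import numpy as np

# Hypothetical 2x4 system already in RREF; x1, x2 are pivots, x3, x4 are free.
A = np.array([[1., 0., 8., -4.],
              [0., 1., 2., 12.]])
b = np.array([42., 8.])

# Particular solution: set the free variables to zero.
x_p = np.array([b[0], b[1], 0., 0.])

# Null-space basis: solve Ax = 0 with one free variable set to 1 at a time.
v1 = np.array([-8., -2., 1., 0.])   # x3 = 1, x4 = 0
v2 = np.array([4., -12., 0., 1.])   # x3 = 0, x4 = 1

# Any lambda1, lambda2 gives a solution x = x_p + lambda1*v1 + lambda2*v2.
lam1, lam2 = 2.0, -1.0
x = x_p + lam1 * v1 + lam2 * v2
print(np.allclose(A @ x, b))        # True
```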
- Rank-Nullity Theorem
  - Theorem Statement:
    For an $m \times n$ matrix $A$, the rank of $A$ plus the nullity of $A$ equals the number of columns $n$.
    That is: rank(A) + nullity(A) = n
    This also implies: nullity(A) = n - rank(A)
  - Explanation of Terms:
    - Rank of A (rank(A)):
      - Definition: The dimension of the column space (Col(A)) of matrix $A$. It is equal to the number of pivot variables in the Reduced Row Echelon Form (RREF) of $A$.
    - Nullity of A (nullity(A)):
      - Definition: The dimension of the null space (Nul(A)) of matrix $A$. It is equal to the number of free variables in the Reduced Row Echelon Form (RREF) of $A$.
    - $n$ (Number of Columns / Variables):
      - Definition: The number of columns of matrix $A$, which represents the total number of unknowns in the system.
  - Intuitive Meaning:
    This theorem shows that the total number of variables in a system ($n$) is divided into two parts: one part is constrained by the equations, whose count is the rank; the other part consists of variables that can be freely chosen in the solution, whose count is the nullity. That is, (Number of Pivot Variables) + (Number of Free Variables) = (Total Number of Variables).
  - Example (Using the 2x4 Matrix):
    - Consider the matrix $A$ from our previous discussion.
    - Here, the number of columns $n = 4$ (as there are four unknowns $x_1, x_2, x_3, x_4$).
    - This matrix is already in Reduced Row Echelon Form.
      - Pivot Variables: $x_1, x_2$ (corresponding to the leading 1s in each row). Thus, rank(A) = 2.
      - Free Variables: $x_3, x_4$ (variables not corresponding to pivot positions). Thus, nullity(A) = 2.
    - Verifying the Theorem:
      - rank(A) + nullity(A) = 2 + 2 = 4. This matches the number of columns.
      - nullity(A) = n - rank(A), i.e., 2 = 4 - 2. This also holds.
    - This example illustrates the Rank-Nullity Theorem.
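A quick numerical check of the theorem, using the same kind of stand-in 2×4 RREF matrix (the lecture's exact entries are not preserved here):

```python
import numpy as np

A = np.array([[1., 0., 8., -4.],
              [0., 1., 2., 12.]])

n = A.shape[1]                    # number of columns (unknowns)
rank = np.linalg.matrix_rank(A)   # number of pivots
nullity = n - rank                # dimension of the null space

print(rank, nullity, rank + nullity == n)   # 2 2 True
```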
- Elementary Row Transformations (Elementary Row Operations)
  - Definition:
    Elementary row transformations are a set of operations that can be performed on the rows of a matrix. These operations are crucial because they transform a matrix into an equivalent matrix (meaning they preserve the solution set of the corresponding linear system, as well as the row space, the column space dimension, and the null space of the matrix).
  - Types of Elementary Row Transformations:
    There are three fundamental types of elementary row transformations:
    - Row Swap (Interchange Two Rows):
      - Description: Exchange the positions of two rows.
      - Notation: $R_i \leftrightarrow R_j$ (swap Row $i$ with Row $j$)
      - Example:
    - Row Scaling (Multiply a Row by a Non-zero Scalar):
      - Description: Multiply all entries in a row by a non-zero constant scalar.
      - Notation: $R_i \to cR_i$ (multiply Row $i$ by scalar $c$, where $c \neq 0$)
      - Example:
    - Row Addition (Add a Multiple of One Row to Another Row):
      - Description: Add a scalar multiple of one row to another row. The row being added to is replaced by the result.
      - Notation: $R_j \to R_j + cR_i$ (add $c$ times Row $i$ to Row $j$, and replace Row $j$)
      - Example:
  - Purpose and Importance:
- Solving Linear Systems: Elementary row transformations are the foundation of Gaussian elimination and Gauss-Jordan elimination, which are algorithms used to solve systems of linear equations by transforming the augmented matrix into row echelon form or reduced row echelon form.
- Finding Matrix Inverse: They can be used to find the inverse of a square matrix.
- Determining Rank: They help in finding the rank of a matrix (number of pivots/non-zero rows in REF, RREF).
- Finding Null Space Basis: They are essential for transforming the matrix to RREF to identify free variables and determine the basis for the null space.
- Equivalence: Two matrices are row equivalent if one can be transformed into the other using a sequence of elementary row transformations. Row equivalent matrices have the same row space, null space, and therefore the same rank.
  - Importance:
    - If the augmented matrix is in row echelon form:
      * The system is solvable if and only if no row has the form [0 0 ... 0 | c] with c ≠ 0.
      * What does the row [0 0 0 0 | 0] mean?
        * A row of all zeros, including the constant term, means that the original equation corresponding to this row was a linear combination of other equations in the system. In other words, this equation was redundant and provides no new information about the variables.
        * Crucially, 0 = 0 is always a true statement. This indicates that the system is consistent (it has solutions). It does not imply that there are no solutions (an inconsistent system would have a row like [0 0 0 0 | c] where c ≠ 0).
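Each of the three operations is easy to express as an in-place update on a NumPy array; the example matrix below is arbitrary:

```python
import numpy as np

# Augmented matrix [A | b] of an arbitrary small system.
M = np.array([[2., 4., -2., 2.],
              [1., 1.,  3., 5.]])

M[[0, 1]] = M[[1, 0]]        # 1) row swap:     R1 <-> R2
M[1] = 0.5 * M[1]            # 2) row scaling:  R2 -> (1/2) R2
M[1] = M[1] - 2.0 * M[0]     # 3) row addition: R2 -> R2 - 2 R1
print(M)                     # the solution set of the system is unchanged
```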
- Row Equivalent Matrices
  - Definition:
    Two matrices are said to be row equivalent if one can be obtained from the other by a finite sequence of elementary row transformations.
  - Mechanism:
    The concept is built upon the three elementary row transformations (Row Swap, Row Scaling, Row Addition) discussed previously. Applying these operations one or more times transforms a matrix into a row equivalent one.
  - Notation:
    If matrix $A$ is row equivalent to matrix $B$, this is written as $A \sim B$.
  - Key Properties of Row Equivalent Matrices (What is Preserved):
    Elementary row transformations are powerful because they preserve several fundamental properties of a matrix, which are critical for solving linear systems and understanding matrix spaces:
    - Same Solution Set for Linear Systems: If an augmented matrix $[A \mid b]$ is row equivalent to another augmented matrix $[A' \mid b']$, then the linear system $Ax = b$ has exactly the same set of solutions as $A'x = b'$. This is the underlying principle that allows us to solve systems by row reducing their augmented matrices.
    - Same Row Space: The row space (the vector space spanned by the row vectors of the matrix) remains unchanged under elementary row transformations.
    - Same Null Space: The null space (the set of all solutions to the homogeneous equation $Ax = 0$) remains unchanged.
    - Same Rank: Since the dimension of the row space and the dimension of the null space are preserved, the rank of the matrix (which is the dimension of the column space, and equals the dimension of the row space) is also preserved.
    - Same Reduced Row Echelon Form (RREF): Every matrix is row equivalent to a unique Reduced Row Echelon Form (RREF). This unique RREF is often used as a canonical (standard) form for a matrix.
  - Importance and Applications:
    - Solving Linear Systems: By transforming an augmented matrix into its RREF, we can directly read off the solutions, because the RREF is row equivalent to the original matrix and thus has the same solution set.
    - Finding Matrix Inverse: A square matrix $A$ is invertible if and only if it is row equivalent to the identity matrix $I_n$.
    - Basis for Subspaces: Row operations are used to find bases for the row space, column space, and null space of a matrix.
  - Example:
    Consider a matrix $A$. We can perform an elementary row operation on it to obtain a matrix $B$.
    Thus, $A \sim B$: these two matrices are row equivalent.
- Calculating the Inverse via Augmented Matrix:
  The Reduced Row Echelon Form (RREF) is extremely useful for inverting matrices. This strategy is also known as the Gauss-Jordan elimination method for inverses.
  - Requirement: For this strategy, we need the matrix A to be square ($n \times n$). An inverse only exists for square matrices.
  - Core Idea: To compute the inverse $A^{-1}$ of an $n \times n$ matrix $A$, we essentially solve the matrix equation $AX = I_n$ for the unknown matrix $X$. The solution will be $X = A^{-1}$. Each column of $X$ represents the solution to $Ax_i = e_i$, where $e_i$ is the $i$-th standard basis vector (a column of $I_n$).
  - Procedure:
    - Write the augmented matrix [A | I_n]:
      - Definition: This is an augmented matrix formed by concatenating the square matrix $A$ on the left side with the identity matrix $I_n$ on the right side.
      - Purpose: This unified matrix allows us to perform elementary row operations on $A$ and, simultaneously, apply the same operations to $I_n$. Each row operation on $[A \mid I_n]$ is equivalent to multiplying the original matrix $A$ (and $I_n$) by an elementary matrix from the left. By transforming $A$ into $I_n$, we are effectively finding the product of elementary matrices that "undo" $A$, which is precisely $A^{-1}$.
      - Example for a 2x2 matrix: If $A \in \mathbb{R}^{2 \times 2}$, then the augmented matrix is $[A \mid I_2]$.
    - Perform Gaussian Elimination (Row Reduction): Use elementary row transformations to bring the augmented matrix to its reduced row-echelon form. The goal is to transform the left block (where $A$ was) into the identity matrix $I_n$.
    - Read the Inverse: If the left block successfully transforms into $I_n$, then the right block of the final matrix will be $A^{-1}$.
    - Case of Non-Invertibility: If, during the row reduction process, you cannot transform the left block into $I_n$ (e.g., if you end up with a row of zeros in the left block), then the matrix $A$ is singular (non-invertible), and $A^{-1}$ does not exist.
    - Proof: Compatibility of Matrix Multiplication with Partitioned Matrices
  - Limitation: For non-square matrices, the augmented matrix method (to find a traditional inverse) is not defined because non-square matrices do not have inverses.
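A minimal sketch of the [A | I] strategy in NumPy, with no pivoting safeguards; the 3×3 matrix is a made-up example, not one from the lecture:

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Invert a square matrix by row-reducing the augmented matrix [A | I]."""
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])   # build [A | I_n]
    for i in range(n):
        M[i] = M[i] / M[i, i]                      # scale pivot row so pivot = 1
        for j in range(n):
            if j != i:
                M[j] = M[j] - M[j, i] * M[i]       # eliminate column i in other rows
    return M[:, n:]                                # right block is now A^{-1}

A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
print(np.allclose(gauss_jordan_inverse(A), np.linalg.inv(A)))   # True
```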
- Algorithms for Solving Linear Systems (Ax = b): Direct Methods
  This section outlines various direct algorithms used to find solutions for the linear system Ax = b.
1. Direct Inversion:
    - Applicability: This method is used if the coefficient matrix $A$ is square and invertible (i.e., non-singular).
    - Formula: The solution $x$ is directly computed as: $x = A^{-1}b$
    - Mechanism: If $A$ is invertible, its inverse $A^{-1}$ exists, and multiplying both sides of $Ax = b$ by $A^{-1}$ from the left yields $x = A^{-1}b$.
  - 2. Pseudo-inverse (Moore–Penrose Pseudo-inverse):
    - Applicability: This method is used if $A$ is not square but has linearly independent columns (i.e., full column rank). This is common in overdetermined systems (more equations than unknowns, $m > n$) where an exact solution might not exist, but we seek a "best fit" solution.
    - Formula: The solution $x$ is given by: $x = (A^{\top}A)^{-1}A^{\top}b$
    - Result: This formula provides the minimum-norm least-squares solution. It finds the vector $x$ that minimizes the Euclidean norm of the residual, $\|Ax - b\|_2$. If an exact solution exists, this method finds it. If not, it finds the solution that is "closest" to satisfying the equations in a least-squares sense.
  - Limitations (Common to Inversion and Pseudo-inversion Methods):
    - Computationally Expensive: Calculating matrix inverses or pseudo-inverses is generally computationally expensive, especially for large systems. The computational cost typically scales with $O(n^3)$ for an $n \times n$ matrix.
    - Numerically Unstable: These methods can be numerically unstable for large or ill-conditioned systems, meaning small errors in input data or floating-point arithmetic can lead to large errors in the computed inverse and solution.
  - 3. Gaussian Elimination:
    - Mechanism: This is a systematic method that reduces the augmented matrix [A | b] to row-echelon form or reduced row-echelon form to solve Ax = b. It involves a series of elementary row operations.
    - Scalability: Gaussian elimination is generally efficient for thousands of variables. However, it is not practical for very large systems (e.g., millions of variables) because its computational cost scales cubically with the number of variables ($O(n^3)$), making it too slow and memory-intensive for extremely large problems.
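The three approaches map directly onto standard NumPy routines (in practice `np.linalg.solve`, which factorizes rather than inverts, is preferred over forming an explicit inverse). The systems below are made up for illustration:

```python
import numpy as np

# 1) Square, invertible A: direct solve
A = np.array([[3., 1.], [1., 2.]])
b = np.array([9., 8.])
x1 = np.linalg.inv(A) @ b        # explicit inverse (discouraged in practice)
x2 = np.linalg.solve(A, b)       # preferred: same answer, more stable
print(np.allclose(x1, x2))

# 2) Tall A with full column rank: least-squares / pseudo-inverse solution
A2 = np.array([[1., 0.], [1., 1.], [1., 2.]])
b2 = np.array([1., 2., 2.])
x_ls = np.linalg.pinv(A2) @ b2                     # Moore-Penrose pseudo-inverse
x_ls2, *_ = np.linalg.lstsq(A2, b2, rcond=None)    # equivalent least-squares solve
print(np.allclose(x_ls, x_ls2))
```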
Part III: Vector Spaces and Groups
-
Group:
A group is a set $G$ together with a binary operation $\otimes$ (that combines any two elements of $G$ to form a third element also in $G$) that satisfies the following four axioms:
- Closure: For all $x, y \in G$, the result of the operation is also in $G$.
  - (Formally: $x \otimes y \in G$)
- Associativity: For all $x, y, z \in G$, the order in which multiple operations are performed does not affect the result.
  - (Formally: $(x \otimes y) \otimes z = x \otimes (y \otimes z)$)
- Identity Element: There exists an element $e \in G$ (called the identity element) such that for every element $x \in G$, operating with $e$ (in any order) leaves $x$ unchanged.
  - (Formally: $x \otimes e = x$ and $e \otimes x = x$)
- Inverse Element: For every element $x \in G$, there exists an element $y \in G$ (called the inverse of $x$) such that operating with $y$ (in any order) yields the identity element $e$.
  - (Formally: $x \otimes y = e$ and $y \otimes x = e$)
Additional Terminology:
-
Abelian Group (Commutative Group): If, in addition to the four axioms above, the operation
is also commutative (i.e., for all ), then the group is called an Abelian group. -
Order of a Group: The number of elements in a group
is called its order, denoted by . If the number of elements is finite, it's a finite group; otherwise, it's an infinite group.
Examples of Groups:
- The set of integers
under addition is an Abelian group. - (Closure:
) - (Associativity:
) - (Identity:
, ) - (Inverse: for
, , )
- (Closure:
- The set of non-zero rational numbers
under multiplication is an Abelian group. - The set of all invertible
matrices under matrix multiplication is a non-Abelian group (for ). This is called the general linear group .
-
Continuation of Notes
-
Vector Space:
A vector space is a set of objects called vectors ($V$), along with a set of scalars (usually the real numbers $\mathbb{R}$), equipped with two operations: vector addition and scalar multiplication. These operations must satisfy ten axioms.
Axioms of a Vector Space:
Let $u, v, w$ be vectors in $V$ and let $c, d$ be scalars in $\mathbb{R}$.
- Closure under Addition: $u + v$ is in $V$.
- Commutativity of Addition: $u + v = v + u$.
- Associativity of Addition: $(u + v) + w = u + (v + w)$.
- Zero Vector (Additive Identity): There is a vector $0$ in $V$ such that $u + 0 = u$.
- Additive Inverse: For every vector $u$, there is a vector $-u$ in $V$ such that $u + (-u) = 0$.
  - Connection to Groups: The first five axioms mean that the set of vectors $V$ with the addition operation forms an Abelian Group.
- Closure under Scalar Multiplication: $cu$ is in $V$.
- Distributivity: $c(u + v) = cu + cv$.
- Distributivity: $(c + d)u = cu + du$.
- Associativity of Scalar Multiplication: $c(du) = (cd)u$.
- Scalar Identity: $1u = u$.
-
Subspace:
A subspace of a vector space $V$ is a subset $U$ of $V$ that is itself a vector space under the same operations of addition and scalar multiplication defined on $V$.
- Subspace Test: To verify if a subset
is a subspace, we only need to check three conditions: - Contains the Zero Vector: The zero vector of
is in ( ). - Closure under Addition: For any two vectors
, their sum is also in . - Closure under Scalar Multiplication: For any vector
and any scalar , the vector is also in .
- Contains the Zero Vector: The zero vector of
- Key Examples:
- Any line or plane in
that passes through the origin is a subspace of . - The null space of an
matrix , denoted , is a subspace of . - The column space of an
matrix , denoted , is a subspace of .
- Any line or plane in
- Subspace Test (ๅญ็ฉบ้ดๅคๅซๆณ): To verify if a subset
-
Linear Combination:
Given vectors $v_1, \dots, v_k$ in a vector space $V$ and scalars $c_1, \dots, c_k$, the vector defined by $c_1 v_1 + \dots + c_k v_k$ is called a linear combination of $v_1, \dots, v_k$ with weights $c_1, \dots, c_k$.
-
Span:
- Definition: The span of a set of vectors
, denoted , is the set of all possible linear combinations of these vectors. - Geometric Interpretation:
(where ) is the line passing through the origin and . (where are not collinear) is the plane containing the origin, , and .
- Property: The span of any set of vectors is always a subspace.
- Definition: The span of a set of vectors
-
Linear Independence and Dependence:
- Linear Independence: A set of vectors
is linearly independent if the vector equation has only the trivial solution ( ). - Linear Dependence: The set is linearly dependent if there exist weights
, not all zero, such that the equation holds. - Intuitive Meaning: A set of vectors is linearly dependent if and only if at least one of the vectors can be written as a linear combination of the others. Linearly independent vectors are non-redundant.
- Linear Independence: A set of vectors
-
Basis:
A basis for a vector spaceis a set of vectors that satisfies two conditions: - The set
is linearly independent. - The set
spans the vector space (i.e., ).
- A basis is a "minimal" set of vectors needed to build the entire space.
- The set
-
Dimension:
- Definition: The dimension of a non-zero vector space
, denoted , is the number of vectors in any basis for . The dimension of the zero vector space is defined to be 0. - Uniqueness: Although a vector space can have many different bases, all bases for a given vector space have the same number of vectors.
- Connection to Rank-Nullity:
- The dimension of the column space of a matrix
is its rank: . - The dimension of the null space of a matrix
is its nullity: .
- The dimension of the column space of a matrix
- Definition: The dimension of a non-zero vector space
-
Testing for Linear Independence using Gaussian Elimination:
To test if a set of vectors
is linearly independent: - Form a matrix
using these vectors as its columns. - Perform Gaussian elimination on matrix
to reduce it to row echelon form.
- The original vectors that correspond to pivot columns are linearly independent.
- The vectors that correspond to non-pivot columns can be written as a linear combination of the preceding pivot columns.
Example: In the following row echelon form matrix:
Column 1 and Column 3 are pivot columns (their corresponding original vectors are independent); Column 2 is a non-pivot column (its corresponding original vector is dependent).
Therefore, the original set of vectors (the columns of matrix
) is not linearly independent because there is at least one non-pivot column. In other words, the set is linearly dependent.
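SymPy's `Matrix.rref()` returns both the reduced form and the indices of the pivot columns, which is exactly the test described above; the column vectors below are illustrative, not the lecture's example:

```python
from sympy import Matrix

# Columns: v1, v2 = 2*v1 (dependent on v1), v3 (independent of v1)
A = Matrix([[1, 2, 0],
            [2, 4, 1],
            [3, 6, 1]])

rref_form, pivot_cols = A.rref()
print(pivot_cols)                       # (0, 2): columns 1 and 3 carry pivots
print(len(pivot_cols) == A.shape[1])    # False -> the columns are linearly dependent
```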
-
Linear Independence of Linear Combinations
Let's consider a set of
linearly independent vectors , which can be seen as a basis for a -dimensional space. We can form a new set of vectors where each is a linear combination of the base vectors: Each set of weights can be represented by a coefficient vector
. - Key Implication: The set of new vectors
is linearly independent if and only if the set of their corresponding coefficient vectors is linearly independent.
- Key Implication: The set of new vectors
-
The Dimension Theorem for Spanning Sets (A Fundamental Theorem)
The dimension of a vector space cannot exceed the number of vectors in any of its spanning sets. A direct consequence is that in a vector space of dimension, any set containing more than vectors must be linearly dependent. -
Special Case: More New Vectors than Base Vectors (
) Theorem: If you use
linearly independent vectors to generate new vectors, and , the resulting set of new vectors is always linearly dependent. Proof (using Matrix Rank):
-
Focus on the Coefficient Vectors: As established, the linear independence of
is equivalent to the linear independence of their coefficient vectors . We will prove that the set must be linearly dependent. -
Construct the Coefficient Matrix: Let's arrange these coefficient vectors as the columns of a matrix,
: Since each coefficient vector
is in , the matrix has rows and columns (it is a matrix). -
Analyze the Rank of the Matrix: The rank of a matrix has a fundamental property: it cannot exceed its number of rows or its number of columns. Specifically, we are interested in the fact that
(the number of rows). - (Justification via a more fundamental theorem: The rank is the dimension of the row space. The row space is spanned by
row vectors. By The Dimension Theorem for Spanning Sets, the dimension of this space cannot exceed .)
- (Justification via a more fundamental theorem: The rank is the dimension of the row space. The row space is spanned by
-
Apply the Condition
: We have established two key facts about the matrix : - The total number of columns is
. - The rank, which represents the maximum number of linearly independent columns, is at most
. That is, .
- The total number of columns is
-
Connect Rank to Linear Dependence: We are given that
. This leads to the crucial inequality: This inequality means it is impossible for all
columns of to be linearly independent. If you have more vectors ( ) than the dimension of the space they can span (the rank, which is at most ), the set of vectors must be linearly dependent. -
Draw the Conclusion: Because the columns of
(which are the coefficient vectors ) form a linearly dependent set, the set of new vectors that they define must also be linearly dependent. Q.E.D.
-
Lecture 2: Linear Algebra: Basis and Rank, Linear Mappings, Affine Spaces
Part I: Basis and Rank
-
Generating Set (or Spanning Set)
- Definition: A set of vectors
is called a generating set for a vector space if . - Key Idea: A generating set can be redundant; it may contain linearly dependent vectors.
- Example: The set
is a generating set for . It is redundant because is a linear combination of the other two vectors.
- Definition: A set of vectors
-
Span (Additional Property)
- Connection to Linear Systems: A system of linear equations
has a solution if and only if the vector is in the span of the columns of matrix . That is, .
- Connection to Linear Systems: A system of linear equations
-
Basis (Additional Properties)
- Unique Representation Theorem: A key property of a basis is that every vector
in the space can be expressed as a linear combination of the basis vectors in exactly one way. The coefficients of this unique combination are called the coordinates of with respect to that basis. - Example (The Standard Basis): The most common basis for
is the standard basis, which consists of the columns of the identity matrix . For , the standard basis is .
- Unique Representation Theorem: A key property of a basis is that every vector
-
Characterizations of a Basis
For a non-empty set of vectors $B$ in a vector space $V$, the following statements are equivalent (meaning if one is true, all are true):
- $B$ is a basis of $V$.
- $B$ is a minimal generating set (i.e., it spans $V$, but no proper subset of $B$ spans $V$).
- $B$ is a maximal linearly independent set (i.e., it is linearly independent, but adding any other vector from $V$ to it would make the set linearly dependent).
-
Further Properties of Dimension
- Existence and Uniqueness of Size: Every non-trivial vector space has a basis. While a space can have many different bases, all of them will have the same number of vectors. This makes the concept of dimension well-defined.
- Subspace Dimension: If
is a subspace of a vector space , then . Equality holds if and only if . - Important Clarification: The dimension of a space refers to the number of vectors in its basis, not the number of components in each vector. For example, the subspace spanned by the single vector
is one-dimensional, even though the vector lives in .
-
How to Find a Basis for a Subspace (Basis Extraction Method)
To find a basis for a subspace
that is defined as the span of a set of vectors : - Create a matrix
where the columns are the vectors . - Reduce the matrix
to its row echelon form. - Identify the columns that contain pivots.
- The basis for
consists of the original vectors from the set that correspond to these pivot columns.
- Create a matrix
-
Rank of a Matrix
- Definition: The rank of a matrix
, denoted , is the number of linearly independent columns in . A fundamental theorem of linear algebra states that this number is always equal to the number of linearly independent rows. - Key Property: The rank of a matrix is equal to the rank of its transpose:
.
- Definition: The rank of a matrix
-
Rank and its Connection to Fundamental Subspaces
- Column Space (Image/Range): The rank of
is the dimension of its column space.
- Null Space (Kernel): The rank determines the dimension of the null space through the Rank-Nullity Theorem. For an
matrix :
- Column Space (Image/Range): The rank of
-
Properties and Applications of Rank
- Invertibility of Square Matrices: An
matrix is invertible if and only if its rank is equal to its dimension, i.e., . This is because a full-rank square matrix can be row-reduced to the identity matrix. - Solvability of Linear Systems: The system
has at least one solution if and only if the rank of the coefficient matrix is equal to the rank of the augmented matrix .
Reasoning: If $\mathrm{rank}(A) < \mathrm{rank}([A \mid b])$, it means that the vector $b$ is linearly independent of the columns of $A$. Therefore, $b$ cannot be written as a linear combination of the columns of $A$, and no solution exists.
- A matrix has full rank if its rank is the maximum possible for its dimensions:
. - A matrix is rank deficient if
, indicating linear dependencies among its rows or columns.
- A matrix has full rank if its rank is the maximum possible for its dimensions:
- Invertibility of Square Matrices: An
-
Why Rank is Important
The rank of a matrix is a core concept that reveals its fundamental structure. It tells us:
- The maximum number of linearly independent rows/columns.
- The dimension of the data (the dimension of the subspace spanned by the columns).
- Whether a linear system is consistent (has solutions).
- Whether a square matrix has an inverse.
- It is crucial for identifying redundancy and simplifying problems in data analysis, optimization, and machine learning.
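These rank facts are easy to check with `np.linalg.matrix_rank`; the sketch below tests solvability of Ax = b by comparing rank(A) with rank([A | b]) on an illustrative rank-deficient system:

```python
import numpy as np

A = np.array([[1., 2.],
              [2., 4.]])          # rank-deficient: second row = 2 * first row
b_good = np.array([3., 6.])       # consistent right-hand side
b_bad = np.array([3., 7.])        # inconsistent right-hand side

def is_consistent(A, b):
    aug = np.column_stack([A, b])
    return np.linalg.matrix_rank(A) == np.linalg.matrix_rank(aug)

print(np.linalg.matrix_rank(A))   # 1 (not full rank 2)
print(is_consistent(A, b_good))   # True  -> at least one solution
print(is_consistent(A, b_bad))    # False -> no solution
```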
-
Summary: Tying Rank, Basis, and Pivots Together
- You start with a set of vectors.
- You place them as columns in a matrix
. - You perform Gaussian elimination to find the pivots.
- The number of pivots is the rank of the matrix
. - This rank is also the dimension of the subspace spanned by the original vectors.
- The original vectors corresponding to the pivot columns form a basis for that subspace.
Part II: Linear Mappings
-
Linear Mappings (Linear Transformations)
- Definition: A mapping (or function)
from a vector space to a vector space is called linear if it preserves the two fundamental vector space operations: - Additivity:
for all . - Homogeneity:
for any scalar .
- Additivity:
- Matrix Representation: Any linear mapping between finite-dimensional vector spaces can be represented by matrix multiplication:
for some matrix .
- Definition: A mapping (or function)
-
Properties of Mappings: Injective, Surjective, Bijective
- Injective (One-to-one): A mapping is injective if distinct inputs always map to distinct outputs. Formally, if
, then it must be that . - Surjective (Onto): A mapping is surjective if its range is equal to its codomain. This means every element in the target space
is the image of at least one element from the starting space . - Bijective: A mapping is bijective if it is both injective and surjective. A bijective mapping has a unique inverse mapping, denoted
.
- Injective (One-to-one): A mapping is injective if distinct inputs always map to distinct outputs. Formally, if
-
Special Types of Linear Mappings
- Homomorphism: For vector spaces, a homomorphism is simply another term for a linear mapping. It's a map that preserves the algebraic structure (addition and scalar multiplication).
- Isomorphism: A linear mapping that is also bijective. Isomorphic vector spaces are structurally identical, just with potentially different-looking elements.
- Endomorphism: A linear mapping from a vector space to itself (
). It does not need to be invertible. - Automorphism: An endomorphism that is also bijective. It is an isomorphism from a vector space to itself (e.g., a rotation or reflection).
- Identity Mapping: The map defined by
. It leaves every vector unchanged and is the simplest example of an automorphism.
-
Isomorphism
- Isomorphism and Dimension: A fundamental theorem states that two finite-dimensional vector spaces,
and , are isomorphic (structurally identical) if and only if they have the same dimension.
- Intuition: This means any n-dimensional vector space is essentially a "re-labeling" of
. - Properties of Linear Mappings:
- The composition of two linear mappings is also a linear mapping.
- The inverse of an isomorphism is also an isomorphism.
- The sum and scalar multiple of linear mappings are also linear.
- Isomorphism and Dimension: A fundamental theorem states that two finite-dimensional vector spaces,
-
Matrix Representation via Ordered Bases
The isomorphism between an abstract n-dimensional spaceand the concrete space is made practical by choosing an ordered basis. The order of the basis vectors matters for defining coordinates. - Notation: We denote an ordered basis with parentheses, e.g.,
.
- Notation: We denote an ordered basis with parentheses, e.g.,
-
Coordinates and Coordinate Vectors
- Definition: Given an ordered basis
of , every vector can be written uniquely as: The scalars are called the coordinates of with respect to the basis . - Coordinate Vector: We collect these coordinates into a single column vector, which represents
in the standard space :
- Definition: Given an ordered basis
-
Coordinate Systems and Change of Basis
- Concept: A basis defines a coordinate system for the vector space. The familiar Cartesian coordinates in
are simply the coordinates with respect to the standard basis . Any other basis defines a different, but equally valid, coordinate system. - Example: A single vector
has different coordinates in different bases. For instance, its coordinate vector might be with respect to the standard basis, but with respect to another basis . This means and also .
- Concept: A basis defines a coordinate system for the vector space. The familiar Cartesian coordinates in
-
Importance for Linear Mappings
Once we fix ordered bases for the input and output spaces, we can represent any linear mapping as a concrete matrix. This matrix representation is entirely dependent on the chosen bases.
Part III: Basis Change and Transformation Matrices
This section covers the representation of abstract linear mappings as matrices and the mechanics of changing coordinate systems.
1. The Transformation Matrix for a Linear Map
A transformation matrix provides a concrete computational representation for an abstract linear mapping, relative to chosen bases.
-
Definition and Context:
We are given a linear map, an ordered basis for the vector space , and an ordered basis for the vector space . -
Construction of the Transformation Matrix (
):
The transformation matrixis an matrix whose columns describe how the input basis vectors are transformed by . It is constructed as follows: - For each input basis vector
(from to ): - Apply the linear map to get its image:
. - Express this image as a unique linear combination of the output basis vectors from
: - The coefficients from this combination form the
-th column of the matrix . This column is the coordinate vector of with respect to basis :
- For each input basis vector
-
Usage and Interpretation:
This matrix maps the coordinate vector of any(relative to basis ) to the coordinate vector of its image (relative to basis ). The core operational formula is: This formula translates the abstract function application into a concrete matrix-vector multiplication. The matrix
is the representation of the map with respect to the chosen bases; changing either basis will result in a different transformation matrix for the same underlying linear map. -
Invertibility:
The linear mapis an invertible isomorphism if and only if its transformation matrix is square ( ) and invertible. Non-Invertible Transformations and Information Loss
2. The Change of Basis Matrix
This is a special application of the transformation matrix, used to convert a vector's coordinates from one basis to another within the same vector space. This process is equivalent to finding the transformation matrix for the identity map ($\mathrm{id}_V$).
-
The Change of Basis Matrix (
):
This matrix converts coordinates from a new basisto an old basis . Its columns are the coordinate vectors of the new basis vectors, expressed in the old basis. Note: If the "old" basis
is the standard basis in , the columns of this matrix are simply the vectors of the new basis themselves. -
Usage Formulas:
- To convert coordinates from new (
) to old ( ): - To convert coordinates from old (
) to new ( ):
- To convert coordinates from new (
-
Example: Change of Basis in
- Old Basis (Standard):
- New Basis:
- Change of Basis Matrix from
to : - To express a vector
(whose coordinates are given in the standard basis ) in the new basis :
- Old Basis (Standard):
3. The Theorem of Basis Change for Linear Mappings
Change-of-Basis Theorem
This theorem provides a formula to calculate the new transformation matrix for a linear map when the bases (the coordinate systems) for its domain and codomain are changed.
-
Theorem Statement:
Given a linear mapping, with: - An "old" input basis
and a "new" input basis , both for the domain . - An "old" output basis
and a "new" output basis , both for the codomain . - The original transformation matrix
(relative to the old bases and ).
The new transformation matrix
(relative to the new bases and ) is given by: Where the change-of-basis matrices are defined as:
: The matrix for the basis change within the domain . It converts coordinates from the new input basis to the old input basis . : The matrix for the basis change within the codomain . It converts coordinates from the new output basis to the old output basis .
- An "old" input basis
-
Explanation of the Formula (The "Path" of the Transformation):
The formula represents a sequence of three operations on a coordinate vector. The path for the coordinates is from. - Step 1:
S
(fromin the Domain): We start with a vector's coordinates in the new input basis, . We apply to translate these coordinates into the old input basis: . - Step 2:
Aฮฆ
(from Basisto Basis ): We apply the original transformation matrix to the coordinates, which are now expressed in the old input basis . This yields the image's coordinates in the old output basis : . - Step 3:
Tโปยน
(fromin the Codomain): The result is in the old output basis . To express it in the new output basis , we must apply the inverse of . Since converts from to , must be used to convert from to : .
- Step 1:
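The three-step path can be sanity-checked numerically: pick an arbitrary map and arbitrary invertible basis-change matrices, then verify that the new matrix acts on new-basis coordinates exactly as the formula predicts. All matrices here are hypothetical:

```python
import numpy as np

A_phi = np.array([[1., 2.],
                  [0., 1.]])          # matrix of the map in the old bases B and C
S = np.array([[1., 1.],
              [0., 2.]])              # new input basis  -> old input basis
T = np.array([[2., 0.],
              [1., 1.]])              # new output basis -> old output basis

A_new = np.linalg.inv(T) @ A_phi @ S  # transformation matrix in the new bases

x_new = np.array([3., -1.])           # coordinates of a vector in the new input basis
lhs = A_new @ x_new                   # map applied directly in new coordinates
rhs = np.linalg.inv(T) @ (A_phi @ (S @ x_new))   # step-by-step path: S, then A, then T^{-1}
print(np.allclose(lhs, rhs))          # True
```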
4. Matrix Equivalence and Similarity
These concepts formalize the idea that different matrices can represent the same underlying linear map, just with different coordinate systems.
-
Matrix Equivalence:
- Definition: Two
matrices and are equivalent if there exist invertible matrices (in the domain) and (in the codomain) such that . - Interpretation: Equivalent matrices represent the exact same linear transformation
. They are merely different numerical representations of due to different choices of bases (coordinate systems) within the domain and the codomain .
- Definition: Two
-
Matrix Similarity:
- Definition: Two square
matrices and are similar if there exists a single invertible matrix such that . - Interpretation: Similarity is a special case of equivalence for endomorphisms (
), where the same space serves as both domain and codomain, and therefore the same basis change (i.e., ) is applied to both the input and output coordinates.
- Definition: Two square
5. Composition of Linear Maps
- Theorem: If
and are linear mappings, their composition is also a linear mapping. - Matrix Representation: The transformation matrix of the composite map is the product of the individual transformation matrices, in reverse order of application:
Part IV: Affine Spaces and Subspaces
While vector spaces and subspaces are fundamental, they are constrained by one critical requirement: they must contain the origin. Affine spaces generalize this idea to describe geometric objects like lines and planes that do not necessarily pass through the origin.
-
Core Intuition: Vector Space vs. Affine Space
- A vector subspace is a line or plane (or higher-dimensional equivalent) that must pass through the origin.
- An affine subspace is a line or plane (or higher-dimensional equivalent) that has been shifted or translated so it no longer needs to pass through the origin. It is a "flat" surface in the vector space.
-
Formal Definition of an Affine Subspace
An affine subspace
of a vector space is a subset that can be expressed as the sum of a specific vector (a point) and a vector subspace. Where:
is a specific vector, often called the translation vector or support point. It acts as the "anchor" that shifts the space. is a vector subspace of , often called the direction space or associated vector subspace. It defines the orientation and "shape" (line, plane, etc.) of the affine subspace.
The dimension of the affine subspace
is defined as the dimension of its direction space . -
Geometric Examples:
- A Line in
: A line passing through point with direction vector is an affine subspace. Here, the support point is , and the direction space is the 1D vector subspace . - A Plane in
: A plane containing point and parallel to vectors and (which are linearly independent) is an affine subspace. Here, the support point is , and the direction space is the 2D vector subspace .
- A Line in
-
Connection to Solutions of Linear Systems (Crucial Application)
Affine subspaces provide the perfect geometric description for the solution sets of linear systems.
-
Homogeneous System
Ax = 0
: The set of all solutions to a homogeneous system is the Null Space of, denoted . The null space is always a vector subspace. -
Non-Homogeneous System
Ax = b
: The set of all solutions to a non-homogeneous system (where) is an affine subspace.
Recall the general solution formula:Let's map this to the definition of an affine subspace
: - The particular solution
serves as the translation vector . - The set of all homogeneous solutions
is the direction space . This is precisely the null space, .
Therefore, the complete solution set for
Ax = b
is the affine subspace:This means the solution set is the null space
shifted by a particular solution vector . - The particular solution
-
-
Summary of Key Differences
| Feature | Vector Subspace (U) | Affine Subspace (L = p + U) |
|---|---|---|
| Must Contain Origin? | Yes (0 ∈ U). | No, unless p ∈ U. |
| Closure under Addition? | Yes. If u₁, u₂ ∈ U, then u₁ + u₂ ∈ U. | No. In general, l₁ + l₂ ∉ L. |
| Closure under Scaling? | Yes. If u ∈ U, then cu ∈ U. | No. In general, cl₁ ∉ L. |
| Geometric Example | A line/plane through the origin. | Any line/plane, shifted. |
| Linear System Example | Solution set of Ax = 0. | Solution set of Ax = b. |
- Affine Combination
- A related concept is an affine combination. It is a linear combination where the coefficients sum to 1.
- An affine subspace is closed under affine combinations. The set of all affine combinations of a set of points forms the smallest affine subspace containing them (their "affine span").
[[How to understand the general solution of the linear system Aλ = b using the structure of an Affine Subspace]]
- A related concept is an affine combination. It is a linear combination where the coefficients sum to 1.
Part V: Hyperplanes
A hyperplane is a generalization of the concept of a line (in 2D) and a plane (in 3D) to vector spaces of any dimension. It is an extremely important and common special case of an affine subspace.
1. Core Intuition
- In a 2D space (
), a hyperplane is a line (which is 1-dimensional). - In a 3D space (
), a hyperplane is a plane (which is 2-dimensional). - In an n-dimensional space (
), a hyperplane is an (n-1)-dimensional "flat" subspace.
Its key function is to "slice" the entire space into two half-spaces, making it an ideal decision boundary in classification problems.
2. Two Equivalent Definitions of a Hyperplane
Hyperplanes can be defined in two equivalent ways: one algebraic and one geometric.
Definition 1: The Algebraic Definition (via a Single Linear Equation)
A hyperplane
where d
is a constant.
Using vector notation, this equation becomes much more compact:
- Normal Vector
: The vector is called the normal vector to the hyperplane. Geometrically, it is perpendicular to the hyperplane itself. - Offset
d
: The constantd
determines the hyperplane's offset from the origin.- If
d = 0
, the hyperplaneaแตx = 0
passes through the origin and is itself an (n-1)-dimensional vector subspace. - If
d โ 0
, the hyperplane does not pass through the origin and is a true affine subspace.
- If
Definition 2: The Geometric Definition (via Affine Subspaces)
A hyperplane
Where:
is any specific point on the hyperplane (the support point). is a vector subspace of dimension n-1 (the direction space).
3. The Connection Between the Definitions
These two definitions are perfectly equivalent.
-
From Algebraic to Geometric (
aแตx = d
โp + U
):- Direction Space
U
: The direction spaceU
is the parallel hyperplane that passes through the origin. It is the set of all vectorsu
that satisfyaแตu = 0
. This set is the orthogonal complement of the normal vectora
and has dimension n-1. - Support Point
p
: We can find a support pointp
by finding any particular solution to the equationaแตx = d
.
- Direction Space
-
Example: Consider the plane
in . - Algebraic Form: Normal vector
, offset . - Geometric Form:
- Find a support point
p
: Let. Then . So, a point on the plane is . - Find the direction space
U
:U
is the set of all vectorsu
such that. This is a 2-dimensional plane passing through the origin. - The hyperplane can thus be written as
, where the two vectors in the span form a basis for U
.
- Find a support point
- Algebraic Form: Normal vector
4. Hyperplanes in Machine Learning
Hyperplanes are at the core of many machine learning algorithms, most famously the Support Vector Machine (SVM).
- As a Decision Boundary: In a binary classification problem, the goal is to find a hyperplane that best separates data points belonging to two different classes.
- The SVM Hyperplane: An SVM seeks to find an optimal hyperplane defined by the equation:
is the weight vector, which is equivalent to the normal vector a
.b
is the bias term, which is related to the offsetd
.
- The Classification Rule:
- If a new data point
satisfies , it is assigned to one class (e.g., the positive class). - If it satisfies
, it is assigned to the other class (e.g., the negative class). - This means the classification of a point is determined by which side of the hyperplane it lies on. The goal of an SVM is to find the
w
andb
that make this separating "margin" as wide as possible.
- If a new data point
Part VI: Affine Mappings
We have established that linear mappings, of the form φ(x) = Ax, always preserve the origin (i.e., φ(0) = 0). However, many practical applications, especially in computer graphics, require transformations that include translation, which moves the origin. This more general class of transformation is called an affine mapping.
1. Core Idea: A Linear Map Followed by a Translation
An affine mapping is, in essence, a composition of a linear mapping and a translation.
- Linear Part: Handles rotation, scaling, shearing, and other transformations that keep the origin fixed.
- Translation Part: Shifts the entire result to a new location in the space.
2. Formal Definition
A mapping f: V โ W
from a vector space V
to a vector space W
is called an affine mapping if it can be written in the form:
Where:
A
is anmatrix representing the linear part of the transformation. b
is anvector representing the translation part.
Distinction from Linear Mappings:
- If the translation vector
b = 0
, the affine map degenerates into a purely linear map. - If
b โ 0
, thenf(0) = A(0) + b = b
, which means the origin is no longer mapped to the origin but is moved to the position defined byb
.
3. Key Properties of Affine Mappings
While affine maps are generally not linear (since f(x+y) ≠ f(x) + f(y)
), they preserve several crucial geometric properties.
-
Lines Map to Lines: An affine map transforms a straight line into another straight line (or, in a degenerate case, a single point if the line's direction is in the null space of
A
). -
Parallelism is Preserved: If two lines are parallel, their images under an affine map will also be parallel.
-
Ratios of Lengths are Preserved: If a point
P
is the midpoint of a line segmentQR
, then its imagef(P)
will be the midpoint of the image segmentf(Q)f(R)
. This property is vital for maintaining the relative structure of geometric shapes. -
Affine Combinations are Preserved: This is the most fundamental algebraic property of an affine map. If a point y is an affine combination of a set of points xᵢ (meaning y = Σαᵢxᵢ where Σαᵢ = 1), then its image f(y) is the same affine combination of the images f(xᵢ): f(y) = Σαᵢ f(xᵢ).
4. Homogeneous Coordinates: The Trick to Unify Transformations
In fields like computer graphics, it is highly desirable to represent all transformations, including translations, with a single matrix multiplication. The standard form Ax + b
requires both a multiplication and an addition, which is inconvenient for composing multiple transformations.
Homogeneous Coordinates elegantly solve this problem by adding an extra dimension, effectively turning an affine map into a linear map in a higher-dimensional space.
-
How it Works:
- An n-dimensional vector
x = (xโ, ..., xโ)แต
is represented as an (n+1)-dimensional homogeneous vector: - An affine map
f(x) = Ax + b
is represented by an(n+1) ร (n+1)
augmented transformation matrix:Here, A
is thelinear part, and b
is thetranslation vector. The bottom row consists of zeros followed by a one.
- An n-dimensional vector
-
The Unified Operation:
The affine transformation can now be performed with a single matrix multiplication. The notation below shows the block matrix multiplication explicitly:The resulting vector's first
n
components are exactly the desiredAx + b
, and the final component remains1
. -
The Advantage:
This technique allows a sequence of transformations (e.g., a rotation, then a scaling, then a translation) to be composed by first multiplying their respective augmented matrices. The resulting single matrix can then be applied to all points, dramatically simplifying the computation and management of complex transformations.
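A short NumPy sketch of composing transformations in homogeneous coordinates; the rotation angle, scale factor, and translation are arbitrary choices for illustration:

```python
import numpy as np

def affine_to_homogeneous(A, b):
    """Pack the affine map x -> Ax + b into one (n+1)x(n+1) matrix."""
    n = A.shape[0]
    M = np.eye(n + 1)
    M[:n, :n] = A
    M[:n, n] = b
    return M

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])            # rotation (linear part only)
T = affine_to_homogeneous(np.eye(2), np.array([3., 1.]))    # pure translation
S = affine_to_homogeneous(2 * np.eye(2), np.zeros(2))       # uniform scaling
M = T @ S @ affine_to_homogeneous(R, np.zeros(2))           # rotate, then scale, then translate

x = np.array([1., 0.])
x_h = np.append(x, 1.0)      # homogeneous coordinates (x1, x2, 1)
print((M @ x_h)[:2])         # transformed point back in R^2
```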
5. Summary
| Concept | Linear Mapping (Ax) | Affine Mapping (Ax + b) |
|---|---|---|
| Essence | Rotation / Scaling / Shearing | Linear Transformation + Translation |
| Preserves Origin? | Yes, f(0) = 0 | No, f(0) = b in general |
| Preserves Lin. Comb.? | Yes | No |
| What is Preserved? | Lines, parallelism, linear combinations | Lines, parallelism, affine combinations |
| Representation | Matrix A | Matrix A and vector b |
| Homogeneous Form | [A 0; 0ᵀ 1] | [A b; 0ᵀ 1] |
Lecture 3: Analytic Geometry: Norms, Inner Products, and Lengths and Distances, Angles and Orthogonality
Part I: Geometric Structures on Vector Spaces
In the previous parts, we established the algebraic framework of vector spaces and linear mappings. Now, we will enrich these spaces with geometric structure, allowing us to formalize intuitive concepts like the length of a vector, the distance between vectors, and the angle between them. These concepts are captured by norms and inner products.
1. Norms
A norm is a formal generalization of the intuitive notion of a vector's "length" or "magnitude".
-
Geometric Intuition: The norm of a vector is its length, i.e., the distance from the origin to the point the vector represents.
-
Formal Definition of a Norm:
A norm on a vector spaceis a function that assigns a non-negative real value to every vector . This function must satisfy the following three axioms for all vectors and any scalar : -
Positive Definiteness: The length is positive, except for the zero vector.
-
Absolute Homogeneity: Scaling a vector scales its length by the same factor.
-
Triangle Inequality: The length of one side of a triangle is no greater than the sum of the lengths of the other two sides.
A vector space equipped with a norm is called a normed vector space.
-
-
Examples of Norms on
:
The idea of "length" can be defined in multiple ways. For a vector: -
The
-norm (Manhattan Norm): Measures the "city block" distance. -
The
-norm (Euclidean Norm): The standard "straight-line" distance, derived from the Pythagorean theorem. This is the default norm. When we write
without a subscript, we almost always mean the Euclidean norm. -
The
-norm (Maximum Norm): The length is determined by the largest component of the vector.
-
-
Distance Derived from a Norm:
Any norm naturally defines a distance $d(x, y)$ between two vectors as the norm of their difference vector: $d(x, y) = \|x - y\|$.
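`np.linalg.norm` implements all three norms through its `ord` argument; a quick check on an arbitrary vector:

```python
import numpy as np

x = np.array([3., -4., 1.])
y = np.array([1., 1., 1.])

print(np.linalg.norm(x, ord=1))       # l1 / Manhattan norm: |3| + |-4| + |1| = 8
print(np.linalg.norm(x, ord=2))       # l2 / Euclidean norm: sqrt(9 + 16 + 1)
print(np.linalg.norm(x, ord=np.inf))  # max norm: 4
print(np.linalg.norm(x - y))          # distance d(x, y) induced by the Euclidean norm
```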
2. Inner Products
An inner product is a more fundamental concept than a norm. It is a function that allows us to define not only the Euclidean norm but also the angle between vectors and the notion of orthogonality (perpendicularity).
-
Motivation: The inner product is a generalization of the familiar dot product in
, which is defined as: -
Formal Definition of an Inner Product:
An inner product on a real vector spaceis a function that takes two vectors and returns a scalar. The function must be a symmetric, positive-definite bilinear map, satisfying the following axioms for all and any scalar : -
Bilinearity: The function is linear in each argument.
- Linearity in the first argument:
- Linearity in the second argument:
- Linearity in the first argument:
-
Symmetry: The order of the arguments does not matter.
-
Positive Definiteness: The inner product of a vector with itself is non-negative, and is zero only for the zero vector.
A vector space equipped with an inner product is called an inner product space.
-
3. The Bridge: From Inner Products to Geometry
The inner product is the foundation of Euclidean geometry within a vector space. All key geometric concepts can be derived from it.
-
The Induced Norm:
Every inner product naturally defines (or induces) a norm given by:It can be proven that this definition satisfies all three norm axioms. The standard Euclidean norm
is precisely the norm induced by the standard dot product. -
The Cauchy-Schwarz Inequality:
This is one of the most important inequalities in mathematics. It relates the inner product of two vectors to their induced norms and provides the foundation for defining angles. -
Geometric Concepts Defined by the Inner Product:
-
Length:
-
Distance:
-
Angle: The angle
between two non-zero vectors and is defined via: (The Cauchy-Schwarz inequality guarantees that the right-hand side is between -1 and 1, so
is well-defined). -
Orthogonality (Perpendicularity):
- Definition: Two vectors
and are orthogonal if their inner product is zero. We denote this as . - Geometric Meaning: If the inner product is the standard dot product, orthogonality means the vectors are perpendicular (the angle between them is 90ยฐ or
radians). - Pythagorean Theorem: If
, then the familiar Pythagorean theorem holds:
- Definition: Two vectors
-
Part II: Geometric Structures on Vector Spaces
1. Symmetric, Positive Definite (SPD) Matrices and Inner Products
In finite-dimensional vector spaces like $\mathbb{R}^n$, every inner product can be represented by a symmetric, positive definite matrix.
- Definition of a Symmetric, Positive Definite Matrix:
  A square matrix $A \in \mathbb{R}^{n \times n}$ is called symmetric, positive definite (SPD) if it satisfies two conditions:
  - Symmetry: The matrix is equal to its transpose: $A = A^\top$.
  - Positive Definiteness: The quadratic form $x^\top A x$ is strictly positive for every non-zero vector $x$: $x^\top A x > 0$ for all $x \neq \mathbf{0}$.
- The Central Theorem: The Matrix Representation of Inner Products
  The deep connection between algebra and geometry is captured by the following theorem, which states that inner products and SPD matrices are two sides of the same coin.
  Theorem: Let $V$ be an $n$-dimensional real vector space with an ordered basis $B$. Let $\hat{x}$ and $\hat{y}$ be the coordinate vectors of $x, y \in V$ with respect to basis $B$. A function defined by $\langle x, y \rangle = \hat{x}^\top A \hat{y}$ is a valid inner product on $V$ if and only if the matrix $A$ is symmetric and positive definite.
  Explanation:
  - This theorem provides a universal recipe for all possible inner products on a finite-dimensional space.
  - The standard dot product in $\mathbb{R}^n$ is the simplest case of this theorem, where the matrix is the identity matrix $I$: $\langle x, y \rangle = x^\top I y = x^\top y$.
  - More importantly, any SPD matrix $A$ can be used to define a new, perfectly valid inner product on $\mathbb{R}^n$. This new inner product defines a new geometry on the space, with its own corresponding notions of length and orthogonality.
- Properties of SPD Matrices
  If a matrix $A$ is symmetric and positive definite, it has several important properties that follow directly from its definition:
  - Invertibility (Trivial Null Space): An SPD matrix is always invertible. Its null space (or kernel) contains only the zero vector.
    - Proof: Suppose there exists a non-zero vector $x$ such that $Ax = \mathbf{0}$. Then multiplying by $x^\top$ gives $x^\top A x = 0$. This contradicts the positive definiteness condition, which states that $x^\top A x$ must be strictly greater than 0 for any non-zero $x$. Therefore, no such non-zero $x$ can exist, and the only solution to $Ax = \mathbf{0}$ is $x = \mathbf{0}$.
  - Positive Diagonal Elements: All the diagonal elements of an SPD matrix are strictly positive.
    - Proof: To find the $i$-th diagonal element $a_{ii}$, we can choose the standard basis vector $e_i$ (which has a 1 in the $i$-th position and 0s elsewhere). Since $e_i$ is a non-zero vector, we must have $e_i^\top A e_i > 0$. But $e_i^\top A e_i$ is precisely the element $a_{ii}$. Thus, $a_{ii} > 0$ for all $i$.
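A small NumPy sketch of these checks on an illustrative SPD matrix (the matrix below is a made-up example, not one from the lecture):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # an illustrative SPD matrix

print(np.allclose(A, A.T))            # True: symmetry A == A^T
print(np.linalg.eigvalsh(A))          # all eigenvalues > 0  <=>  positive definite
print(np.diag(A))                     # diagonal entries are strictly positive
print(np.linalg.inv(A))               # SPD matrices are always invertible

x = np.array([1.0, -1.0])
print(x @ A @ x)                      # quadratic form x^T A x > 0 for x != 0
```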
-
-
Recap: Inner Product vs. Dot Product
It is crucial to distinguish between the general concept and its most common example:
- Inner Product $\langle x, y \rangle$: This is the general concept. It is any function that is bilinear, symmetric, and positive definite.
- Dot Product $x^\top y$: This is a specific example of an inner product in $\mathbb{R}^n$. It is the inner product defined by the identity matrix, $\langle x, y \rangle = x^\top I y$.
- Euclidean Norm $\|x\|_2$: This is the norm that is induced by the dot product: $\|x\|_2 = \sqrt{x^\top x}$. Other inner products (defined by other SPD matrices $A$) induce different norms.
Part III: Angles, Orthogonality, and Orthogonal Matrices
1. Angles and Orthogonality
Inner products not only define lengths and distances but also enable us to define the angle between vectors, thereby generalizing the concept of "perpendicularity".
-
Defining the Angle:
- The Cauchy-Schwarz inequality guarantees that for any non-zero vectors $x$ and $y$: $-1 \leq \dfrac{\langle x, y \rangle}{\|x\|\,\|y\|} \leq 1$.
- This ensures that we can uniquely define an angle $\omega \in [0, \pi]$ (i.e., from 0° to 180°) such that $\cos\omega = \dfrac{\langle x, y \rangle}{\|x\|\,\|y\|}$.
- This $\omega$ is defined as the angle between vectors $x$ and $y$. It measures the similarity of their orientation.
-
Orthogonality:
- Definition: Two vectors $x$ and $y$ are orthogonal if their inner product is zero: $\langle x, y \rangle = 0$. This is denoted as $x \perp y$.
- Geometric Meaning: When $\langle x, y \rangle = 0$, it follows that $\cos\omega = 0$, which means the angle $\omega = \pi/2$ (90°). Therefore, orthogonality is a direct generalization of the geometric concept of "perpendicular".
- Important Corollary: The zero vector is orthogonal to every vector, since $\langle \mathbf{0}, y \rangle = 0$ for all $y$.
-
Orthonormality:
- Two vectors $x$ and $y$ are orthonormal if they are both orthogonal ($\langle x, y \rangle = 0$) and are unit vectors ($\|x\| = 1$, $\|y\| = 1$).
-
Key Point: Orthogonality Depends on the Inner Product
- Just like length, whether two vectors are orthogonal depends entirely on the chosen inner product.
- For example, two vectors can be orthogonal under the standard dot product ($x^\top y = 0$) but fail to be orthogonal under a different inner product $\langle x, y \rangle = x^\top A y$ defined by an SPD matrix $A$. A numerical illustration follows below.
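A minimal sketch of this point, using hypothetical vectors and an illustrative SPD matrix chosen only for demonstration:

```python
import numpy as np

x = np.array([1.0,  1.0])     # illustrative vectors (not taken from the lecture)
y = np.array([-1.0, 1.0])
A = np.array([[2.0, 0.0],     # an SPD matrix defining a second inner product
              [0.0, 1.0]])

print(x @ y)          # 0.0  -> orthogonal w.r.t. the standard dot product
print(x @ A @ y)      # -1.0 -> NOT orthogonal w.r.t. <x, y>_A = x^T A y
```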
2. Orthogonal Matrices
An orthogonal matrix is a special type of square matrix whose corresponding linear transformation geometrically represents a shape-preserving transformation (like a rotation or reflection) and has excellent computational properties.
- Definition:
  A square matrix $A \in \mathbb{R}^{n \times n}$ is called an orthogonal matrix if and only if its columns form an orthonormal set.
- Equivalent Properties:
  The following statements are equivalent and are often used as practical tests for orthogonality:
  - The columns of $A$ are orthonormal.
  - $A^\top A = I = A A^\top$, i.e., $A^{-1} = A^\top$.
  - Core Idea: The inverse of an orthogonal matrix is simply its transpose. This makes the computationally expensive operation of inversion trivial.
- Geometric Properties: Preserving Lengths and Angles
  A linear transformation defined by an orthogonal matrix $A$ is a rigid transformation, meaning it does not alter the geometry of the space.
  - Preserves Lengths: $\|Ax\|^2 = (Ax)^\top (Ax) = x^\top A^\top A x = x^\top x = \|x\|^2$. Therefore, $\|Ax\| = \|x\|$.
  - Preserves Inner Products and Angles: $(Ax)^\top (Ay) = x^\top A^\top A y = x^\top y$. Since the inner product is preserved, the angle defined by it is also preserved.
-
Practical Meaning and Applications:
- Geometric Models: Orthogonal matrices perfectly model rotations and reflections in space.
- Numerical Stability: Algorithms involving orthogonal matrices (like QR decomposition) are typically very numerically stable.
- Change of Coordinate Systems: The transformation from one orthonormal basis to another is described by an orthogonal matrix.
- Construction: The Gram-Schmidt process can be used to construct an orthonormal basis from any set of linearly independent vectors, which can then be used as the columns of an orthogonal matrix.
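A quick numerical check of these properties on a 2D rotation matrix (an illustrative orthogonal matrix, assuming NumPy is available):

```python
import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation: columns are orthonormal

print(np.allclose(Q.T @ Q, np.eye(2)))            # True: Q^T Q = I, so Q^{-1} = Q^T
x = np.array([3.0, 4.0])
print(round(np.linalg.norm(Q @ x), 10))           # 5.0: length of x is preserved
print(round(np.linalg.norm(x), 10))               # 5.0
```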
Part IV: Metric Spaces and the Formal Definition of Distance
Previously, we defined distance in an inner product space as d(x, y) = ||x - y||
. This is a specific instance of a much more general concept called a metric. A metric formalizes the intuitive notion of "distance" between elements of any set, not just vectors.
1. The Metric Function
A metric (or distance function) on a set V
is a function that quantifies the distance between any two elements of that set.
-
Formal Definition:
A metric on a set `V` is a function `d : V × V → ℝ` that maps a pair of elements `(x, y)` to a real number `d(x, y)`, satisfying the following three axioms for all `x, y, z ∈ V`:
-
Positive Definiteness:
- `d(x, y) ≥ 0` (Distance is always non-negative).
- `d(x, y) = 0` if and only if `x = y` (The distance from an element to itself is zero, and it's the only case where the distance is zero).
-
Symmetry:
- `d(x, y) = d(y, x)` (The distance from x to y is the same as the distance from y to x).
-
Triangle Inequality:
- `d(x, z) ≤ d(x, y) + d(y, z)` (The direct path is always the shortest; going from x to z via y is at least as long).
-
-
Metric Space:
A set `V` equipped with a metric `d` is called a metric space, denoted as `(V, d)`.
2. The Connection: From Inner Products to Metrics
The distance we derived from the inner product is a valid metric. We can prove this by showing it satisfies all three metric axioms.
-
Theorem: The distance function `d(x, y) = ||x - y|| = √⟨x - y, x - y⟩` induced by an inner product is a metric.
Proof:
-
Positive Definiteness:
- `d(x, y) = ||x - y||`. By the positive definiteness of norms, `||x - y|| ≥ 0`.
- Equality `||x - y|| = 0` holds if and only if `x - y = 0`, which means `x = y`. The axiom holds.
-
Symmetry:
- `d(x, y) = ||x - y||`. Using the properties of norms, `||x - y|| = ||(-1)(y - x)|| = |-1| ||y - x|| = ||y - x|| = d(y, x)`. The axiom holds.
-
Triangle Inequality:
- `d(x, z) = ||x - z||`. We can rewrite the argument as `x - z = (x - y) + (y - z)`.
- By the triangle inequality for norms, `||(x - y) + (y - z)|| ≤ ||x - y|| + ||y - z||`.
- Substituting back the distance definition, we get `d(x, z) ≤ d(x, y) + d(y, z)`. The axiom holds.
-
3. Why is the Concept of a Metric Useful?
The power of defining a metric abstractly is that it allows us to measure "distance" in contexts far beyond standard Euclidean geometry.
-
Generalization: It applies to any set, including:
- Strings: The edit distance (or Levenshtein distance) between two strings (e.g., "apple" and "apply") is a metric. It counts the minimum number of edits (insertions, deletions, substitutions) needed to change one string into the other.
- Graphs: The shortest path distance between two nodes in a graph is a metric.
- Functions: We can define metrics to measure how "far apart" two functions are.
-
Foundation for Other Fields: The concept of a metric space is foundational for topology, analysis, and many areas of machine learning. For example, the k-Nearest Neighbors (k-NN) algorithm can work with any valid metric to find the "closest" data points, not just Euclidean distance.
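A small self-contained sketch of the edit distance mentioned above, implemented as a standard dynamic-programming routine (illustrative code, not from the lecture):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("apple", "apply"))    # 1
print(edit_distance("apple", "apple"))    # 0, consistent with d(x, x) = 0
print(edit_distance("kitten", "sitting")) # 3
```

It satisfies the three metric axioms, even though it does not come from any norm.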
4. Summary: Hierarchy of Spaces
This shows how these geometric concepts build upon each other:
Inner Product Space ⇒ Normed Space ⇒ Metric Space ⇒ Topological Space
- Every inner product induces a norm.
- Every norm induces a metric.
- Every metric induces a topology (a notion of "open sets" and "closeness").
However, the reverse is not always true. There are metrics (like edit distance) that do not come from a norm, and norms (like the L1-norm) that do not come from an inner product.
Part V: Orthogonal Projections
Orthogonal projection is a fundamental operation that finds the "closest" vector in a subspace to a given vector in the larger space. It is the geometric foundation for concepts like least-squares approximation.
1. The Concept of Orthogonal Projection
Let $x \in \mathbb{R}^n$ be a vector and let $U \subseteq \mathbb{R}^n$ be a subspace. We seek the vector in $U$ that is closest to $x$.
This projection, which we will call $p$, is uniquely characterized by two properties:
- Membership Property: The projection $p$ must lie within the subspace $U$.
  - ($p \in U$)
- Orthogonality Property: The vector connecting the original vector $x$ to its projection $p$ (the "error" vector $x - p$) must be orthogonal to the entire subspace $U$.
  - ($\langle x - p, u \rangle = 0$ for all $u \in U$)
This second property implies that $p$ is the point of $U$ closest to $x$, i.e. it minimizes the distance $\|x - p\|$.
2. Deriving the Projection Formula (The Normal Equation)
Our goal is to find an algebraic method to compute the projection vector $p$.
Step 1: Express the Membership Property using a Basis
First, we need a basis for the subspace $U$, say $\{b_1, \dots, b_m\}$.
Since the projection $p$ lies in $U$, it can be written as a linear combination of the basis vectors: $p = \lambda_1 b_1 + \dots + \lambda_m b_m$.
In matrix form, this is written as $p = B\lambda$, where $B = [b_1, \dots, b_m]$ and $\lambda = (\lambda_1, \dots, \lambda_m)^\top$.
Our problem is now reduced from finding the vector $p$ to finding the coordinate vector $\lambda$.
Step 2: Express the Orthogonality Property as an Equation
The orthogonality property states that $x - p$ must be orthogonal to every basis vector of $U$: $b_i^\top (x - p) = 0$ for $i = 1, \dots, m$.
This system of equations can be written compactly in matrix form. Notice that the rows of the matrix are the transposes of our basis vectors, which is exactly the matrix $B^\top$: $B^\top (x - p) = \mathbf{0}$.
Step 3: Combine and Solve for λ
We now have a system of two equations: $p = B\lambda$ and $B^\top (x - p) = \mathbf{0}$.
Substitute the first equation into the second: $B^\top (x - B\lambda) = \mathbf{0}$.
Distribute $B^\top$: $B^\top x - B^\top B \lambda = \mathbf{0}$.
Rearrange the terms to isolate the unknown $\lambda$: $B^\top B \lambda = B^\top x$.
This final result is known as the Normal Equation. It is a system of linear equations for the unknown coefficients $\lambda$.
3. The Algorithm for Orthogonal Projection
Given a vector `x` and a subspace `U`, the projection `p` of `x` onto `U` is computed as follows:
- Find a Basis: Find a basis `{b_1, ..., b_m}` for the subspace `U`.
- Form the Basis Matrix `B`: Create a matrix `B` whose columns are the basis vectors.
- Set up the Normal Equation: Compute the matrix `BᵀB` and the vector `Bᵀx`.
- Solve for `λ`: Solve the linear system `(BᵀB)λ = Bᵀx` to find the coefficient vector `λ`.
- Compute the Projection `p`: Calculate the final projection vector using the formula `p = Bλ`.

Important Note: The matrix `BᵀB` is square and is invertible if and only if the columns of `B` are linearly independent (which they are, because they form a basis).
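A minimal NumPy sketch of this algorithm, with an illustrative basis matrix and vector chosen for demonstration:

```python
import numpy as np

def project_onto_subspace(x, B):
    """Orthogonally project x onto the column space of B (columns = basis of U)."""
    lam = np.linalg.solve(B.T @ B, B.T @ x)   # normal equation: (B^T B) lam = B^T x
    return B @ lam                            # p = B lam

# Example: project onto the plane spanned by two vectors in R^3
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
x = np.array([6.0, 0.0, 0.0])
p = project_onto_subspace(x, B)
print(p)                  # projection of x onto span(B)
print(B.T @ (x - p))      # ~[0, 0]: the error vector is orthogonal to U
```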
4. Special Case: Orthonormal Basis
If the basis for `U` is orthonormal, the computation simplifies dramatically, because `BᵀB = I` (the identity matrix).
- The Normal Equation becomes: `λ = Bᵀx`.
- The projection formula becomes: `p = BBᵀx`.
The matrix `P = BBᵀ` is the projection matrix in this special case.
Lecture 4: Analytic Geometry: Orthonormal Basis, Orthogonal Complement, Inner Product of Functions, Orthogonal Projections, Rotations
Part I: Orthonormal Basis and Orthogonal Complement
1. Orthonormal Basis
- Foundation: In an n-dimensional vector space, a basis is a set of n linearly independent vectors. The inner product allows us to define geometric concepts like length and angle.
- Definition: An orthonormal basis is a special type of basis where all basis vectors are mutually orthogonal (perpendicular) and each basis vector is a unit vector (has a length of 1).
- Formally: For a basis $\{b_1, \dots, b_n\}$ of a vector space $V$:
  - Orthogonality: $\langle b_i, b_j \rangle = 0$ for all $i \neq j$.
  - Normalization: $\|b_i\| = 1$ for all $i$.
- Orthogonal Basis: If only the orthogonality condition holds (vectors are perpendicular but not necessarily unit length), the basis is called an orthogonal basis.
- Canonical Example: The standard Cartesian basis in $\mathbb{R}^n$ (e.g., $e_1, e_2, e_3$ in $\mathbb{R}^3$) is the most common example of an orthonormal basis.
- Example in $\mathbb{R}^2$: Any two orthogonal unit vectors, e.g. $\frac{1}{\sqrt{2}}(1, 1)^\top$ and $\frac{1}{\sqrt{2}}(-1, 1)^\top$, form an orthonormal basis, because their inner product is $0$ and each has norm $1$.
2. Gram-Schmidt Process: Constructing an Orthonormal Basis
The Gram-Schmidt process is a fundamental algorithm that transforms any set of linearly independent vectors (a basis) into an orthonormal basis for the same subspace.
- Goal: Given a basis $\{\mathbf{a}_1, \dots, \mathbf{a}_n\}$, produce an orthonormal basis $\{\mathbf{q}_1, \dots, \mathbf{q}_n\}$.
- Algorithm Steps:
  - Initialize: Start with the first vector. Normalize it to get the first orthonormal basis vector: $\mathbf{q}_1 = \mathbf{a}_1 / \|\mathbf{a}_1\|$.
  - Iterate and Orthogonalize: For each subsequent vector $\mathbf{a}_k$ (from $k = 2$ to $n$):
    a. Project and Subtract: Calculate the projection of $\mathbf{a}_k$ onto the subspace already spanned by the previously found orthonormal vectors $\mathbf{q}_1, \dots, \mathbf{q}_{k-1}$. Subtract this projection from $\mathbf{a}_k$ to get a vector that is orthogonal to that subspace.
    $$ \mathbf{v}_k = \mathbf{a}_k - \sum_{j=1}^{k-1} \langle \mathbf{a}_k, \mathbf{q}_j \rangle \mathbf{q}_j $$
    b. Normalize: Normalize the resulting orthogonal vector $\mathbf{v}_k$ to make it a unit vector.
    $$ \mathbf{q}_k = \frac{\mathbf{v}_k}{\|\mathbf{v}_k\|} $$
- Result: The set $\{\mathbf{q}_1, \dots, \mathbf{q}_n\}$ is an orthonormal basis for the same space spanned by the original vectors $\{\mathbf{a}_1, \dots, \mathbf{a}_n\}$. A code sketch follows below.
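A compact NumPy sketch of the steps above (illustrative input vectors, assuming NumPy is available):

```python
import numpy as np

def gram_schmidt(A):
    """Columns of A are linearly independent vectors a_1..a_n.
    Returns Q whose columns q_1..q_n are orthonormal and span the same space."""
    n = A.shape[1]
    Q = np.zeros_like(A, dtype=float)
    for k in range(n):
        v = A[:, k].copy()
        for j in range(k):
            v -= (A[:, k] @ Q[:, j]) * Q[:, j]   # subtract projection onto q_j
        Q[:, k] = v / np.linalg.norm(v)          # normalize
    return Q

A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
Q = gram_schmidt(A)
print(np.round(Q.T @ Q, 10))   # ~identity matrix: the columns are orthonormal
```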
3. Orthogonal Complement and Decomposition
The orthogonal complement generalizes the concept of perpendicularity from single vectors to entire subspaces.
-
Definition (Orthogonal Complement): Let $U$ be a subspace of a vector space $V$. The orthogonal complement of $U$, denoted $U^\perp$, is the set of all vectors in $V$ that are orthogonal to every vector in $U$.
-
Space Decomposition (Direct Sum): The entire space $V$ can be uniquely decomposed into the direct sum of the subspace and its orthogonal complement: $V = U \oplus U^\perp$. This implies that the intersection of $U$ and $U^\perp$ contains only the zero vector, and the sum of their dimensions equals the dimension of $V$.
-
Vector Decomposition (Orthogonal Decomposition): Based on the direct sum decomposition of the space, any vector $x$ in $V$ can be uniquely decomposed into the sum of a component in $U$ and a component in $U^\perp$.
- Conceptually: $x = x_U + x_{U^\perp}$, with $x_U \in U$ and $x_{U^\perp} \in U^\perp$.
- Basis-Level Representation: Computationally, this decomposition is expressed by finding the unique coordinates of $x$ with respect to a basis for $U$ and a basis for $U^\perp$: $x = \sum_m \lambda_m b_m + \sum_j \psi_j b_j^\perp$, where $(b_m)$ is a basis for $U$, $(b_j^\perp)$ is a basis for $U^\perp$, and $\lambda_m, \psi_j$ are the unique scalar coordinates.
- The component $x_U$ is precisely the orthogonal projection of $x$ onto the subspace $U$.
4. Applications and Examples of Orthogonal Complements
-
Geometric Interpretation: Orthogonal complements provide a powerful way to describe geometric objects.
- In $\mathbb{R}^3$, the orthogonal complement of a plane (a 2D subspace) is the line perpendicular to it (a 1D subspace), often called the normal line. The basis vector for this line is the plane's normal vector.
- More generally, in $\mathbb{R}^n$, the orthogonal complement of a hyperplane (an (n-1)-dimensional subspace) is a line, and vice-versa.
-
Example in $\mathbb{R}^3$:
- Let $U$ be the xy-plane.
- Its orthogonal complement $U^\perp$ is the set of all vectors orthogonal to the xy-plane, which is the z-axis.
- We can see that $\mathbb{R}^3 = U \oplus U^\perp$ and $\dim U + \dim U^\perp = 2 + 1 = 3$.
-
Clarification on Dimension: It's important to remember that if $U$ is a proper subspace of $\mathbb{R}^n$, its dimension must be strictly less than the dimension of $\mathbb{R}^n$ ($\dim U < n$), even though the vectors in $U$ have the same number of coordinates as the vectors in $\mathbb{R}^n$.
Part II: Inner Product of Functions, Orthogonal Projections, and Rotations
1. Inner Product of Functions
The concept of an inner product can be extended from the familiar space $\mathbb{R}^n$ to vector spaces of functions.
- Definition: For the vector space of continuous functions on an interval $[a, b]$, the standard inner product between two functions $u$ and $v$ is defined by the integral: $\langle u, v \rangle = \int_a^b u(x)\,v(x)\,dx$.
- Induced Geometry: This definition allows us to measure:
  - Length (Norm): $\|u\| = \sqrt{\langle u, u \rangle}$
  - Distance: $d(u, v) = \|u - v\|$
  - Orthogonality: Two functions are orthogonal if $\langle u, v \rangle = \int_a^b u(x)\,v(x)\,dx = 0$.
- Significance: This generalization is crucial for fields like signal processing and quantum mechanics, and it forms the mathematical basis for Fourier series, which decompose functions into a sum of orthogonal sine and cosine functions.
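A small numerical check of this idea, using SciPy's `quad` integrator (assuming NumPy and SciPy are available); it verifies that sine and cosine are orthogonal on $[-\pi, \pi]$:

```python
import numpy as np
from scipy.integrate import quad

# <sin, cos> on [-pi, pi]: should be ~0, i.e. the two functions are orthogonal
inner_sin_cos, _ = quad(lambda x: np.sin(x) * np.cos(x), -np.pi, np.pi)
# <sin, sin>: the squared "length" of sin on this interval (equals pi)
inner_sin_sin, _ = quad(lambda x: np.sin(x) * np.sin(x), -np.pi, np.pi)

print(round(inner_sin_cos, 10))   # ~0.0
print(round(inner_sin_sin, 10))   # ~3.1415926536
```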
2. Orthogonal Projections
Orthogonal projection is a core operation for finding the "best approximation" or "closest point" of a vector within a given subspace. It is the geometric foundation for solving least-squares problems.
- Concept: The orthogonal projection of a vector $x$ onto a subspace $U$ is the unique vector $p \in U$ such that the error vector $x - p$ is orthogonal to the entire subspace $U$.
- The Projection Formula:
  - Form a Basis Matrix: Create a matrix $B$ whose columns are a basis for the subspace $U$.
  - Solve the Normal Equation: Find the coordinate vector $\lambda$ by solving the system $B^\top B \lambda = B^\top x$.
  - Compute the Projection: The projection vector is then given by $p = B\lambda$.
- Projection Matrix: The linear transformation that maps any vector $x$ to its projection is represented by the projection matrix $P = B(B^\top B)^{-1} B^\top$.
- Special Case (Orthonormal Basis): If the columns of $B$ form an orthonormal basis, then $B^\top B = I$, and the formulas simplify significantly:
  - The coordinates are $\lambda = B^\top x$.
  - The projection is $p = B B^\top x$.
3. Rotations
Rotations are a fundamental class of geometric transformations that preserve the shape and size of objects. In linear algebra, they are represented by a special type of matrix.
- Definition: A rotation is a linear transformation, represented by a matrix $R$, that preserves lengths and angles.
- Matrix Properties: A matrix $R$ represents a pure rotation if it satisfies two conditions:
  - Orthogonality: The matrix must be orthogonal, meaning $R^\top R = I$. This ensures that lengths and angles are preserved.
  - Orientation Preservation: The determinant of the matrix must be +1, i.e., $\det(R) = +1$. (An orthogonal matrix with determinant -1 represents a reflection, which is a "mirror image" transformation.)
- 2D Rotation: The matrix for a counter-clockwise rotation by an angle $\theta$ in the 2D plane is $R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$.
- Group Structure: The set of all $n \times n$ rotation matrices forms a mathematical group known as the Special Orthogonal Group, denoted $SO(n)$.
Part III: A Detailed Explanation of Orthogonal Projections
Projections are a critically important class of linear transformations with wide-ranging applications in computer graphics, coding theory, statistics, and machine learning.
1. The Importance and Concept of Orthogonal Projections
- Motivation in Machine Learning: In machine learning, we often deal with high-dimensional data that is difficult to analyze or visualize. A key insight is that most of the relevant information is often contained within a much lower-dimensional subspace.
- Goal (Dimensionality Reduction): By projecting high-dimensional data onto a carefully chosen low-dimensional "feature space," we can simplify problems, reduce computational costs, and extract meaningful patterns. The goal is to minimize information loss while performing this projection.
- What is an Orthogonal Projection?
- It is a linear transformation that "drops" a vector from a high-dimensional space onto a lower-dimensional subspace.
- It is "orthogonal" because it preserves as much information as possible by minimizing the error (distance) between the original data and its projected image.
- This property makes it fundamental to linear regression, classification, and data compression.
2. Formal Definition and Properties of Projections
- Algebraic Definition (Idempotence): A linear mapping $\pi$ is called a projection if applying it twice has the same effect as applying it once: $\pi \circ \pi = \pi$. This is known as idempotence.
  - Matrix Form: A square matrix $P$ is a projection matrix if it satisfies $P^2 = P$.
- Geometric Definition (Closest Point): The orthogonal projection of a vector $x$ onto a subspace $U$, denoted $\pi_U(x)$, is the unique point in $U$ that is closest to $x$.
  - This "closest point" condition is equivalent to the orthogonality condition: the difference vector $x - \pi_U(x)$ must be orthogonal to every vector in the subspace $U$.
3. Projection onto a One-Dimensional Subspace (a Line)
We begin by deriving the projection formula for the simplest case: projecting a vector onto a line. Unless stated otherwise, we assume the standard dot product as the inner product.
- Setup: Let $U$ be a one-dimensional subspace (a line through the origin) spanned by a non-zero basis vector $b$.
- Derivation: The projection $\pi_U(x)$ must be a scalar multiple of $b$, i.e., $\pi_U(x) = \lambda b$. We can solve for the coordinate $\lambda$ by using the orthogonality condition $\langle x - \lambda b, b \rangle = 0$.
- Final Formulas for 1D Projection:
  - Coordinate: $\lambda = \dfrac{b^\top x}{b^\top b}$
  - Projection Vector: $\pi_U(x) = \lambda b = \dfrac{b b^\top}{b^\top b}\,x$
  - Projection Matrix: $P_\pi = \dfrac{b b^\top}{b^\top b}$
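A minimal sketch of the 1D case with illustrative vectors, also checking that the rank-1 projection matrix is idempotent:

```python
import numpy as np

def project_onto_line(x, b):
    """Project x onto the line spanned by b (standard dot product)."""
    lam = (b @ x) / (b @ b)        # coordinate lambda = b^T x / b^T b
    return lam * b

b = np.array([1.0, 2.0, 2.0])
x = np.array([1.0, 1.0, 1.0])
p = project_onto_line(x, b)
P = np.outer(b, b) / (b @ b)       # rank-1 projection matrix b b^T / (b^T b)
print(p)                           # same result as P @ x
print(np.allclose(P @ P, P))       # True: projection matrices are idempotent
```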
4. Projection onto a General Subspace
The three-step method for 1D projection can be generalized to any m-dimensional subspace $U \subseteq \mathbb{R}^n$.
- Setup: Assume we have a basis $\{b_1, \dots, b_m\}$ for $U$. Construct the basis matrix $B = [b_1, \dots, b_m]$.
- Derivation: The projection is $\pi_U(x) = B\lambda$. The orthogonality condition $B^\top (x - B\lambda) = \mathbf{0}$ leads to the Normal Equation.
- Final Formulas:
  - Normal Equation: $B^\top B \lambda = B^\top x$
  - Coordinates: $\lambda = (B^\top B)^{-1} B^\top x$
  - Projection Vector: $\pi_U(x) = B (B^\top B)^{-1} B^\top x$
  - Projection Matrix: $P_\pi = B (B^\top B)^{-1} B^\top$
Extension: Projections between Subspaces
5. Core Application I: Gram-Schmidt Orthogonalization
The Gram-Schmidt process is a classic algorithm for constructing an orthonormal basis, and its core idea is the repeated application of orthogonal projection.
-
Goal: To transform a set of linearly independent vectors $\{b_1, \dots, b_n\}$ into a set of orthogonal vectors $\{u_1, \dots, u_n\}$ that span the same subspace.
-
Iterative Construction:
- First Step: Start by choosing the first vector as the beginning of the new basis: $u_1 = b_1$.
- Subsequent Steps (k=2 to n): For each new vector $b_k$, subtract its projection onto the already constructed orthogonal subspace $\operatorname{span}(u_1, \dots, u_{k-1})$. The remaining component is guaranteed to be orthogonal to that subspace. Since the $u_j$ are already orthogonal, the projection simplifies to:
$$ u_k = b_k - \sum_{j=1}^{k-1} \frac{\langle b_k, u_j \rangle}{\langle u_j, u_j \rangle}\, u_j $$
-
Obtaining an Orthonormal Basis (ONB): After obtaining each orthogonal vector $u_k$, simply normalize it: $q_k = u_k / \|u_k\|$.
-
Example: Orthogonalize a given basis $\{b_1, b_2\}$ in $\mathbb{R}^2$ using the steps above.
- The resulting orthogonal basis is $\{u_1, u_2\}$, with $u_1 = b_1$ and $u_2 = b_2 - \frac{\langle b_2, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1$.
Extension: Cholesky Decomposition
6. Core Application II: Projection onto an Affine Subspace
So far, we have discussed projections onto subspaces that pass through the origin. We now generalize this to affine subspaces (e.g., lines or planes that do not pass through the origin).
-
Definition: An affine subspace $L$ can be represented as $L = x_0 + U$, where $x_0$ is a support point (a displacement vector) and $U$ is a direction subspace parallel to $L$ that passes through the origin.
Solution Strategy: Translate-Project-Translate Back
- Translate to the Origin: Subtract the support point
from both the point to be projected, , and the affine space . This transforms the problem into projecting the new vector onto the familiar direction subspace . - Standard Projection: Compute the orthogonal projection of the vector
onto the subspace , which is . - Translate Back: Translate the result back to its original position by adding the support point
.
- Translate to the Origin: Subtract the support point
-
Final Formula: $\pi_L(x) = x_0 + \pi_U(x - x_0)$
-
Mathematical Proof: We want to find the point
that minimizes the distance . - Since
, it can be written as for some . - Minimizing
is equivalent to minimizing . - By definition, the
that minimizes this distance is the projection of onto , so . - Therefore, the closest point is
.
- Since
-
Distance from a Point to an Affine Subspace: $d(x, L) = \|x - \pi_L(x)\| = d(x - x_0, U)$
This shows that the distance from a point to an affine space is equal to the distance from the translated point to its corresponding direction subspace.
7. Core Application III: Projections and Least Squares Solutions
(Related concept: the Moore–Penrose pseudo-inverse — for a matrix `A` with linearly independent columns, `A⁺ = (AᵀA)⁻¹Aᵀ` — which appears in the least-squares solution below.)
Orthogonal projection provides a powerful geometric framework for solving inconsistent linear systems of the form Ax=b
, forming the foundation of the Least Squares Method.
- The Root of the Problem: When the equation
Ax=b
has no solution, it means geometrically that the vectorb
does not lie within the column space of the matrixA
, denotedCol(A)
. - The Solution Strategy: Since it's impossible to find a point in
Col(A)
that is equal tob
, we settle for the next best thing: finding the point inCol(A)
that is closest tob
. - Projection is the Answer: By definition, this "closest point" is precisely the orthogonal projection of
b
ontoCol(A)
. We denote this projection as. - Solving the New Equation: We now solve a new system of equations that is guaranteed to have a solution:
The solution to this equation, , is the least-squares solution to the original problem. - Why is it called "Least Squares"? Because this solution
is the one that minimizes the squared length of the error vector, . The squared length of a vector is the sum of the squares of its components, hence the name "least squares."
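A minimal NumPy sketch of this idea on an illustrative overdetermined system, comparing the normal-equation solution with NumPy's built-in least-squares solver:

```python
import numpy as np

# Overdetermined system Ax = b with no exact solution (b is not in Col(A))
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 5.0])

# Least-squares solution via the normal equation A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Same answer from NumPy's built-in least-squares solver
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_hat, x_ref)      # the two solutions agree
print(A @ x_hat)         # = projection of b onto Col(A)
```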
Part IV: A Detailed Explanation of Rotations
Rotations are another important class of linear transformations, following projections, that play a central role in geometry, computer graphics, and robotics.
1. Fundamental Concepts of Rotation
- Relationship with Orthogonal Transformations: A rotation is a special case of an orthogonal transformation. The core characteristic of orthogonal transformations is that they preserve the length of vectors and the angles between them. Rotations perfectly fit this description.
- Definition: A rotation is a linear function that rotates a plane (or space) by a specific angle
around a fixed origin. - Directional Convention: By convention, a positive angle
corresponds to a counter-clockwise rotation. - Core Property: A rotation only changes the direction of a vector, not its distance from the origin.
2. Rotations in $\mathbb{R}^2$
2.1 Derivation of the Rotation Matrix: Two Perspectives
Perspective 1: Basis Change (The "Columns are Transformed Basis Vectors" View)
This is the standard derivation based on the nature of linear transformations.
- Goal: To find a matrix $R(\theta)$ that, when multiplied by a vector $x \in \mathbb{R}^2$, rotates that vector counter-clockwise by an angle $\theta$ around the origin.
- Key Idea: The columns of a linear transformation matrix are the new coordinates of the standard basis vectors after the transformation.
- Derivation Steps:
  - Identify Standard Basis Vectors: In $\mathbb{R}^2$, the standard basis is $e_1 = (1, 0)^\top$ and $e_2 = (0, 1)^\top$.
  - Rotate the Basis Vectors:
    - Rotating $e_1$ by $\theta$ yields the new coordinates $(\cos\theta, \sin\theta)^\top$.
    - Rotating $e_2$ by $\theta$ yields the new coordinates $(-\sin\theta, \cos\theta)^\top$.
  - Construct the Rotation Matrix: Use the transformed basis vectors as the columns of the rotation matrix: $R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$.
Perspective 2: Polar Coordinates and Trigonometric Identities (The "Direct Geometric" View)
This is a more direct geometric proof that does not rely on the idea of basis change.
- Represent the Vector: Express an arbitrary vector in polar coordinates: $x = (r\cos\phi,\; r\sin\phi)^\top$.
- Represent the Rotation: Rotating this vector by an angle $\theta$ changes its angle to $\phi + \theta$. The new coordinates are: $x' = (r\cos(\phi + \theta),\; r\sin(\phi + \theta))^\top$.
- Apply Angle-Sum Formulas: $r\cos(\phi + \theta) = r\cos\phi\cos\theta - r\sin\phi\sin\theta$ and $r\sin(\phi + \theta) = r\sin\phi\cos\theta + r\cos\phi\sin\theta$.
- Write in Matrix Form: $x' = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} x = R(\theta)\,x$.
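A tiny NumPy sketch of this matrix in action (illustrative angle and vector):

```python
import numpy as np

def rotation_2d(theta):
    """Counter-clockwise rotation matrix R(theta) in the plane."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

R = rotation_2d(np.pi / 2)              # 90° counter-clockwise
x = np.array([1.0, 0.0])
print(np.round(R @ x, 10))              # [0, 1]: e1 is rotated onto e2
print(round(np.linalg.norm(R @ x), 10)) # 1.0: the length is preserved
```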
2.2 Applications and Notes for Rotations
- Coordinate System Dependency: The derived matrix $R(\theta)$ only applies directly to vectors expressed in the standard Cartesian coordinate system.
- Rotating a Vector in a Non-Standard Basis: If your vector's coordinates $\tilde{x}$ are given in a non-standard basis $B$, the correct procedure is:
  - Change Basis to Cartesian: $x = B\tilde{x}$
  - Rotate with Standard Matrix: $x' = R(\theta)\,x$
  - (Optional) Change Basis Back: $\tilde{x}' = B^{-1} x'$
- Rotating a Vector Space: To rotate an entire vector space, you simply rotate all of its basis vectors. If the basis of a space is given by the columns of a matrix $B$, the new rotated basis matrix is $R(\theta)\,B$.
3. Rotations in $\mathbb{R}^3$
Rotations in 3D are more complex than in 2D because they must occur around an axis of rotation.
- Definition: In $\mathbb{R}^3$, a rotation is performed around a line (the axis) that passes through the origin. All points on the axis remain fixed during the rotation.
- Constructing an Arbitrary 3D Rotation Matrix: A general 3D rotation matrix $R$ can be constructed by determining the new positions of the standard basis vectors after rotation. These new vectors must remain orthonormal.
3.1 Directional Convention for 3D Rotations: The Right-Hand Rule
To define "counter-clockwise," we use the Right-Hand Rule:
- Rule: Point the thumb of your right hand in the positive direction of the axis of rotation. The direction in which your other four fingers curl is the direction of a positive angle (counter-clockwise) rotation.
3.2 Fundamental Rotations about the Coordinate Axes
Any complex 3D rotation can be decomposed into a sequence of fundamental rotations about the three primary axes (x, y, z).
- Rotation about the x-axis ($e_1$) by $\theta$
  - Description: The x-coordinate remains unchanged; the rotation occurs in the yz-plane.
  - Matrix: $R_x(\theta) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{pmatrix}$
- Rotation about the y-axis ($e_2$) by $\theta$
  - Description: The y-coordinate remains unchanged; the rotation occurs in the zx-plane.
  - Matrix: $R_y(\theta) = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix}$
- Rotation about the z-axis ($e_3$) by $\theta$
  - Description: The z-coordinate remains unchanged; the rotation occurs in the xy-plane.
  - Matrix: $R_z(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}$
3.3 Trigonometric Proof of the 3D Fundamental Rotation Matrices
- About x-axis: In the yz-plane, let
. After rotating by , the new coordinates are and . Expanding with angle-sum formulas while keeping derives the matrix. - About z-axis: In the xy-plane, let
. After rotating by , the new coordinates are and . Expanding derives the matrix. - About y-axis (Special Case):
- According to the right-hand rule, with the thumb pointing along +y, the fingers curl from the z-axis towards the x-axis.
- Therefore, a rotation from `x` to `z` is considered a negative angle. Alternatively, to achieve a positive rotation, the angle measured from `x` towards `z` should be taken as $-\theta$, not $\theta$.
- This explains why the signs of $\sin\theta$ in the matrix $R_y(\theta)$ are different from the other two.
3.4 Sequential Rotations in 3D
- Order Matters: In 3D and higher dimensions, rotation is not commutative. That is, in general $R_1 R_2 \neq R_2 R_1$. The order of application is critical.
- Matrix Multiplication Order: Rotations are applied sequentially from right to left. If you want to rotate a vector $x$ first about the x-axis, then the y-axis, and finally the z-axis, the combined rotation is calculated as $R = R_z(\gamma)\,R_y(\beta)\,R_x(\alpha)$, applied as $Rx$.
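A short NumPy sketch demonstrating non-commutativity and the right-to-left composition (illustrative angles and vector):

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

a, b = np.pi / 2, np.pi / 3
print(np.allclose(rot_z(b) @ rot_x(a), rot_x(a) @ rot_z(b)))   # False: order matters

# Rotate first about x, then about z: matrices are applied right-to-left
x = np.array([1.0, 1.0, 0.0])
print(rot_z(b) @ rot_x(a) @ x)
```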
4. Rotations in Higher Dimensions ( ): Givens Rotations
- Core Idea: Any rotation in n-dimensional space can be described as a rotation within a two-dimensional plane, while leaving the other $n - 2$ dimensions unchanged.
- An n-dimensional rotation matrix that performs a rotation in the
is denoted as . - Its structure is a modified identity matrix. Specifically,
is an 'n x n' matrix, and its form is as follows: - This matrix is an identity matrix everywhere except for four entries, where a standard 2D rotation matrix is embedded:
- An n-dimensional rotation matrix that performs a rotation in the
- Application in Practice:
- In algorithms (like QR decomposition), we typically do not pre-set $\theta$. Instead, we back-calculate the values of $\cos\theta$ and $\sin\theta$ to achieve a specific goal, such as zeroing out an entry in a vector.
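A minimal sketch of this trick, with a hypothetical vector and indices chosen only for illustration (the helper `givens` below is not a library routine):

```python
import numpy as np

def givens(n, i, j, theta):
    """n x n Givens rotation acting in the (i, j)-plane."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = c; G[j, j] = c
    G[i, j] = -s; G[j, i] = s
    return G

# Back-calculate the angle so that the j-th entry of v is zeroed out
v = np.array([3.0, 0.0, 4.0])
i, j = 0, 2
theta = np.arctan2(v[j], v[i])     # angle determined by the entries themselves
G = givens(len(v), i, j, -theta)   # rotate the (i, j)-plane to kill v[j]
print(np.round(G @ v, 10))         # [5, 0, 0]
```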
5. General Properties of Rotations
- Orthogonality: Rotation matrices $R$ are always orthogonal matrices, satisfying $R^\top R = I$, which means $R^{-1} = R^\top$.
- Isometry (Distance Preservation): $\|Rx - Ry\| = \|x - y\|$. Rotations do not change the distance between points.
- Angle Preservation: The angle between vectors is unchanged after rotation.
- Commutativity:
  - In $\mathbb{R}^2$, rotations are commutative: $R(\alpha)R(\beta) = R(\beta)R(\alpha)$.
  - In $\mathbb{R}^3$ and higher dimensions, rotations are not commutative.