The Operator-Theoretic View of Neural Networks
Context & Rationale
The curriculum emphasizes a "coordinate-free" approach to linear algebra. In modern research (e.g., LoRA, Geometric Deep Learning), we treat layers as operators acting on function spaces. This assignment tests your ability to think about properties like "rank" and "eigenvalues" as intrinsic geometric properties rather than artifacts of a specific basis.
Problems & Tasks
Let $V$ be a finite-dimensional vector space over $\mathbb{R}$ or $\mathbb{C}$, and let $T \in \mathcal{L}(V)$ be a linear operator (representing a recurrent weight matrix).
- Proof: Prove that if the spectral radius $\rho(T) < 1$, then $\lim_{n \to \infty} \|T^n v\| = 0$ for all $v \in V$, regardless of the norm chosen.
- Application: Consider a Recurrent Neural Network (RNN) with a linear activation function ($\sigma(x) = x$), so the hidden-state update is $h_{t+1} = W h_t$. Using the result above, formally explain why the state vector $h_t$ decays to zero if all eigenvalues of $W$ lie inside the unit circle. (A numerical sketch of this decay appears below.)
- Extension: Why does the Spectral Theorem (Axler Ch 7) imply that for symmetric (Hermitian) weight matrices, the operator norm equals the largest eigenvalue in absolute value, $\|W\|_{\mathrm{op}} = \max_i |\lambda_i|$? How does this simplify the analysis of gradient explosion?
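For intuition (not a proof), here is a minimal NumPy sketch of the Application item: it rescales a random recurrent matrix so its spectral radius sits below 1, iterates the linear update $h_{t+1} = W h_t$, and watches the state norm decay. The dimension, seed, scaling factor, and number of steps are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random recurrent weight matrix, rescaled so its spectral radius is 0.9 < 1.
W = rng.standard_normal((50, 50))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

# Iterate the linear RNN h_{t+1} = W h_t from a random initial state.
h = rng.standard_normal(50)
norms = []
for t in range(200):
    h = W @ h
    norms.append(np.linalg.norm(h))

print(f"spectral radius ~ {max(abs(np.linalg.eigvals(W))):.3f}")
print(f"||h_10|| = {norms[9]:.3e}, ||h_200|| = {norms[-1]:.3e}")  # norm decays toward 0
```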
The "Manifold Hypothesis" suggests data lies on low-dimensional subspaces. SVD is the tool to find them.
- Derivation: Starting from the Spectral Theorem for the positive operator $T^*T$, rigorously derive the Singular Value Decomposition for an arbitrary operator $T$.
- Low-Rank Approximation: Prove the Eckart-Young-Mirsky theorem for the Frobenius norm: the best rank-$k$ approximation of a matrix $A$ is $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^*$, the truncation of its SVD to the top $k$ singular values. (A numerical check of this statement appears below.)
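As a sanity check (not a proof), the sketch below uses NumPy's built-in np.linalg.svd rather than the from-scratch version asked for in the blog-post task: it truncates a random matrix to rank $k$, verifies that the Frobenius error equals $\sqrt{\sum_{i>k}\sigma_i^2}$, and compares against an arbitrary competing rank-$k$ matrix. The sizes and the choice $k=5$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 30))
k = 5

# Rank-k truncation of the SVD: keep the top-k singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young-Mirsky: the Frobenius error of the truncation equals the
# l2 norm of the discarded singular values ...
err_svd = np.linalg.norm(A - A_k, "fro")
print(err_svd, np.sqrt(np.sum(s[k:] ** 2)))   # the two numbers agree

# ... and a random competitor of rank k does no better (almost surely worse).
B = rng.standard_normal((40, k)) @ rng.standard_normal((k, 30))
print(np.linalg.norm(A - B, "fro") >= err_svd)  # True
```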
Blog Post: "The Compression Instinct"
- Implement SVD from scratch in NumPy (using np.linalg.eig on $A^\top A$ as a primitive, but assembling the factors $U$, $\Sigma$, $V^\top$ yourself); a sketch of this assembly appears below.
- Take a high-dimensional weight matrix from a pre-trained open-source model (e.g., a layer from BERT-tiny or a small ResNet). Compute its singular value spectrum. Plot the cumulative energy of the singular values.
- Write a post discussing "Intrinsic Dimensionality." If 90% of the variance is captured by the top 10% of singular values, what does this imply about the redundancy of neural networks? Relate this to the "Low-Rank Adaptation" (LoRA) technique.
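A minimal sketch of the from-scratch assembly, using np.linalg.eig on $A^\top A$ as the only spectral primitive. The random matrix `W` below is a stand-in; load an actual pre-trained weight matrix (e.g., a BERT-tiny layer) yourself. The helper name `svd_from_scratch` and the tolerance are illustrative choices, not part of the assignment.

```python
import numpy as np

def svd_from_scratch(A, tol=1e-12):
    """Assemble U, Sigma, V^T from the eigendecomposition of A^T A.

    np.linalg.eig is the only spectral primitive used; the eigenvalues of the
    symmetric matrix A^T A are real, so spurious imaginary parts are dropped.
    """
    gram = A.T @ A
    eigvals, V = np.linalg.eig(gram)
    eigvals, V = eigvals.real, V.real

    # Sort eigenpairs by decreasing eigenvalue; singular values are their square roots.
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    s = np.sqrt(np.clip(eigvals, 0.0, None))

    # Left singular vectors: u_i = A v_i / sigma_i for the non-degenerate directions.
    keep = s > tol
    U = (A @ V[:, keep]) / s[keep]
    return U, s[keep], V[:, keep].T

# Stand-in for a pre-trained weight matrix.
W = np.random.default_rng(2).standard_normal((128, 64))
U, s, Vt = svd_from_scratch(W)
print(np.allclose(U @ np.diag(s) @ Vt, W, atol=1e-6))  # reconstruction check

# Cumulative "energy" of the singular values (fraction of the squared Frobenius
# norm captured by the top-i values); plot this curve with matplotlib for the post.
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
print(energy[:5])
```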
- Volume Concentration: Derive the formula for the volume of a $d$-dimensional hypersphere of radius $r$. Prove that as $d \to \infty$, the volume of the unit sphere concentrates at the equator (or that the ratio of the sphere's volume to the enclosing cube's volume goes to zero).
- Implication: In a short essay, explain how this "curse of dimensionality" affects nearest-neighbor search in vector databases (a key component of RAG systems). Why does Euclidean distance lose its discriminative power in 1000-dimensional spaces? (A numerical sketch of both effects appears below.)
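A numerical sketch of both effects, for intuition only. It uses the closed-form ball volume $V_d(r) = \pi^{d/2} r^d / \Gamma(d/2 + 1)$ (the formula the derivation task asks for) to show the ball-to-cube volume ratio vanishing, then samples random points to show that the relative gap between the nearest and farthest neighbor of a query shrinks as the dimension grows. Sample sizes, dimensions, and seeds are arbitrary.

```python
import numpy as np
from math import gamma, pi

def ball_volume(d, r=1.0):
    """Volume of the d-dimensional ball of radius r: pi^(d/2) r^d / Gamma(d/2 + 1)."""
    return pi ** (d / 2) * r ** d / gamma(d / 2 + 1)

# Ratio of the unit ball's volume to its bounding cube [-1, 1]^d: goes to zero fast.
for d in (2, 10, 50, 100):
    print(d, ball_volume(d) / 2 ** d)

# Distance concentration: in high dimensions the nearest and farthest neighbors
# of a query are nearly equidistant, so Euclidean distance loses contrast.
rng = np.random.default_rng(3)
for d in (2, 100, 1000):
    X = rng.standard_normal((2000, d))
    q = rng.standard_normal(d)
    dists = np.linalg.norm(X - q, axis=1)
    print(d, (dists.max() - dists.min()) / dists.min())  # relative contrast shrinks
```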
Deliverables
- A LaTeX-typeset problem set (PDF)
- A companion blog post with its visualizations