Matrix Approximation

Goal: Matrix Completion
Assume we have a sparse matrix A

Can we fill in the missing entries and reconstruct the matrix?

What makes a matrix a matrix (and not just a vector)?

The identity of columns and rows!

So our minimal assumption must be that entries within the same row or the same column are not independent.

We do this via conditional independence:
4-archive/cil/theory/old/assets/02-conditional-independence-def.png

Recommender Systems

We analyze patterns of interest in items (products, movies) and provide personalized recommendations for users.

Collaborative Filtering

We want to exploit collective data from many users and generalize across users, and possibly across items.
Example applications are Amazon, Netflix, online advertising, etc.

Netflix Data

Input: user-item preferences stored in a matrix
4-archive/cil/theory/old/assets/02-netflix-data-vis.png
The sparsity level can be very low: often only 1% of the entries or fewer are observed

Formal definition:

$n$: number of users (rows)
$m$: number of items (columns)
$\Omega$: observation matrix, where $\omega_{ij} = 1$ iff $a_{ij}$ is observed
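
As a small illustration (not from the lecture), here is a minimal NumPy sketch of this setup, assuming the ratings arrive as (user, item, rating) triples; all names and data are illustrative:

```python
import numpy as np

# Minimal sketch: build a rating matrix A and observation matrix Omega from
# hypothetical (user, item, rating) triples. Names and data are illustrative.
n, m = 4, 5                            # n users (rows), m items (columns)
triples = [(0, 1, 4.0), (0, 3, 2.0), (2, 0, 5.0), (3, 4, 1.0)]

A = np.zeros((n, m))
Omega = np.zeros((n, m), dtype=bool)   # omega_ij = 1 iff a_ij is observed
for i, j, rating in triples:
    A[i, j] = rating
    Omega[i, j] = True

sparsity = Omega.mean()                # fraction of observed entries, here 4/20
```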

Preprocessing / Normalization

Centering

make rows/columns more comparable and subtract out rating bias

Mean rating per user/item
4-archive/cil/theory/old/assets/02-mean-row-col.png
This lets us center the data:
4-archive/cil/theory/old/assets/02-data-centering.png
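
A minimal sketch of row (user) centering, assuming `A` and `Omega` as in the snippet above and that means are taken only over observed entries; column (item) centering works analogously on the transpose:

```python
import numpy as np

def center_rows(A, Omega):
    """Subtract each user's mean rating, computed over observed entries only."""
    A_centered = A.astype(float).copy()
    row_means = np.zeros(A.shape[0])
    for i in range(A.shape[0]):
        obs = Omega[i]
        if obs.any():
            row_means[i] = A[i, obs].mean()      # mean rating of user i
            A_centered[i, obs] -= row_means[i]   # remove the user's rating bias
    return A_centered, row_means
```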

Variance Normalization

$\mu = \mathbb{E}[X]$
$\sigma^2 = \mathrm{Var}[X]$
Then the normalized scores or z-scores are:
$Z = \frac{X - \mu}{\sigma}$
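
A short sketch of z-scoring the observed entries, here with a single global mean and standard deviation (per-user or per-item statistics would be handled the same way); `A` and `Omega` are assumed from the snippets above:

```python
import numpy as np

obs = A[Omega]                          # observed ratings only
mu, sigma = obs.mean(), obs.std()       # empirical mean and standard deviation
Z = np.zeros_like(A, dtype=float)
Z[Omega] = (A[Omega] - mu) / sigma      # z-scores on the observed entries
```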

Convexity Definition

A set in a vector space or affine space is convex if the line segment connecting any two points in the set is also in the set

A function $f$ is convex over a domain $R$ if for all $x, y \in R$: $f(tx + (1-t)y) \le t f(x) + (1-t) f(y), \quad \forall t \in [0, 1]$

If $f$ is differentiable, convexity is equivalent to the condition:
$f(x) \ge f(y) + \nabla f(y)^T (x - y), \quad \forall x, y \in R$
If $f$ is twice differentiable, convexity on a convex domain $R$ is equivalent to the condition:
$\nabla^2 f(x) \succeq 0, \quad \forall x \in \mathrm{int}(R)$

Our scalar problem is non-convex (exception: a=0), as there is a saddle point at the origin.
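
A quick numerical check of the saddle point, assuming the scalar objective $l(u,v) = \frac{1}{2}(uv - a)^2$ (consistent with the gradient flow below); its Hessian at the origin has eigenvalues $\pm a$, hence it is indefinite for any $a \neq 0$:

```python
import numpy as np

a = 3.0                                # any nonzero a gives a saddle at the origin
H0 = np.array([[0.0, -a],
               [-a, 0.0]])             # Hessian of 1/2 (uv - a)^2 at (u, v) = (0, 0)
print(np.linalg.eigvalsh(H0))          # [-3.  3.]: one negative, one positive
```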

Convexity is a fundamental property: it marks the demarcation line in optimization between tractable and (possibly) intractable problems

Gradients

Vector / Matrix notation
$\nabla_u l = Rv, \quad \nabla_v l = R^T u, \quad R := uv^T - A$
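
A sketch that checks these gradient formulas against finite differences for $l(u,v) = \frac{1}{2}||uv^T - A||_F^2$; the random test data is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.normal(size=(n, m))
u, v = rng.normal(size=n), rng.normal(size=m)

def loss(u, v):
    return 0.5 * np.linalg.norm(np.outer(u, v) - A, 'fro') ** 2

R = np.outer(u, v) - A                 # residual R := u v^T - A
grad_u, grad_v = R @ v, R.T @ u        # closed-form gradients from above

eps = 1e-6                             # central finite differences in u
fd_u = np.array([(loss(u + eps * e, v) - loss(u - eps * e, v)) / (2 * eps)
                 for e in np.eye(n)])
assert np.allclose(grad_u, fd_u, atol=1e-4)
```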

Hessian Map

$\nabla^2 l(u,v) = \left( \begin{matrix}
||v||^2 I_{n \times n} & 2uv^T - A \\
(2uv^T - A)^T & ||u||^2 I_{m \times m}
\end{matrix} \right)$

At the origin:

$\nabla^2 l(0,0) = \left( \begin{matrix}
0 & -A \\
-A^T & 0
\end{matrix} \right)$
This generalizes the scalar case to the multidimensional setting. The Hessian at zero is not positive semi-definite, unless A is nilpotent (has all zero eigenvalues).

This problem is non-convex for all dimensions $m, n \geq 1$
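
A sketch that builds the block Hessian at the origin for a random A (illustrative data only); its nonzero eigenvalues come in pairs $\pm\sigma_i(A)$, so for a generic A it is indefinite:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
A = rng.normal(size=(n, m))

H0 = np.block([[np.zeros((n, n)), -A],
               [-A.T, np.zeros((m, m))]])      # Hessian at (u, v) = (0, 0)

eigs = np.sort(np.linalg.eigvalsh(H0))
sigmas = np.linalg.svd(A, compute_uv=False)
print(eigs)                                    # +/- sigma_i(A), plus zeros
print(sigmas)
```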

Gradient Flow

Can we gain analytic insights into the gradient dynamics?

Use ODEs to study the gradient flow, i.e. the gradient dynamics in the limit of small step sizes

Assume $u = v$ (balanced initialization), so that both evolve identically.
We look at $x = uv = u^2$:
$\frac{du}{dt} = -v(uv - a) \quad \Rightarrow \quad \frac{dx}{dt} = \frac{du^2}{dt} = -2uv(uv - a) = -2x(x - a)$
This ODE can be solved in closed form, yielding:
$x(t) = \frac{a e^{2at}}{e^{2at} - 1 + (a/c)} = a + \frac{ac - a^2}{c e^{2at} + a - c}$, where $c = x(0)$.
4-archive/cil/theory/old/assets/ode-solution.png
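
A sketch that integrates the gradient flow with a simple Euler scheme and compares it with the closed-form solution; the values of a, c and the step size are illustrative:

```python
import numpy as np

a, c = 2.0, 0.1                        # target value a and initialization x(0) = c
dt, T = 1e-4, 5.0
ts = np.arange(0.0, T, dt)

xs = np.empty_like(ts)
x = c
for k in range(len(ts)):
    xs[k] = x
    x += dt * (-2.0 * x * (x - a))     # Euler step of dx/dt = -2x(x - a)

closed_form = a * np.exp(2 * a * ts) / (np.exp(2 * a * ts) - 1 + a / c)
print(np.max(np.abs(xs - closed_form)))   # small for small dt; x(t) -> a
```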

Fully observed Rank 1 Model

Useful identity:
$||R||_F^2 = \mathrm{tr}(R^T R)$
We can then rewrite the objective:
$l(u,v) = \frac{1}{2}||A - uv^T||_F^2 = \frac{1}{2}||A||_F^2 + \frac{1}{2}||u||^2 ||v||^2 - u^T A v$
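
A small numerical check (random test matrices, illustrative only) of the trace identity and of the expanded objective above:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))
u, v = rng.normal(size=4), rng.normal(size=3)

R = A - np.outer(u, v)
assert np.isclose(np.linalg.norm(R, 'fro') ** 2, np.trace(R.T @ R))

lhs = 0.5 * np.linalg.norm(A - np.outer(u, v), 'fro') ** 2
rhs = 0.5 * np.linalg.norm(A, 'fro') ** 2 + 0.5 * (u @ u) * (v @ v) - u @ A @ v
assert np.isclose(lhs, rhs)            # matches the expansion of the objective
```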

Optimality Conditions

The directions of $u, v$ are determined purely by the last term:
$\max_{(u,v)} \, u^T A v, \quad \text{s.t. } ||u|| = ||v|| = 1$
This can be solved with Lagrange multipliers:
$\mathcal{L} = u^T A v - \lambda \, u^T u - \mu \, v^T v$
First order optimality condition:
$\nabla_u \mathcal{L} = Av - 2\lambda u \overset{!}{=} 0 \quad \Rightarrow \quad u = \frac{Av}{||Av||}$

Eigenvector Equations

By putting both optimality conditions together we get the eigenvector equations:
$u \propto (AA^T)\,u, \quad v \propto (A^T A)\,v$
$u$ should be proportional to the principal eigenvector of $AA^T$
$v$ should be proportional to the principal eigenvector of $A^T A$
Power iterations can be used for computation.

Generalizing this result leads to the Singular Value Decomposition (SVD)
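
A sketch of power iteration recovering the leading singular directions, compared against NumPy's SVD (random A for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))

def power_iteration(M, num_iters=500):
    """Return (approximately) the principal eigenvector of a PSD matrix M."""
    x = rng.normal(size=M.shape[0])
    for _ in range(num_iters):
        x = M @ x
        x /= np.linalg.norm(x)
    return x

u = power_iteration(A @ A.T)           # principal eigenvector of A A^T
v = power_iteration(A.T @ A)           # principal eigenvector of A^T A

U, s, Vt = np.linalg.svd(A)
print(abs(u @ U[:, 0]), abs(v @ Vt[0]))   # both close to 1 (up to sign)
```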