Problem 1.1 Linear Transformation of a Random Variable

LOTUS

LOTUS (Law of the Unconscious Statistician): for $y = g(x)$, the expectation of $y$ can be computed directly under the density of $x$:

$$\mathbb{E}[g(x)] = \int g(x)\, p_{X}(x)\, dx$$

It’s all about sign cancellation.

Transformation of probability density

The probability density function is defined as the derivative of the cumulative distribution function:

$$p_{X}(x) := \frac{d}{dx} P(X \le x)$$

Given $y = g(x)$, to transform $p_{X}(x)$ into a probability density over $y$, start from the cumulative probability $P_{Y}(y) = P(g(x) \le y)$.

If $g$ is monotonically decreasing in $x$:

$$P_{Y}(y) = P(x \ge g^{-1}(y)) = 1 - P_{X}(g^{-1}(y))$$

If $g$ is monotonically increasing in $x$:

$$P_{Y}(y) = P(x \le g^{-1}(y)) = P_{X}(g^{-1}(y))$$

Take the derivative of $P_{Y}(y)$ with respect to $y$ and apply the chain rule. With $x = g^{-1}(y)$:

$$p_{Y}(y) = \pm\, p_{X}(g^{-1}(y))\, \frac{dx}{dy}$$

(the minus sign belongs to the decreasing case, where $\frac{dx}{dy} < 0$),

and both cases generate the same expression:

$$p_{Y}(y) = p_{X}(x) \left| \frac{dx}{dy} \right|$$

So for piecewise monotonic $g$, summing over all roots $x_{k}$ of $g(x) = y$:

$$p_{Y}(y) = \sum_{k} p_{X}(x_{k}) \left| \frac{dx}{dy} \right|_{x = x_{k}}$$

Proof of LOTUS

This leads to LOTUS: substituting $y = g(x)$, $dy = g'(x)\,dx$ into the expectation of $y$ gives

$$\mathbb{E}[y] = \int y\, p_{Y}(y)\, dy = \int g(x)\, p_{X}(x) \left| \frac{dx}{dy} \right| g'(x)\, dx = \int g(x)\, p_{X}(x)\, dx$$

The potential negative sign of $g'(x)$ in $\left|\frac{dx}{dy}\right| g'(x)$ cancels with the potential flipping of the integration limits caused by the change of variables in the definite integral, so the last equality holds for both increasing and decreasing $g$.
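A quick numerical sanity check of the change-of-variables formula and LOTUS, using the (assumed, purely illustrative) example $y = g(x) = e^{x}$ with $x \sim \mathcal{N}(0, 1)$:

```python
# Check LOTUS and p_Y(y) = p_X(g^{-1}(y)) |dx/dy| for y = exp(x), x ~ N(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)
y = np.exp(x)

# LOTUS: E[g(x)] estimated from samples of x vs. the closed form E[exp(x)] = exp(1/2).
print(y.mean(), np.exp(0.5))

# Change of variables: p_Y(y) = p_X(ln y) * |d(ln y)/dy| = p_X(ln y) / y,
# which should coincide with the standard log-normal density.
ys = np.linspace(0.5, 3.0, 6)
p_y_formula = stats.norm.pdf(np.log(ys)) / ys
p_y_lognormal = stats.lognorm.pdf(ys, s=1.0)
print(np.allclose(p_y_formula, p_y_lognormal))   # True
```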

Solutions

Scalar case

Proof of $\mathbb{E}[ax+b] = a\,\mathbb{E}[x] + b$

Definition of expectation:

$$\mathbb{E}[x] := \mu := \int x\, p(x)\, dx$$

Applying LOTUS:

$$\mathbb{E}[ax+b] = \int (ax+b)\, p(x)\, dx = a\int x\, p(x)\, dx + b\int p(x)\, dx = a\mu + b$$

Thus:

$$\mathbb{E}[ax+b] = a\,\mathbb{E}[x] + b$$

Proof of $\mathrm{var}(ax+b) = a^{2}\,\mathrm{var}(x)$

Definition of variance:

$$\mathrm{var}(x) := \sigma^{2} := \mathbb{E}\left[ \left( x - \mu \right)^{2} \right]$$

Applying LOTUS, a direct consequence of the definition of variance is:

$$\mathrm{var}(x) = \mathbb{E}\left[ x^{2} - 2\mu x + \mu^{2} \right] = \mathbb{E}[x^{2}] - \mu^{2}$$

so:

$$\begin{align} \mathrm{var}(x+b) &= \mathbb{E}\left[(x+b)^{2}\right] - \left(\mathbb{E}[x+b]\right)^{2} \\ &= \mathbb{E}[x^{2} + 2bx + b^{2}] - \left( \mu^{2} + 2b\mu + b^{2} \right) \\ &= \mathbb{E}[x^{2}] - \mu^{2} = \mathrm{var}(x) \end{align}$$

Similarly, $\mathrm{var}(ax) = \mathbb{E}[a^{2}x^{2}] - (a\mu)^{2} = a^{2}\left( \mathbb{E}[x^{2}] - \mu^{2} \right) = a^{2}\,\mathrm{var}(x)$. Thus:

$$\mathrm{var}(ax+b) = \mathrm{var}(ax) = a^{2}\,\mathrm{var}(x)$$

Vector case

Let’s say $x$ is a random vector with mean $\mu$ and covariance $\Sigma$, and $y = Ax + b$ for a constant matrix $A$ and a constant vector $b$.

Proof of $\mathbb{E}[Ax+b] = A\,\mathbb{E}[x] + b$:

Definition of expectation on a vector/matrix of random variables (element-wise):

$$\mathbb{E}[X]_{ij} := \mathbb{E}[X_{ij}]$$

In the case of vectors or matrices, addition is element-wise, so it is easy to see that:

$$\mathbb{E}[x + b] = \mathbb{E}[x] + b$$

When it comes to scalar-vector multiplication, assume $a$ is a constant vector and $x$ is a scalar random variable; then the $i$-th element of the vector $ax$ is $a_{i}x$, with $\mathbb{E}[ax]_{i} = a_{i}\,\mathbb{E}[x]$,

so $\mathbb{E}[ax] = a\,\mathbb{E}[x]$.

When it comes to matrix-vector multiplication, applying linearity of expectation in scalar space:

$$\mathbb{E}[Ax]_{i} = \mathbb{E}\left[ \sum_{j} A_{ij} x_{j} \right] = \sum_{j} A_{ij}\, \mathbb{E}[x_{j}] = \left( A\,\mathbb{E}[x] \right)_{i}$$

thus:

$$\mathbb{E}[Ax + b] = A\,\mathbb{E}[x] + b$$

The definition of the covariance matrix is:

$$\mathrm{cov}(x) := \Sigma := \mathbb{E}\left[ (x - \mu)(x - \mu)^{T} \right]$$

This matches the outer-product rule $(uv^{T})_{ij} = u_{i}v_{j}$, so the $(i,j)$-th entry is $\mathbb{E}\left[ (x_{i}-\mu_{i})(x_{j}-\mu_{j}) \right] = \mathrm{cov}(x_{i}, x_{j})$.

Applying the rule $\mathbb{E}[Ax+b] = A\,\mathbb{E}[x] + b$, the mean of $Ax + b$ is $A\mu + b$, so:

$$\mathrm{cov}(Ax+b) = \mathbb{E}\left[ (Ax + b - A\mu - b)(Ax + b - A\mu - b)^{T} \right] = \mathbb{E}\left[ A(x-\mu)(x-\mu)^{T}A^{T} \right]$$

Applying the rule that constant matrices move outside the (element-wise) expectation on both sides:

$$\mathbb{E}\left[ A(x-\mu)(x-\mu)^{T}A^{T} \right] = A\,\mathbb{E}\left[ (x-\mu)(x-\mu)^{T} \right] A^{T}$$

With the results above, we have:

$$\mathbb{E}[Ax+b] = A\,\mathbb{E}[x] + b, \qquad \mathrm{cov}(Ax+b) = A\,\mathrm{cov}(x)\,A^{T}$$
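A minimal Monte Carlo sanity check of the two results; the particular $A$, $b$, and distribution of $x$ below are arbitrary illustrative choices:

```python
# Check E[Ax + b] = A E[x] + b and cov(Ax + b) = A cov(x) A^T by sampling.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=500_000)   # samples of x, shape (N, 2)
y = x @ A.T + b                                        # y = Ax + b, applied row-wise

print(np.allclose(y.mean(axis=0), A @ mu + b, atol=0.05))               # True
print(np.allclose(np.cov(y, rowvar=False), A @ Sigma @ A.T, atol=0.1))  # True
```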

Problem 1.2 Properties of Independence

Multivariate version of LOTUS (continuous case):

$$\mathbb{E}[f(x, y)] = \iint f(x, y)\, p(x, y)\, dx\, dy$$

Independence implies the joint probability density is the product of the marginal probability densities:

$$p(x, y) = p(x)\, p(y)$$

By applying both, LOTUS gives:

$$\mathbb{E}[xy] = \iint xy\, p(x)\, p(y)\, dx\, dy = \int x\, p(x)\, dx \int y\, p(y)\, dy = \mathbb{E}[x]\,\mathbb{E}[y]$$

So for $\mathrm{cov}(x, y) = \mathbb{E}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$, it will be zero if $x$ and $y$ are independent:

$$\mathrm{cov}(x, y) = \mathbb{E}[x]\,\mathbb{E}[y] - \mathbb{E}[x]\,\mathbb{E}[y] = 0$$

Problem 1.3 Uncorrelated vs Independence

Relationship: uncorrelated means $\mathrm{cov}(x, y) = 0$, i.e. the set $\left\{ (x, y) \mid \mathrm{cov}(x, y) = 0 \right\}$. Uncorrelated $\not\Rightarrow$ independence, while independence $\Rightarrow$ uncorrelated.

Consider the counterexample where $(x, y)$ is uniform over the four points $(1, 0)$, $(0, 1)$, $(-1, 0)$, $(0, -1)$, each with probability $\frac{1}{4}$. Then $x$ and $y$ are uncorrelated:

$$\begin{align} \mu_{x} &= \sum_{(x,y)} \frac{1}{4}\, x = \frac{1}{4} \left(1 + 0 + (-1) + 0\right) = 0 \\ \mu_{y} &= \sum_{(x,y)} \frac{1}{4}\, y = \frac{1}{4} \left(0 + 1 + 0 + (-1)\right) = 0 \\ \mathrm{cov}(x, y) &= \sum_{(x,y)} \frac{1}{4}\, (x - \mu_{x})(y - \mu_{y}) \\ &= \frac{1}{4} \sum_{(x,y)} xy \\ &= \frac{1}{4} \left[ (1 \times 0) + (0 \times 1) + ((-1) \times 0) + (0 \times (-1)) \right] \\ &= 0 \end{align}$$

Marginal distributions:

| $p(x,y)$ | $y=-1$ | $y=0$ | $y=1$ | $p(x)$ |
| --- | --- | --- | --- | --- |
| $x=-1$ | $0$ | $1/4$ | $0$ | $1/4$ |
| $x=0$ | $1/4$ | $0$ | $1/4$ | $1/2$ |
| $x=1$ | $0$ | $1/4$ | $0$ | $1/4$ |
| $p(y)$ | $1/4$ | $1/2$ | $1/4$ | $1$ |

It is apparent that $p(x=0)\, p(y=0) = \frac{1}{4}$, but $p(x, y) = 0$ at $(x, y) = (0, 0)$, so $p(x, y) \neq p(x)\,p(y)$ and $x$ and $y$ are not independent.
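A small check of the table above (with the assumed support $\{(1,0), (0,1), (-1,0), (0,-1)\}$, probability $1/4$ each): the covariance is zero, yet the joint is not the product of the marginals:

```python
# Uncorrelated but not independent, verified on the discrete table.
import numpy as np

vals = [-1, 0, 1]
p = np.zeros((3, 3))                         # p[i, j] = p(x = vals[i], y = vals[j])
for (x, y) in [(1, 0), (0, 1), (-1, 0), (0, -1)]:
    p[vals.index(x), vals.index(y)] = 0.25

px = p.sum(axis=1)                           # marginal p(x)
py = p.sum(axis=0)                           # marginal p(y)
xs = np.array(vals, dtype=float)

mux, muy = px @ xs, py @ xs
cov = sum(p[i, j] * (xs[i] - mux) * (xs[j] - muy)
          for i in range(3) for j in range(3))
print(cov)                                   # 0.0 -> uncorrelated
print(np.allclose(p, np.outer(px, py)))      # False -> not independent
```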

Definition of the conditional expectation:

$$\mathbb{E}[x \mid y] := \int x\, p(x \mid y)\, dx$$

By the law of total expectation:

$$\mathbb{E}[xy] = \mathbb{E}_{y}\left[ \mathbb{E}[xy \mid y] \right] = \mathbb{E}_{y}\left[ y\, \mathbb{E}[x \mid y] \right]$$

From the table, $\mathbb{E}[x \mid y] = 0$ for every value of $y$, and $\mathbb{E}[x] = \mathbb{E}[y] = 0$ is proven above, so $\mathbb{E}[xy] = 0 = \mathbb{E}[x]\,\mathbb{E}[y]$ and $x$ and $y$ are uncorrelated.

Law of Total Expectation

Also called iterated expectation: taking the expectation over the conditioned variable and then over the conditioning variable gives the expectation under the joint distribution:

$$\mathbb{E}_{y}\left[ \mathbb{E}[x \mid y] \right] = \mathbb{E}[x]$$
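A tiny illustration of the law of total expectation on the same (assumed) table: averaging the conditional means $\mathbb{E}[x \mid y]$ over $p(y)$ recovers $\mathbb{E}[x]$:

```python
# E_y[ E[x|y] ] == E[x] on the discrete example.
import numpy as np

vals = [-1.0, 0.0, 1.0]
p = np.zeros((3, 3))                          # p[i, j] = p(x = vals[i], y = vals[j])
for (x, y) in [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]:
    p[vals.index(x), vals.index(y)] = 0.25

v = np.array(vals)
py = p.sum(axis=0)                            # marginal p(y)
E_x_given_y = (v @ p) / py                    # E[x | y] for each y (all zero here)
print(E_x_given_y)                            # [0. 0. 0.]
print(E_x_given_y @ py, v @ p.sum(axis=1))    # E_y[E[x|y]] and E[x], both 0.0
```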

Problem 1.4 Sum of Random Variables

For any $x$ and $y$ (by linearity of expectation):

$$\mathbb{E}[x + y] = \mathbb{E}[x] + \mathbb{E}[y]$$

When $x$ and $y$ are independent, we have $\mathbb{E}[xy] = \mathbb{E}[x]\,\mathbb{E}[y]$ (Problem 1.2). Writing $\mu_{xy} := \mathbb{E}[x + y] = \mathbb{E}[x] + \mathbb{E}[y]$, we get:

$$\begin{align} \mathrm{var}(x + y) &= \mathbb{E}\left[ \left( x + y - \mu_{xy} \right)^{2} \right] \\ &= \mathbb{E}\left[ x^{2} + y^{2} + 2xy - 2\mu_{xy}(x+y) + \mu_{xy}^{2} \right] \\ &= \mathbb{E}[x^{2}] + \mathbb{E}[y^{2}] + 2\,\mathbb{E}[xy] - 2\mu_{xy}\left( \mathbb{E}[x] + \mathbb{E}[y] \right) + \mu_{xy}^{2} \\ &= \mathbb{E}[x^{2}] + \mathbb{E}[y^{2}] + 2\,\mathbb{E}[x]\,\mathbb{E}[y] - 2\left( \mathbb{E}[x] + \mathbb{E}[y] \right)^{2} + \left( \mathbb{E}[x] + \mathbb{E}[y] \right)^{2} \\ &= \mathbb{E}[x^{2}] + \mathbb{E}[y^{2}] - \mathbb{E}[x]^{2} - \mathbb{E}[y]^{2} \\ &= \mathrm{var}(x) + \mathrm{var}(y) \end{align}$$
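A Monte Carlo sanity check that the variances of independent variables add; the distributions below are arbitrary illustrative choices:

```python
# var(x + y) = var(x) + var(y) for independent x and y.
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(2.0, size=1_000_000)     # var = 4
y = rng.normal(1.0, 3.0, size=1_000_000)     # var = 9
print(np.var(x + y), np.var(x) + np.var(y))  # both ≈ 13
```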

Problem 1.5 Expectation of an Indicator Variable
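No derivation is recorded here; assuming the problem refers to the standard result $\mathbb{E}\left[ \mathbb{1}\{x \in A\} \right] = P(x \in A)$, a minimal numerical illustration (event and distribution chosen arbitrarily):

```python
# Mean of an indicator variable equals the probability of the event.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=1_000_000)
indicator = (x > 1.0).astype(float)                   # 1{x > 1} with x ~ N(0, 1)
print(indicator.mean(), 1.0 - stats.norm.cdf(1.0))    # both ≈ 0.1587
```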

Problem 1.6 Multivariate Gaussian

Quadratic form: an expression of the form $x^{T} A x$; the Mahalanobis term in the Gaussian exponent is a quadratic form in $x - \mu$.

Multivariate Gaussian: a probability density over real vectors (a joint density):

$$\mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right)$$

With a diagonal covariance matrix $\Sigma = \mathrm{diag}(\sigma_{1}^{2}, \dots, \sigma_{d}^{2})$:

The Mahalanobis distance would be:

$$(x - \mu)^{T} \Sigma^{-1} (x - \mu) = \sum_{i=1}^{d} \frac{(x_{i} - \mu_{i})^{2}}{\sigma_{i}^{2}}$$

The determinant of $\Sigma$ would be:

$$|\Sigma| = \prod_{i=1}^{d} \sigma_{i}^{2}$$

So the density becomes a product of univariate Gaussians, indicating that a diagonal covariance matrix makes the components independent:

$$\mathcal{N}(x | \mu, \Sigma) = \prod_{i=1}^{d} \mathcal{N}(x_{i} | \mu_{i}, \sigma_{i}^{2})$$

With unequal variances, e.g. $\Sigma = \mathrm{diag}(\sigma_{x}^{2}, \sigma_{y}^{2})$ and $\sigma_{y}^{2} < \sigma_{x}^{2}$, the Gaussian is squeezed in the $y$ dimension.

With $\Sigma = \sigma^{2} I$, all components are i.i.d.
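A numerical check that a diagonal-covariance Gaussian density factorizes into a product of univariate Gaussian densities (the values below are arbitrary):

```python
# N(x | mu, diag(sig2)) equals the product of the per-dimension univariate densities.
import numpy as np
from scipy import stats

mu = np.array([1.0, -2.0, 0.5])
sig2 = np.array([0.5, 2.0, 1.5])                  # diagonal of Sigma
x = np.array([0.3, -1.0, 2.0])                    # arbitrary test point

joint = stats.multivariate_normal.pdf(x, mean=mu, cov=np.diag(sig2))
product = np.prod(stats.norm.pdf(x, loc=mu, scale=np.sqrt(sig2)))
print(np.allclose(joint, product))                # True
```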

Eigen-decompose $\Sigma$:

$$\Sigma = U \Lambda U^{T}, \qquad \Sigma^{-1} = U \Lambda^{-1} U^{T}$$

Let $z = \Lambda^{-1/2} U^{T} (x - \mu)$; the Mahalanobis distance can then be written as:

$$(x - \mu)^{T} \Sigma^{-1} (x - \mu) = (x - \mu)^{T} U \Lambda^{-1} U^{T} (x - \mu) = z^{T} z$$

$x - \mu$: centering. $\Lambda^{-1/2} U^{T}$: rotate and scale $x - \mu$, so the components of $z$ are independent and normalized.

Eigenvalue and eigenvector: $\Sigma u = \lambda u$, i.e. $(\Sigma - \lambda I)u = 0$. Since $u = 0$ is always a solution of $(\Sigma - \lambda I)u = 0$, to get additional (non-trivial) solutions, its determinant should be zero. Apply this idea to $\Sigma$:

$$\det(\Sigma - \lambda I) = 0$$

Eigenvectors define the directions of the principal axes; a small eigenvalue concentrates the distribution along its direction, while a large eigenvalue spreads the distribution out (more variance along that direction).
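A sketch of the whitening idea above: eigen-decompose $\Sigma$, map samples through $z = \Lambda^{-1/2} U^{T}(x - \mu)$, and check that $z$ has (approximately) identity covariance. $\Sigma$ and $\mu$ are arbitrary illustrative values:

```python
# Whitening a correlated Gaussian with the eigendecomposition of its covariance.
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, 2.0])
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])

lam, U = np.linalg.eigh(Sigma)                   # Sigma = U diag(lam) U^T
x = rng.multivariate_normal(mu, Sigma, size=200_000)
z = (x - mu) @ U @ np.diag(1.0 / np.sqrt(lam))   # z = Lambda^{-1/2} U^T (x - mu), row-wise
print(np.round(np.cov(z, rowvar=False), 2))      # ≈ identity matrix
```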

Problem 1.7 Product of Gaussian Distributions

$$\begin{align} \mathcal{N}(x | \mu_{1}, \sigma_{1}^{2})\, &\mathcal{N}(x | \mu_{2}, \sigma_{2}^{2}) \\ =&\ \frac{1}{\sqrt{2\pi}\,\sigma_{1}}\, \frac{1}{\sqrt{2\pi}\,\sigma_{2}} \exp\left( -\frac{1}{2} \left( \frac{(x-\mu_{1})^{2}}{\sigma_{1}^{2}} + \frac{(x-\mu_{2})^{2}}{\sigma_{2}^{2}} \right) \right) \\ =&\ \frac{1}{\sqrt{2\pi}\,\sigma_{1}}\, \frac{1}{\sqrt{2\pi}\,\sigma_{2}} \exp\left( -\frac{1}{2}\, \frac{(\sigma_{1}^{2}+\sigma_{2}^{2})x^{2} - 2(\mu_{1}\sigma_{2}^{2}+\mu_{2}\sigma_{1}^{2})x + (\mu_{1}^{2}\sigma_{2}^{2}+\mu_{2}^{2}\sigma_{1}^{2})}{\sigma_{1}^{2}\sigma_{2}^{2}} \right) \end{align}$$

Transform by completing the square in $x$. Define

$$\sigma_{3}^{2} := \frac{\sigma_{1}^{2}\sigma_{2}^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}}, \qquad \mu_{3} := \frac{\mu_{1}\sigma_{2}^{2}+\mu_{2}\sigma_{1}^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}}$$

so that the exponent becomes

$$-\frac{1}{2} \left( \frac{(x-\mu_{3})^{2}}{\sigma_{3}^{2}} + \frac{\mu_{1}^{2}\sigma_{2}^{2}+\mu_{2}^{2}\sigma_{1}^{2}}{\sigma_{1}^{2}\sigma_{2}^{2}} - \frac{\mu_{3}^{2}}{\sigma_{3}^{2}} \right)$$

A Gaussian distribution in $x$ can be identified:

$$\exp\left( -\frac{(x-\mu_{3})^{2}}{2\sigma_{3}^{2}} \right) = \sqrt{2\pi}\,\sigma_{3}\; \mathcal{N}(x | \mu_{3}, \sigma_{3}^{2})$$

Now process the rest of the expression

The remaining part of the exponent is $-\frac{1}{2}$ times:

$$\begin{align} &\frac{\mu_{1}^{2}\sigma_{2}^{2}+\mu_{2}^{2}\sigma_{1}^{2}}{\sigma_{1}^{2}\sigma_{2}^{2}} - \frac{\mu_{3}^{2}}{\sigma_{3}^{2}} \\ =&\ \frac{(\mu_{1}^{2}\sigma_{2}^{2}+\mu_{2}^{2}\sigma_{1}^{2})(\sigma_{1}^{2}+\sigma_{2}^{2})}{\sigma_{1}^{2}\sigma_{2}^{2}(\sigma_{1}^{2}+\sigma_{2}^{2})} - \frac{\mu_{1}^{2}\sigma_{2}^{4}+\mu_{2}^{2}\sigma_{1}^{4} + 2\mu_{1}\mu_{2}\sigma_{1}^{2}\sigma_{2}^{2}}{\sigma_{1}^{2}\sigma_{2}^{2}(\sigma_{1}^{2}+\sigma_{2}^{2})} \\ =&\ \frac{\left(\mu_{1}^{2}+\mu_{2}^{2}\sigma_{1}^{2}\sigma_{2}^{-2}+\mu_{1}^{2}\sigma_{1}^{-2}\sigma_{2}^{2} + \mu_{2}^{2} \right) - \left(\mu_{1}^{2}\sigma_{1}^{-2}\sigma_{2}^{2} + \mu_{2}^{2}\sigma_{1}^{2}\sigma_{2}^{-2} + 2\mu_{1}\mu_{2} \right)}{\sigma_{1}^{2}+\sigma_{2}^{2}} \\ =&\ \frac{(\mu_{1} - \mu_{2})^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}} \end{align}$$

This can also be treated as a Gaussian evaluated at a fixed value:

$$\exp\left( -\frac{1}{2}\, \frac{(\mu_{1}-\mu_{2})^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}} \right) = \sqrt{2\pi(\sigma_{1}^{2}+\sigma_{2}^{2})}\; \mathcal{N}(\mu_{1} | \mu_{2},\, \sigma_{1}^{2}+\sigma_{2}^{2})$$

And the remaining factor is:

$$\begin{align} &\sqrt{2\pi(\sigma_{1}^{2}+\sigma_{2}^{2})}\; \frac{1}{\sqrt{2\pi}\,\sigma_{1}}\, \frac{1}{\sqrt{2\pi}\,\sigma_{2}} \\ =&\ \frac{\sqrt{\sigma_{1}^{-2}+\sigma_{2}^{-2}}}{\sqrt{2\pi}} \\ =&\ \frac{1}{\sqrt{2\pi}\,\sigma_{3}} \end{align}$$

which, together with the exponential part, matches the form of a Gaussian in $x$. So the overall expression is:

$$\mathcal{N}(x | \mu_{1}, \sigma_{1}^{2})\, \mathcal{N}(x | \mu_{2}, \sigma_{2}^{2}) = \mathcal{N}(\mu_{1} | \mu_{2},\, \sigma_{1}^{2}+\sigma_{2}^{2})\; \mathcal{N}(x | \mu_{3}, \sigma_{3}^{2})$$

The result is a scaled Gaussian, so the product of two Gaussian densities is itself a normalized Gaussian density only if the scaling factor $\mathcal{N}(\mu_{1} | \mu_{2}, \sigma_{1}^{2}+\sigma_{2}^{2})$ equals 1.
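A numerical check of the product identity derived above, at a few arbitrary test points and parameters:

```python
# N(x|mu1,s1^2) N(x|mu2,s2^2) == N(mu1|mu2, s1^2+s2^2) N(x|mu3, s3^2).
import numpy as np
from scipy import stats

mu1, s1 = 0.7, 1.3
mu2, s2 = -1.2, 0.8
s3sq = (s1**2 * s2**2) / (s1**2 + s2**2)
mu3 = (mu1 * s2**2 + mu2 * s1**2) / (s1**2 + s2**2)

x = np.linspace(-3.0, 3.0, 7)
lhs = stats.norm.pdf(x, mu1, s1) * stats.norm.pdf(x, mu2, s2)
rhs = stats.norm.pdf(mu1, mu2, np.sqrt(s1**2 + s2**2)) * stats.norm.pdf(x, mu3, np.sqrt(s3sq))
print(np.allclose(lhs, rhs))   # True
```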

Problem 1.8 Product of Multivariate Gaussian Distributions

$$\begin{align} & \mathcal{N}(x | a, A)\, \mathcal{N}(x | b, B) \\ =&\ \frac{1}{(2\pi)^{d/2}|A|^{1/2}}\exp\left( -\frac{1}{2}\|x-a\|_{A}^{2} \right) \frac{1}{(2\pi)^{d/2}|B|^{1/2}}\exp\left( -\frac{1}{2}\|x-b\|_{B}^{2} \right) \\ =&\ \frac{1}{(2\pi)^{d}|A|^{1/2}|B|^{1/2}} \exp\left( -\frac{1}{2} \left( (x-a)^{T}A^{-1}(x-a) + (x-b)^{T} B^{-1} (x-b) \right) \right) \end{align}$$

expand the exponent term, and then complete the square:

$$\begin{align} &x^{T}(A^{-1}+B^{-1})x - 2x^{T}(A^{-1}a + B^{-1}b) + a^{T}A^{-1}a + b^{T}B^{-1}b \\ =&\ (x - c)^{T}(A^{-1} + B^{-1})(x - c) + e \\ \\ C =&\ (A^{-1}+B^{-1})^{-1} \\ c =&\ C(A^{-1}a + B^{-1}b) \\ e =&\ (a^{T} A^{-1} a + b^{T}B^{-1}b) - (A^{-1}a + B^{-1}b)^{T}(A^{-1}+B^{-1})^{-1}(A^{-1}a + B^{-1}b) \\ =&\ \dots \end{align}$$

so the Gaussian part of the product is:

$$\exp\left( -\frac{1}{2} (x-c)^{T} C^{-1} (x-c) \right) = (2\pi)^{d/2} |C|^{1/2}\; \mathcal{N}(x | c, C)$$

leaving the coefficient as:

$$\frac{|C|^{1/2}}{(2\pi)^{d/2}|A|^{1/2}|B|^{1/2}}\, \exp\left( -\frac{1}{2} e \right)$$

To figure out $e$, apply the Woodbury matrix identity to simplify the nested inverse $(A^{-1}+B^{-1})^{-1}$. Simply applying the Woodbury identity introduces two terms, so we make a further transformation:

$$\begin{align} (A^{-1} + B^{-1})^{-1} &= A - A(B + A)^{-1}A \\ &= A(B+A)^{-1} \left[ (B+A) - A \right] \\ &= A(A+B)^{-1}B \end{align}$$

so the last term of e should be:

$$\begin{align} & (a^{T}A^{-1} + b^{T}B^{-1})\left( A(A+B)^{-1}B \right)(A^{-1}a + B^{-1}b) \\ =&\ (a^{T} + b^{T}B^{-1}A)(A+B)^{-1} (BA^{-1}a + b) \\ =&\ a^{T}(A+B)^{-1}BA^{-1}a + b^{T}B^{-1}A(A+B)^{-1}BA^{-1}a + a^{T}(A+B)^{-1}b + b^{T}B^{-1}A(A+B)^{-1}b \end{align}$$

Combining the two $a$-terms in $e$:

$$\begin{align} & a^{T}A^{-1}a - a^{T}(A+B)^{-1}BA^{-1}a \\ =&\ a^{T} \left( I - (A+B)^{-1}B \right) A^{-1}a \\ =&\ a^{T}(A+B)^{-1}a \end{align}$$

also:

$$\begin{align} & b^{T}B^{-1}b - b^{T}B^{-1}A(A+B)^{-1}b \\ =&\ b^{T}B^{-1}\left( I - A(A+B)^{-1} \right)b \\ =&\ b^{T}(A+B)^{-1}b \end{align}$$

Observe the second (cross) term; we apply the exchange (push-through) identity $A(A+B)^{-1}B = B(A+B)^{-1}A$ here:

$$b^{T}B^{-1}A(A+B)^{-1}BA^{-1}a = b^{T}B^{-1}B(A+B)^{-1}AA^{-1}a = b^{T}(A+B)^{-1}a$$

and since $(A+B)^{-1}$ is symmetric, $b^{T}(A+B)^{-1}a = a^{T}(A+B)^{-1}b$, so the two cross terms combine and

$$e = a^{T}(A+B)^{-1}a + b^{T}(A+B)^{-1}b - 2\,a^{T}(A+B)^{-1}b = (a-b)^{T}(A+B)^{-1}(a-b)$$

and the coefficient is:

$$\begin{align} & \frac{|A^{-1}+B^{-1}|^{-1/2}}{(2\pi)^{d/2} |A|^{1/2}|B|^{1/2}} \\ =&\ \frac{|A(A+B)^{-1}B|^{1/2}}{(2\pi)^{d/2} |A|^{1/2}|B|^{1/2}} \\ =&\ \frac{1}{(2\pi)^{d/2}|A+B|^{1/2}} \end{align}$$

This transformation uses several determinant rules, in particular $|M^{-1}| = |M|^{-1}$ and $|MN| = |M|\,|N|$.

The coefficient and the remaining exponential term $\exp\left( -\frac{1}{2} e \right)$ also form a Gaussian evaluated at a fixed value:

$$\frac{1}{(2\pi)^{d/2}|A+B|^{1/2}} \exp\left( -\frac{1}{2} (a-b)^{T}(A+B)^{-1}(a-b) \right) = \mathcal{N}(a | b,\, A+B)$$

So the product of multivariate Gaussians is:

$$\mathcal{N}(x | a, A)\, \mathcal{N}(x | b, B) = \mathcal{N}(a | b,\, A+B)\; \mathcal{N}(x | c, C), \qquad C = (A^{-1}+B^{-1})^{-1},\quad c = C(A^{-1}a + B^{-1}b)$$
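A numerical check of the multivariate product formula, using random symmetric positive definite $A$, $B$ and arbitrary means $a$, $b$:

```python
# N(x|a,A) N(x|b,B) == N(a|b, A+B) N(x|c, C) at an arbitrary test point.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
d = 3

def random_spd(rng, d):
    M = rng.normal(size=(d, d))
    return M @ M.T + d * np.eye(d)            # symmetric positive definite

A, B = random_spd(rng, d), random_spd(rng, d)
a, b = rng.normal(size=d), rng.normal(size=d)

Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
C = np.linalg.inv(Ainv + Binv)
c = C @ (Ainv @ a + Binv @ b)

x = rng.normal(size=d)                        # arbitrary test point
lhs = stats.multivariate_normal.pdf(x, a, A) * stats.multivariate_normal.pdf(x, b, B)
rhs = stats.multivariate_normal.pdf(a, b, A + B) * stats.multivariate_normal.pdf(x, c, C)
print(np.allclose(lhs, rhs))                  # True
```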

Woodbury Matrix Identity

Inverting a matrix after a low-rank modification, at low cost.

Motivation: if you have already inverted a large matrix $A$, inverting it after adding a low-rank update $UCV$ should not require starting from scratch.

  • $C$: the core update,
  • $U$, $V$: bring the update into the space of $A$.

How to construct the inverse

  • subtract the correction: adding something to the original matrix usually “shrinks” the inverse:

$$\begin{align} &(A+UCV)(A^{-1} - \mathrm{Correction}) \\ =&\ (I + UCVA^{-1}) - (A+UCV)(\mathrm{Correction}) \\ =&\ I + UCVA^{-1} - UC(C^{-1}U^{-1}A + V)(\mathrm{Correction}) \end{align}$$

To cancel the last two terms, we can build the correction with a sandwich structure to match the factor $UC(C^{-1}U^{-1}A + V)$:

  1. there is a $VA^{-1}$ at the back, matching the tail of the target term $UCVA^{-1}$
  2. we don’t want to invert $U$, even if it were invertible, so there should be an $A^{-1}U$ in the form, which cancels the $U^{-1}A$ factor
  3. let’s call the remaining part $M$. So the correction term should have the form $A^{-1}U\, M\, VA^{-1}$, and the last term becomes $UC\left( C^{-1} + VA^{-1}U \right) M\, VA^{-1}$. The only unknown matrix here is $M$; compared with $UCVA^{-1}$, the only difference is that the last term has $\left( C^{-1} + VA^{-1}U \right) M$ inside. Making this the identity, i.e. $M = \left( C^{-1} + VA^{-1}U \right)^{-1}$, we are able to cancel these two terms.

So the correction term should be

$$\mathrm{Correction} = A^{-1}U \left( C^{-1} + VA^{-1}U \right)^{-1} VA^{-1}$$

and the inverse should be:

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U \left( C^{-1} + VA^{-1}U \right)^{-1} VA^{-1}$$

Notice that we assumed $U$ is invertible, which does not always hold (in general $U$ is not even square). But this form also holds when $U$ is not invertible, which can be verified by the same kind of direct multiplication as above.
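A numerical check of the Woodbury identity with a non-square $U$ (so $U$ is certainly not invertible); here $V = U^{T}$ and the matrices are arbitrary symmetric positive definite choices so that every inverse exists:

```python
# (A + U C V)^{-1} == A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
import numpy as np

rng = np.random.default_rng(6)
n, k = 5, 2
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)          # SPD, invertible
N = rng.normal(size=(k, k))
C = N @ N.T + np.eye(k)          # SPD, invertible
U = rng.normal(size=(n, k))
V = U.T

Ainv = np.linalg.inv(A)
woodbury = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv
print(np.allclose(woodbury, np.linalg.inv(A + U @ C @ V)))   # True
```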

Problem 1.9 Correlation between Gaussian Distributions

Warning

Do not mix up the concepts of correlation and covariance.

The correlation between two distributions is defined as:

$$\mathrm{corr}(p_{1}, p_{2}) := \int p_{1}(x)\, p_{2}(x)\, dx$$

Using Problem 1.8, the correlation between two Gaussian distributions is:

$$\begin{align} & \int \mathcal{N}(x | a, A)\, \mathcal{N}(x | b, B)\, dx \\ =&\ \int \mathcal{N}(a | b,\, A+B)\, \mathcal{N}(x | c, C)\, dx \\ =&\ \mathcal{N}(a | b,\, A+B) \end{align}$$
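A numerical check of the correlation integral in one dimension: integrate the product of two Gaussian densities over $x$ and compare with $\mathcal{N}(a | b, A+B)$ (the scalar means and variances below are arbitrary):

```python
# Integral of N(x|a,A) N(x|b,B) dx equals N(a|b, A+B), in 1-D.
import numpy as np
from scipy import stats
from scipy.integrate import quad

a, A = 0.4, 1.5     # scalar mean and variance
b, B = -1.0, 0.7

integrand = lambda x: stats.norm.pdf(x, a, np.sqrt(A)) * stats.norm.pdf(x, b, np.sqrt(B))
integral, _ = quad(integrand, -np.inf, np.inf)
print(np.allclose(integral, stats.norm.pdf(a, b, np.sqrt(A + B))))   # True
```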

Problem 1.10 Completing the square

Problem 1.11 Eigenvalues

The trace is defined as the sum of the diagonal entries: $\mathrm{tr}(A) = \sum_{i} A_{ii}$.

Eigenvalues are the roots of the characteristic polynomial:

$$\det(A - \lambda I) = \prod_{i=1}^{n} (\lambda_{i} - \lambda)$$

The latter equivalence is guaranteed by the Identity Theorem for Polynomials.

Let $\lambda = 0$; then:

$$\det(A) = \prod_{i=1}^{n} \lambda_{i}$$

Comparing the coefficients of $\lambda^{n-1}$ on both sides in the same way gives $\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_{i}$.
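A quick check that the trace equals the sum of the eigenvalues and the determinant equals their product, for an arbitrary random matrix:

```python
# tr(A) == sum of eigenvalues, det(A) == product of eigenvalues.
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(4, 4))
eig = np.linalg.eigvals(A)                        # possibly complex for a general A
print(np.allclose(np.trace(A), eig.sum()))        # True
print(np.allclose(np.linalg.det(A), eig.prod()))  # True
```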

Problem 1.12 Eigenvalues of an inverse matrix

Problem 1.13 Positive Definiteness