Problem

optimal decisions in problems with uncertainty

Framework

  • states: $y \in \{1, \dots, C\}$
    • prior: $p(y = i)$ = probability of a state
  • observer
    • measures features $x$ from a r.v. $X$
    • class conditional distribution $p(x \mid y = i)$ — distribution of $x$ conditional on the state (class)
      • one CCD for each state
  • Decision function $g(x)$: use the observation to make a decision about the state
  • Loss function $L(g(x), y)$: penalizes deciding $g(x)$ when the true state is $y$
    • 0-1 Loss: $L(g(x), y) = \mathbf{1}[g(x) \neq y]$

Bayesian Decision Rule

Risk - expected value of the loss function:

$R = E_{x,y}[L(g(x), y)] = \int \Big( \sum_y L(g(x), y)\, p(y \mid x) \Big) p(x)\, dx$

Since $p(x, y) = p(y \mid x)\, p(x)$ and $p(x) \geq 0$, minimizing the risk can be achieved by minimizing the conditional risk $R(g(x) \mid x) = \sum_y L(g(x), y)\, p(y \mid x)$ for each $x$.

For a particular $x$, choose the class that minimizes the conditional risk:

$g^*(x) = \arg\min_i \sum_y L(i, y)\, p(y \mid x)$

This is the Bayesian Decision Rule.

Classification and 0-1 loss

settings: $L(i, y) = \mathbf{1}[i \neq y]$

conditional risk: $R(i \mid x) = \sum_{y \neq i} p(y \mid x) = 1 - p(y = i \mid x)$

BDR: $g^*(x) = \arg\max_i\, p(y = i \mid x)$

Equivalently (by Bayes rule, dropping the class-independent $p(x)$): $g^*(x) = \arg\max_i\, p(x \mid y = i)\, p(y = i)$
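As a small sketch (all class parameters below are invented for illustration): with 1-D Gaussian CCDs, the BDR for 0-1 loss just picks the class with the largest $p(x \mid y = i)\, p(y = i)$:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bdr_01(x, mus, sigmas, priors):
    """BDR for 0-1 loss: arg max_i p(x | y=i) p(y=i)."""
    scores = [gaussian_pdf(x, m, s) * p for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))

# Two hypothetical classes with equal priors and equal variances:
# MAP then reduces to "nearest mean", so the decision flips at the midpoint x = 2.
mus, sigmas, priors = [0.0, 4.0], [1.0, 1.0], [0.5, 0.5]
```

With unequal priors the boundary shifts away from the more likely class's mean.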

Example - 2 class classification

given $x$: pick class 1 if $p(y = 1 \mid x) > p(y = 0 \mid x)$, equivalently: $\dfrac{p(x \mid y = 1)}{p(x \mid y = 0)} > \dfrac{p(y = 0)}{p(y = 1)}$

Summary:

for 0-1 loss function

  • BDR is the MAP rule (it sets the threshold of the Likelihood Ratio Test)
  • Risk = probability of error
  • BDR minimizes the probability of error (no other decision rule is better)
  • caveat: this assumes our models are correct! (the CCD and the prior)

This is called a generative model

  1. Use data to learn the CCDs (modeling how the features are generated)
  2. use the CCDs in the decision rule

Example: Noisy Channel

decode: the channel transmits a bit $b \in \{0, 1\}$ as a level $\mu_b$, and we observe $x = \mu_b + \varepsilon$.

Goal: given $x$, recover the bit $b$. Model:

  • prior: $p(b = 0) = p(b = 1) = 1/2$ (equally likely bits)
  • CCD: assume Gaussian additive noise: $\varepsilon \sim N(0, \sigma^2)$, so $p(x \mid b) = N(x; \mu_b, \sigma^2)$

BDR for 0-1 Loss: $b^*(x) = \arg\max_b\, p(x \mid b)\, p(b)$

Hence: pick $b = 1$ when: $x > \dfrac{\mu_0 + \mu_1}{2}$ (the midpoint, since the priors are equal)
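A quick Monte Carlo check of the rule (the signal levels and noise scale below are made up): the midpoint threshold achieves error probability $Q\!\big(\frac{\mu_1 - \mu_0}{2\sigma}\big)$, which is about 0.159 for these values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1, sigma = 0.0, 1.0, 0.5       # hypothetical signal levels and noise std
n = 100_000
bits = rng.integers(0, 2, n)          # equally likely bits: p(b=0) = p(b=1) = 1/2
x = np.where(bits == 1, mu1, mu0) + rng.normal(0.0, sigma, n)

# BDR with equal priors: threshold at the midpoint of the two means
decoded = (x > 0.5 * (mu0 + mu1)).astype(int)
error_rate = float(np.mean(decoded != bits))
```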

Gaussian Classifier

  • CCDs are Gaussians: $p(x \mid y = i) = N(x; \mu_i, \Sigma_i)$
  • No assumption on prior

Special case:

  • Assume $\Sigma_i = \sigma^2 I$ for all $i$ (shared isotropic covariance)
  • $g_i(x) = \log p(x \mid y = i) + \log p(y = i)$ (discriminant function)

The boundary $g_i(x) = g_j(x)$ defines a hyperplane that passes through a point $x_0$ and is normal to $w = \mu_i - \mu_j$.
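A sketch of this boundary (the means, variance, and priors below are invented): setting $g_0(x_0) = g_1(x_0)$ gives a hyperplane normal to $w = \mu_1 - \mu_0$, passing through a point $x_0$ shifted off the midpoint by the log prior ratio.

```python
import numpy as np

def discriminant(x, mu, sigma2, prior):
    """g_i(x) = log N(x; mu_i, sigma^2 I) + log p(y=i), class-independent terms dropped."""
    return -np.dot(x - mu, x - mu) / (2 * sigma2) + np.log(prior)

def boundary(mu0, mu1, sigma2, p0, p1):
    """Normal vector w and a point x0 on the hyperplane g_0(x) = g_1(x)."""
    w = mu1 - mu0
    x0 = 0.5 * (mu0 + mu1) - sigma2 / np.dot(w, w) * np.log(p1 / p0) * w
    return w, x0

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
w, x0 = boundary(mu0, mu1, sigma2=1.0, p0=0.3, p1=0.7)
```

With equal priors $x_0$ is exactly the midpoint; here the larger prior of class 1 pushes $x_0$ toward $\mu_0$, enlarging class 1's decision region.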

Goal: Find the optimal decision rule $g^*$ for the given assumptions (prior, CCD, loss function)

Mode, Mean, Median

Loss function $L(g(x), y)$ for every predict-true value pair. The optimal prediction depends on the loss: 0-1 loss gives the posterior mode (MAP), squared loss the posterior mean, absolute loss the posterior median.

Conditional Risk

Given x, the risk of the system is: $R(g(x) \mid x) = \sum_y L(g(x), y)\, p(y \mid x)$

Total Risk: $R = \int R(g(x) \mid x)\, p(x)\, dx$

choose the action that gives the smallest possible value for the conditional risk

The decision rule combines three ingredients: the likelihood $p(x \mid y)$, the loss $L$, and the prior $p(y)$.

To minimize total risk, minimize conditional risk.
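A minimal sketch of conditional-risk minimization with a general loss (the loss matrix and posterior are invented): an asymmetric loss can overrule the MAP choice.

```python
import numpy as np

# Hypothetical loss matrix L[i, y]: cost of deciding i when the truth is y.
# Missing class 1 costs 10x more than a false alarm.
L = np.array([[0.0, 10.0],
              [1.0,  0.0]])

def bayes_decision(posterior, L):
    """Pick the action minimizing the conditional risk R(i|x) = sum_y L[i,y] p(y|x)."""
    risks = L @ posterior
    return int(np.argmin(risks))

posterior = np.array([0.8, 0.2])   # p(y=0|x), p(y=1|x)
```

Here MAP would pick class 0, but the conditional risk of deciding 0 is 10 * 0.2 = 2.0 versus 1 * 0.8 = 0.8 for deciding 1, so the BDR picks class 1.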

Two-class decisions: Likelihood Ratio Test (LRT)

LRT: minimizing the conditional risk $\sum_y L(g(x), y)\, p(y \mid x)$ over the two choices gives: pick class 1 if

$\dfrac{p(x \mid y = 1)}{p(x \mid y = 0)} > \dfrac{(L(1, 0) - L(0, 0))\, p(y = 0)}{(L(0, 1) - L(1, 1))\, p(y = 1)}$

Dimensionality

Weirdness of high-dimensional spaces

volume of a sphere of radius $r$ in $d$ dimensions: $V_d(r) = \dfrac{\pi^{d/2}}{\Gamma(d/2 + 1)}\, r^d$

Gamma Function: $\Gamma(n) = (n - 1)!$ for integer $n$

Volume of hypercube with side $2r$: $(2r)^d$

corner vector $(r, \dots, r)$ has length $r\sqrt{d}$, while an axis vector $(r, 0, \dots, 0)$ has length $r$: corners get ever farther from the center

hypersphere shell of thickness $\epsilon$:

outside sphere of radius $r$, inside sphere of radius $r - \epsilon$

fraction of volume inside the shell: $\dfrac{V_d(r) - V_d(r - \epsilon)}{V_d(r)} = 1 - \Big(1 - \dfrac{\epsilon}{r}\Big)^d \to 1$ as $d \to \infty$, so all the volume is inside the shell
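The shell claim follows from volume scaling as $r^d$: the fraction outside the inner sphere is $1 - (1 - \epsilon/r)^d$. A tiny sketch:

```python
def shell_fraction(d, eps_over_r=0.01):
    """Fraction of a d-dim ball's volume in the outer shell of relative thickness eps/r.

    Volume scales as r**d, so the inner ball holds ((r - eps)/r)**d of the total.
    """
    return 1.0 - (1.0 - eps_over_r) ** d
```

Even a shell that is 1% of the radius holds essentially all the volume once $d$ reaches the hundreds.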

High dimensional Gaussian

For $x \sim N(0, I_d)$, $\|x\|^2 = \sum_{i=1}^d x_i^2$. By the Central Limit Theorem this sum / average is concentrated around its mean: $E[\|x\|^2] = d$.

As $d$ increases, $\|x\|^2 / d$ converges to $1$. Thus the length of almost all sample vectors will be $\approx \sqrt{d}$. Thus in high dimensions, a Gaussian looks like a spherical shell of radius $\sqrt{d}$, and most of the probability mass is in this shell. The point of maximum density is still the mean (0).
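A quick empirical check of this concentration (sample sizes arbitrary): the relative spread of $\|x\|$ around $\sqrt{d}$ shrinks as $d$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
spread = {}
for d in (2, 100, 10_000):
    x = rng.normal(size=(1000, d))                 # 1000 samples of N(0, I_d)
    norms = np.linalg.norm(x, axis=1)
    spread[d] = float(norms.std() / norms.mean())  # relative spread of ||x||
```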

Curse of Dimensionality

In theory, adding new features cannot hurt: an informative feature separates the CCDs further, and with an uninformative feature the CCDs still overlap exactly as before.

In practice, performance degrades because of the quality of the CCD estimates:

  • density estimates in high dimensions need more data! e.g. a high-dim histogram
  • suppose we want on average 1 sample / bin: with $m$ bins per axis we need $m^d$ samples

In general, desired training set size $n \propto k$, where $k$ is the number of parameters of the model.
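The histogram count makes the exponential data requirement explicit: with $m$ bins per axis, one sample per bin already requires $m^d$ samples.

```python
def samples_needed(m, d):
    """Samples for 1 sample/bin in a histogram with m bins along each of d axes."""
    return m ** d
```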

CCD

CCD = Class Conditional Density $P(x \mid C_i)$

Ways to fight the curse:

  • reduce the number of parameters (complexity of the model)
  • reduce the number of features (dimensionality reduction), which reduces # parameters
  • create more data
    • Bayesian formulation (virtual samples)
    • data augmentation

Linear Dimensionality Reduction

  • summarize correlated features with fewer features
  • How? Assume the data lives in a low-dimensional subspace, and project onto it with a linear operation: $z = W^T x$, where $W \in \mathbb{R}^{d \times k}$ with $k < d$

Principal Component Analysis (PCA)

Idea: if the data lives in a subspace it will look flat in some directions.

if we fit a Gaussian to such data, its covariance ellipse will be highly elongated (flat along most directions)

eigen-decomposition: $\Sigma = \Phi \Lambda \Phi^T$, where $\Phi = [\phi_1, \dots, \phi_d]$ holds orthonormal eigenvectors and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$

  • each $\phi_i$ defines an axis of the ellipse
  • each $\lambda_i$ defines the width (variance) along that axis

keep the large eigenvalues:

select the $k$ eigenvectors with the largest eigenvalues (principal components) to find the subspace where the data “lives”

Recipe of PCA:

  1. fit a Gaussian: $\mu = \frac{1}{n}\sum_i x_i$, $\Sigma = \frac{1}{n}\sum_i (x_i - \mu)(x_i - \mu)^T$
  2. eigen decomp: $\Sigma = \Phi \Lambda \Phi^T$
  3. sort eigenvalues from largest to smallest
  4. select top-k eigenvectors: $\Phi_k = [\phi_1, \dots, \phi_k]$
  5. project onto the subspace: $z = \Phi_k^T (x - \mu)$
  6. use $z$ as the new feature vector, apply the BDR as usual
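The recipe above maps directly onto NumPy; the data below is synthetic, generated near a 1-D subspace of 2-D so the recovered direction is easy to check.

```python
import numpy as np

def pca_fit(X, k):
    """Steps 1-4: fit the Gaussian, eigendecompose, keep the top-k eigenvectors."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)       # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]           # sort from largest to smallest
    return mu, vecs[:, order[:k]]

def pca_project(X, mu, Phi_k):
    """Step 5: z = Phi_k^T (x - mu)."""
    return (X - mu) @ Phi_k

rng = np.random.default_rng(0)
t = rng.normal(size=(500, 1))
X = t @ np.array([[3.0, 1.0]]) + 0.05 * rng.normal(size=(500, 2))  # "lives" along [3, 1]
mu, Phi_k = pca_fit(X, k=1)
z = pca_project(X, mu, Phi_k)
```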

Notes: This selection of $\Phi_k$

  1. maximizes the variance of the projected training data
  2. minimizes the reconstruction error of the training data

PCA can be implemented efficiently using the SVD. To choose $k$: pick a $k$ that works well downstream, or pick $k$ to preserve a desired fraction of the variance of the data

Assumption - “noise” variance is smaller than the signal variance

PCA is optimal for representation (but not necessarily for classification)

PCA has no way to fix this (it doesn’t use the class information)

Linear Discriminant Analysis (Fisher Linear Discriminant)

  • Find a linear projection that best separates the classes

input space (x): class mean $\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$

1-d space (z): project every point with $z = w^T x$

                  input space    1-d space
  class mean      $\mu_i$        $m_i = w^T \mu_i$
  class scatter                  $s_i^2 = \sum_{z_j \in C_i} (z_j - m_i)^2$
  input           $x$            $z = w^T x$

Idea: maximize the distance between the projected means $|m_1 - m_2|$

problem: this objective is unconstrained in $w$ (scaling $w$ scales the distance), so we need a normalization

Fisher’s Idea: normalize by the within-class scatter and maximize

$J(w) = \dfrac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$

i.e. high cohesion within each class, low coupling between classes: the projected classes should be tight and far apart
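Fisher's criterion has the closed-form maximizer $w \propto S_w^{-1}(\mu_1 - \mu_0)$, where $S_w$ is the within-class scatter matrix. A small sketch on synthetic, strongly correlated data (all numbers invented):

```python
import numpy as np

def fisher_direction(X0, X1):
    """w maximizing J(w): solve S_w w = (mu_1 - mu_0), then normalize."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # within-class scatter = sum of per-class scatter matrices
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + np.cov(X1, rowvar=False) * (len(X1) - 1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])     # shared, strongly correlated covariance
X0 = rng.multivariate_normal([0.0, 0.0], cov, 500)
X1 = rng.multivariate_normal([1.0, 0.0], cov, 500)
w = fisher_direction(X0, X1)
```

Unlike projecting onto the raw mean difference [1, 0], Fisher's $w$ tilts against the shared correlation (roughly proportional to [1, -0.9] here), which shrinks the within-class scatter of the projection.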