Problem

optimal decisions in problems with uncertainty

Framework

  • states: $y \in \{1, \dots, C\}$
    • prior: $p(y = i)$ = probability of a state
  • observer
    • measures features $x$ from a r.v. $X$
    • class conditional distribution $p(x \mid y = i)$ — distribution of $x$ conditional on the state (class)
      • one CCD for each state
  • Decision function $g(x)$: use the observation to make a decision about the state
  • Loss function $L(g(x), y)$: penalizes deciding $g(x)$ when the true state is $y$
    • 0-1 Loss: $L(g(x), y) = \mathbf{1}[g(x) \neq y]$

Bayesian Decision Rule

Risk - expected value of the loss function:

$R = E_{x,y}[L(g(x), y)] = \int \Big( \sum_y L(g(x), y)\, p(y \mid x) \Big) p(x)\, dx$

Since $p(x, y) = p(y \mid x)\, p(x)$ and $p(x) \geq 0$, minimizing the risk can be achieved by minimizing the conditional risk $R(g(x) \mid x) = \sum_y L(g(x), y)\, p(y \mid x)$ for each $x$.

For a particular $x$, choose the class that minimizes the conditional risk:

$g^*(x) = \arg\min_i \sum_y L(i, y)\, p(y \mid x)$

This is the Bayesian Decision Rule.

Classification and 0-1 loss

settings: $L(i, y) = \mathbf{1}[i \neq y]$

conditional risk: $R(i \mid x) = \sum_{y \neq i} p(y \mid x) = 1 - p(y = i \mid x)$

BDR: $g^*(x) = \arg\max_i\, p(y = i \mid x)$

Equivalently (by Bayes rule, dropping the class-independent $p(x)$): $g^*(x) = \arg\max_i\, p(x \mid y = i)\, p(y = i)$
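As a small sketch (all class parameters below are invented for illustration): with 1-D Gaussian CCDs, the BDR for 0-1 loss just picks the class with the largest $p(x \mid y = i)\, p(y = i)$:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bdr_01(x, mus, sigmas, priors):
    """BDR for 0-1 loss: arg max_i p(x | y=i) p(y=i)."""
    scores = [gaussian_pdf(x, m, s) * p for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))

# Two hypothetical classes with equal priors and equal variances:
# MAP then reduces to "nearest mean", so the decision flips at the midpoint x = 2.
mus, sigmas, priors = [0.0, 4.0], [1.0, 1.0], [0.5, 0.5]
```

With unequal priors the boundary shifts away from the more likely class's mean.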

Example - 2 class classification

given $x$: pick class 1 if $p(y = 1 \mid x) > p(y = 0 \mid x)$, equivalently: $\dfrac{p(x \mid y = 1)}{p(x \mid y = 0)} > \dfrac{p(y = 0)}{p(y = 1)}$

Summary:

for 0-1 loss function

  • BDR is the MAP rule (it sets the threshold of the Likelihood Ratio Test)
  • Risk = probability of error
  • BDR minimizes the probability of error (no other decision rule is better)
  • caveat: this assumes our models are correct! (the CCD and the prior)

This is called a generative model

  1. Use data to learn the CCDs (modeling how the features are generated)
  2. use the CCDs in the decision rule

Example: Noisy Channel

decode: the channel transmits a bit $b \in \{0, 1\}$ as a level $\mu_b$, and we observe $x = \mu_b + \varepsilon$.

Goal: given $x$, recover the bit $b$. Model:

  • prior: $p(b = 0) = p(b = 1) = 1/2$ (equally likely bits)
  • CCD: assume Gaussian additive noise: $\varepsilon \sim N(0, \sigma^2)$, so $p(x \mid b) = N(x; \mu_b, \sigma^2)$

BDR for 0-1 Loss: $b^*(x) = \arg\max_b\, p(x \mid b)\, p(b)$

Hence: pick $b = 1$ when: $x > \dfrac{\mu_0 + \mu_1}{2}$ (the midpoint, since the priors are equal)
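A quick Monte Carlo check of the rule (the signal levels and noise scale below are made up): the midpoint threshold achieves error probability $Q\!\big(\frac{\mu_1 - \mu_0}{2\sigma}\big)$, which is about 0.159 for these values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1, sigma = 0.0, 1.0, 0.5       # hypothetical signal levels and noise std
n = 100_000
bits = rng.integers(0, 2, n)          # equally likely bits: p(b=0) = p(b=1) = 1/2
x = np.where(bits == 1, mu1, mu0) + rng.normal(0.0, sigma, n)

# BDR with equal priors: threshold at the midpoint of the two means
decoded = (x > 0.5 * (mu0 + mu1)).astype(int)
error_rate = float(np.mean(decoded != bits))
```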

Gaussian Classifier

  • CCDs are Gaussians: $p(x \mid y = i) = N(x; \mu_i, \Sigma_i)$
  • No assumption on prior

Special case:

  • Assume $\Sigma_i = \sigma^2 I$ for all $i$ (shared isotropic covariance)
  • $g_i(x) = \log p(x \mid y = i) + \log p(y = i)$ (discriminant function)

The boundary $g_i(x) = g_j(x)$ defines a hyperplane that passes through a point $x_0$ and is normal to $w = \mu_i - \mu_j$.
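A sketch of this boundary (the means, variance, and priors below are invented): setting $g_0(x_0) = g_1(x_0)$ gives a hyperplane normal to $w = \mu_1 - \mu_0$, passing through a point $x_0$ shifted off the midpoint by the log prior ratio.

```python
import numpy as np

def discriminant(x, mu, sigma2, prior):
    """g_i(x) = log N(x; mu_i, sigma^2 I) + log p(y=i), class-independent terms dropped."""
    return -np.dot(x - mu, x - mu) / (2 * sigma2) + np.log(prior)

def boundary(mu0, mu1, sigma2, p0, p1):
    """Normal vector w and a point x0 on the hyperplane g_0(x) = g_1(x)."""
    w = mu1 - mu0
    x0 = 0.5 * (mu0 + mu1) - sigma2 / np.dot(w, w) * np.log(p1 / p0) * w
    return w, x0

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
w, x0 = boundary(mu0, mu1, sigma2=1.0, p0=0.3, p1=0.7)
```

With equal priors $x_0$ is exactly the midpoint; here the larger prior of class 1 pushes $x_0$ toward $\mu_0$, enlarging class 1's decision region.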

Goal: Find the optimal decision rule $g^*$ for the given assumptions (prior, CCD, loss function)

Mode, Mean, Median

Loss function $L(g(x), y)$ for every predict-true value pair. The optimal prediction depends on the loss: 0-1 loss gives the posterior mode (MAP), squared loss the posterior mean, absolute loss the posterior median.

Conditional Risk

Given x, the risk of the system is: $R(g(x) \mid x) = \sum_y L(g(x), y)\, p(y \mid x)$

Total Risk: $R = \int R(g(x) \mid x)\, p(x)\, dx$

choose the action that gives the smallest possible value for the conditional risk

The decision rule combines three ingredients: the likelihood $p(x \mid y)$, the loss $L$, and the prior $p(y)$.

To minimize total risk, minimize conditional risk.
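A minimal sketch of conditional-risk minimization with a general loss (the loss matrix and posterior are invented): an asymmetric loss can overrule the MAP choice.

```python
import numpy as np

# Hypothetical loss matrix L[i, y]: cost of deciding i when the truth is y.
# Missing class 1 costs 10x more than a false alarm.
L = np.array([[0.0, 10.0],
              [1.0,  0.0]])

def bayes_decision(posterior, L):
    """Pick the action minimizing the conditional risk R(i|x) = sum_y L[i,y] p(y|x)."""
    risks = L @ posterior
    return int(np.argmin(risks))

posterior = np.array([0.8, 0.2])   # p(y=0|x), p(y=1|x)
```

Here MAP would pick class 0, but the conditional risk of deciding 0 is 10 * 0.2 = 2.0 versus 1 * 0.8 = 0.8 for deciding 1, so the BDR picks class 1.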

Two-class decisions: Likelihood Ratio Test (LRT)

LRT: minimizing the conditional risk $\sum_y L(g(x), y)\, p(y \mid x)$ over the two choices gives: pick class 1 if

$\dfrac{p(x \mid y = 1)}{p(x \mid y = 0)} > \dfrac{(L(1, 0) - L(0, 0))\, p(y = 0)}{(L(0, 1) - L(1, 1))\, p(y = 1)}$

Dimensionality

Weirdness of high-dimensional spaces

volume of a sphere of radius $r$ in $d$ dimensions: $V_d(r) = \dfrac{\pi^{d/2}}{\Gamma(d/2 + 1)}\, r^d$

Gamma Function: $\Gamma(n) = (n - 1)!$ for integer $n$

Volume of hypercube with side $2r$: $(2r)^d$

corner vector $(r, \dots, r)$ has length $r\sqrt{d}$, while an axis vector $(r, 0, \dots, 0)$ has length $r$: corners get ever farther from the center

hypersphere shell of thickness $\epsilon$:

outside sphere of radius $r$, inside sphere of radius $r - \epsilon$

fraction of volume inside the shell: $\dfrac{V_d(r) - V_d(r - \epsilon)}{V_d(r)} = 1 - \Big(1 - \dfrac{\epsilon}{r}\Big)^d \to 1$ as $d \to \infty$, so all the volume is inside the shell
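The shell claim follows from volume scaling as $r^d$: the fraction outside the inner sphere is $1 - (1 - \epsilon/r)^d$. A tiny sketch:

```python
def shell_fraction(d, eps_over_r=0.01):
    """Fraction of a d-dim ball's volume in the outer shell of relative thickness eps/r.

    Volume scales as r**d, so the inner ball holds ((r - eps)/r)**d of the total.
    """
    return 1.0 - (1.0 - eps_over_r) ** d
```

Even a shell that is 1% of the radius holds essentially all the volume once $d$ reaches the hundreds.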

High dimensional Gaussian

For $x \sim N(0, I_d)$, $\|x\|^2 = \sum_{i=1}^d x_i^2$. By the Central Limit Theorem this sum / average is concentrated around its mean: $E[\|x\|^2] = d$.

As $d$ increases, $\|x\|^2 / d$ converges to $1$. Thus the length of almost all sample vectors will be $\approx \sqrt{d}$. Thus in high dimensions, a Gaussian looks like a spherical shell of radius $\sqrt{d}$, and most of the probability mass is in this shell. The point of maximum density is still the mean (0).
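A quick empirical check of this concentration (sample sizes arbitrary): the relative spread of $\|x\|$ around $\sqrt{d}$ shrinks as $d$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
spread = {}
for d in (2, 100, 10_000):
    x = rng.normal(size=(1000, d))                 # 1000 samples of N(0, I_d)
    norms = np.linalg.norm(x, axis=1)
    spread[d] = float(norms.std() / norms.mean())  # relative spread of ||x||
```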

Curse of Dimensionality

In theory, adding new features cannot hurt: an informative feature separates the CCDs further, and with an uninformative feature the CCDs still overlap exactly as before.

In practice, performance degrades because of the quality of the CCD estimates:

  • density estimates in high dimensions need more data! e.g. a high-dim histogram
  • suppose we want on average 1 sample / bin: with $m$ bins per axis we need $m^d$ samples

In general, desired training set size $n \propto k$, where $k$ is the number of parameters of the model.
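The histogram count makes the exponential data requirement explicit: with $m$ bins per axis, one sample per bin already requires $m^d$ samples.

```python
def samples_needed(m, d):
    """Samples for 1 sample/bin in a histogram with m bins along each of d axes."""
    return m ** d
```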

CCD

CCD = Class Conditional Density $P(x \mid C_i)$

Ways to fight the curse:

  • reduce the number of parameters (complexity of the model)
  • reduce the number of features (dimensionality reduction), which reduces # parameters
  • create more data
    • Bayesian formulation (virtual samples)
    • data augmentation

Linear Dimensionality Reduction

  • summarize correlated features with fewer features
  • How? Assume the data lives in a low-dimensional subspace, and project onto it with a linear operation: $z = W^T x$, where $W \in \mathbb{R}^{d \times k}$ with $k < d$

Principal Component Analysis (PCA)

Idea: if the data lives in a subspace it will look flat in some directions.

if we fit a Gaussian to such data, its covariance ellipse will be highly elongated (flat along most directions)

eigen-decomposition: $\Sigma = \Phi \Lambda \Phi^T$, where $\Phi = [\phi_1, \dots, \phi_d]$ holds orthonormal eigenvectors and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$

  • each $\phi_i$ defines an axis of the ellipse
  • each $\lambda_i$ defines the width (variance) along that axis

keep the large eigenvalues:

select the $k$ eigenvectors with the largest eigenvalues (principal components) to find the subspace where the data “lives”

Recipe of PCA:

  1. fit a Gaussian: $\mu = \frac{1}{n}\sum_i x_i$, $\Sigma = \frac{1}{n}\sum_i (x_i - \mu)(x_i - \mu)^T$
  2. eigen decomp: $\Sigma = \Phi \Lambda \Phi^T$
  3. sort eigenvalues from largest to smallest
  4. select top-k eigenvectors: $\Phi_k = [\phi_1, \dots, \phi_k]$
  5. project onto the subspace: $z = \Phi_k^T (x - \mu)$
  6. use $z$ as the new feature vector, apply the BDR as usual
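The recipe above maps directly onto NumPy; the data below is synthetic, generated near a 1-D subspace of 2-D so the recovered direction is easy to check.

```python
import numpy as np

def pca_fit(X, k):
    """Steps 1-4: fit the Gaussian, eigendecompose, keep the top-k eigenvectors."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)       # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]           # sort from largest to smallest
    return mu, vecs[:, order[:k]]

def pca_project(X, mu, Phi_k):
    """Step 5: z = Phi_k^T (x - mu)."""
    return (X - mu) @ Phi_k

rng = np.random.default_rng(0)
t = rng.normal(size=(500, 1))
X = t @ np.array([[3.0, 1.0]]) + 0.05 * rng.normal(size=(500, 2))  # "lives" along [3, 1]
mu, Phi_k = pca_fit(X, k=1)
z = pca_project(X, mu, Phi_k)
```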

Notes: This selection of $\Phi_k$

  1. maximizes the variance of the projected training data
  2. minimizes the reconstruction error of the training data

PCA can be implemented efficiently using the SVD. To choose $k$: pick a $k$ that works well downstream, or pick $k$ to preserve a desired fraction of the variance of the data

Assumption - “noise” variance is smaller than the signal variance

PCA is optimal for representation (but not necessarily for classification)

PCA has no way to fix this (it doesn’t use the class information)

Linear Discriminant Analysis (Fisher Linear Discriminant)

  • Find a linear projection that best separates the classes

input space (x): class mean $\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$

1-d space (z): project every point with $z = w^T x$

                  input space    1-d space
  class mean      $\mu_i$        $m_i = w^T \mu_i$
  class scatter                  $s_i^2 = \sum_{z_j \in C_i} (z_j - m_i)^2$
  input           $x$            $z = w^T x$

Idea: maximize the distance between the projected means $|m_1 - m_2|$

problem: this objective is unconstrained in $w$ (scaling $w$ scales the distance), so we need a normalization

Fisher’s Idea: normalize by the within-class scatter and maximize

$J(w) = \dfrac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$

i.e. high cohesion within each class, low coupling between classes: the projected classes should be tight and far apart
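Fisher's criterion has the closed-form maximizer $w \propto S_w^{-1}(\mu_1 - \mu_0)$, where $S_w$ is the within-class scatter matrix. A small sketch on synthetic, strongly correlated data (all numbers invented):

```python
import numpy as np

def fisher_direction(X0, X1):
    """w maximizing J(w): solve S_w w = (mu_1 - mu_0), then normalize."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # within-class scatter = sum of per-class scatter matrices
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + np.cov(X1, rowvar=False) * (len(X1) - 1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])     # shared, strongly correlated covariance
X0 = rng.multivariate_normal([0.0, 0.0], cov, 500)
X1 = rng.multivariate_normal([1.0, 0.0], cov, 500)
w = fisher_direction(X0, X1)
```

Unlike projecting onto the raw mean difference [1, 0], Fisher's $w$ tilts against the shared correlation (roughly proportional to [1, -0.9] here), which shrinks the within-class scatter of the projection.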