Problem
optimal decisions in problems with uncertainty
Framework
- states: $y \in \{1, \dots, C\}$
- prior: $p(y)$ = probability of a state
- observer: measures features $x$ (a random variable)
- class conditional distribution: $p(x \mid y)$ — conditional on state (class)
- one CCD for each state
- Decision function: $g(x)$ — use the observation to make a decision about the state
- Loss function: $L(g(x), y)$ — penalizes deciding wrong (vs. the true state)
- 0-1 Loss: $L(g(x), y) = \mathbb{1}[g(x) \neq y]$
Bayesian Decision Rule
Risk — expected value of the loss function: $R = E_{x,y}[L(g(x), y)]$
Since $R = \int_x \left[\sum_y L(g(x), y)\, p(y \mid x)\right] p(x)\, dx$ and $p(x) \geq 0$, minimizing the risk can be achieved by minimizing the conditional risk for each $x$.
For a particular $x$, choose the class that minimizes the conditional risk:
$$g^*(x) = \arg\min_i \sum_y L(i, y)\, p(y \mid x)$$
This is the Bayesian Decision Rule (BDR).
Classification and 0-1 loss
settings: $L(g(x), y) = \mathbb{1}[g(x) \neq y]$
conditional risk: $\sum_y \mathbb{1}[i \neq y]\, p(y \mid x) = 1 - p(i \mid x)$
BDR: $g^*(x) = \arg\max_i p(i \mid x)$
Equivalently (by Bayes rule, dropping $p(x)$): $g^*(x) = \arg\max_i p(x \mid i)\, p(i)$
Example - 2 class classification
given $x$: pick class 1 if $p(1 \mid x) > p(0 \mid x)$, equivalently: $\frac{p(x \mid 1)}{p(x \mid 0)} > \frac{p(0)}{p(1)}$
Summary:
for 0-1 loss function
- the BDR is the MAP rule (it sets the threshold in the Likelihood Ratio Test)
- Risk = prob of error
- BDR minimizes prob of error (no other decision rule is better)
- caveats: assuming our models are correct! (the CCD and the prior)
This is called a generative model
- Use data to learn the CCDs (modeling how features are generated)
- use the CCDs in the decision rule
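A minimal sketch of this generative recipe in Python: fit priors and Gaussian CCDs from labeled data, then classify with the BDR. The 1-D synthetic data (class means, spreads, sample counts) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D training data for two classes (hypothetical means/spreads).
x0 = rng.normal(0.0, 1.0, 500)   # class 0 samples
x1 = rng.normal(3.0, 1.0, 500)   # class 1 samples

# Learn the generative model: priors and Gaussian CCDs from the data.
prior = np.array([len(x0), len(x1)], dtype=float)
prior /= prior.sum()
mu = np.array([x0.mean(), x1.mean()])
sigma = np.array([x0.std(), x1.std()])

def log_gaussian(x, m, s):
    """Log-density of N(m, s^2) at x."""
    return -0.5 * ((x - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)

def bdr(x):
    """MAP rule: argmax_i  log p(x|i) + log p(i)."""
    scores = [log_gaussian(x, mu[i], sigma[i]) + np.log(prior[i]) for i in (0, 1)]
    return int(np.argmax(scores))

print(bdr(0.2), bdr(2.9))  # points near each class mean -> 0 1
```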
Example: Noisy Channel
decode: a bit $y \in \{0, 1\}$ is transmitted over a noisy channel, and $x$ is received.
Goal: given $x$, recover the bit.
Model:
- prior: $p(0), p(1)$ (e.g. $1/2$ each for a fair source)
- CCD: assume Gaussian additive noise: $x = y + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, i.e. $p(x \mid y) = N(x;\, y,\, \sigma^2)$
BDR for 0-1 Loss: $g^*(x) = \arg\max_y\, \log p(x \mid y) + \log p(y)$
Hence: pick $\hat y = 1$ when:
$$x > \frac{1}{2} + \sigma^2 \log\frac{p(0)}{p(1)}$$
(with equal priors, the threshold is simply $1/2$).
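A small numeric sketch of this decoder, assuming (hypothetically) that bit $y$ is transmitted as the signal level $y \in \{0, 1\}$ with noise $\sigma = 0.5$ and equal priors:

```python
import numpy as np

sigma, p0, p1 = 0.5, 0.5, 0.5   # assumed noise level and priors (hypothetical)
mu0, mu1 = 0.0, 1.0             # transmitted signal levels for bits 0 / 1

# BDR threshold: midpoint of the signal levels, shifted by the prior ratio.
t = (mu0 + mu1) / 2 + sigma**2 / (mu1 - mu0) * np.log(p0 / p1)

def decode(x):
    """Pick bit 1 when the received value exceeds the threshold."""
    return 1 if x > t else 0

print(t)            # 0.5 with equal priors
print(decode(0.8))  # 1
```

Raising $p(0)$ (or the noise $\sigma$) pushes the threshold up, demanding stronger evidence before deciding 1.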
Gaussian Classifier
- CCDs are Gaussians: $p(x \mid i) = N(x;\, \mu_i, \Sigma_i)$
- No assumption on prior
Special case:
- Assume $\Sigma_i = \sigma^2 I$ for all classes (shared isotropic covariance)
- discriminant function: $g_i(x) = \log p(x \mid i) + \log p(i) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \log p(i) + \text{const}$
- the decision boundary between classes $i$ and $j$ is $w^\top (x - x_0) = 0$ with
$$w = \mu_i - \mu_j, \qquad x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{p(i)}{p(j)}\,(\mu_i - \mu_j),$$
i.e. a hyperplane that passes through $x_0$ and is normal to $w$
Goal: find the optimal decision rule $g^*$ for the given assumptions (prior, CCD, loss function)
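The hyperplane geometry can be checked numerically; a sketch, where the 2-D class means, covariance scale, and priors are all hypothetical:

```python
import numpy as np

# Hypothetical 2-D class means, shared isotropic covariance sigma^2 * I, priors.
mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 2.0])
sigma2, p_i, p_j = 1.0, 0.5, 0.5

# Decision boundary: w^T (x - x0) = 0, normal w, passing through x0.
w = mu_i - mu_j
x0 = (mu_i + mu_j) / 2 \
     - sigma2 / np.dot(mu_i - mu_j, mu_i - mu_j) * np.log(p_i / p_j) * (mu_i - mu_j)

def pick_i(x):
    """True when the BDR chooses class i (the mu_i side of the hyperplane)."""
    return np.dot(w, x - x0) > 0

print(x0)                            # midpoint [1, 1] for equal priors
print(pick_i(np.array([0.5, 0.5])))  # closer to mu_i -> True
```

With unequal priors, $x_0$ slides along the segment between the means, away from the more probable class.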
Mode, Mean, Median
(the optimal prediction depends on the loss: 0-1 loss → posterior mode (MAP), squared loss → posterior mean, absolute loss → posterior median)
Loss function: $L(g(x), y)$, defined for every (predicted, true) value pair
Conditional Risk
Given $x$, the risk of the system is: $R(g(x) \mid x) = \sum_y L(g(x), y)\, p(y \mid x)$
Total Risk: $R = \int_x R(g(x) \mid x)\, p(x)\, dx$
Choose the action that gives the smallest possible value of $R(g(x) \mid x)$ — it combines likelihood, loss, and prior.
To minimize the total risk, minimize the conditional risk at each $x$.
Two-class decisions: Likelihood Ratio Test (LRT)
With losses $L_{ij}$ (the loss of deciding $i$ when the true class is $j$), minimizing the conditional risk $\sum_y L(g(x), y)\, p(y \mid x)$ gives the LRT: pick class 1 when
$$\frac{p(x \mid 1)}{p(x \mid 0)} > \frac{(L_{10} - L_{00})\, p(0)}{(L_{01} - L_{11})\, p(1)}$$
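A sketch of the LRT with an asymmetric loss matrix (all numbers hypothetical): making a miss on class 1 ten times costlier than a false alarm lowers the effective threshold, so the rule picks class 1 more eagerly than MAP would.

```python
from math import exp, pi, sqrt

# Hypothetical two-class problem: Gaussian CCDs and an asymmetric loss matrix.
# L[i][j] = loss of deciding class i when the true class is j.
L = [[0.0, 10.0],   # missing class 1 costs 10
     [1.0,  0.0]]   # false alarm costs 1
p0, p1 = 0.5, 0.5
mu0, mu1, s = 0.0, 2.0, 1.0

def likelihood(x, mu):
    """Gaussian density N(mu, s^2) at x."""
    return exp(-0.5 * ((x - mu) / s) ** 2) / (s * sqrt(2 * pi))

def lrt(x):
    """Pick 1 when p(x|1)/p(x|0) exceeds the loss-weighted prior ratio."""
    threshold = (L[1][0] - L[0][0]) * p0 / ((L[0][1] - L[1][1]) * p1)
    return 1 if likelihood(x, mu1) / likelihood(x, mu0) > threshold else 0

print(lrt(0.0))  # picked as class 1 despite sitting on class 0's mean
```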
Dimensionality
Weirdness of high-dimensional spaces
volume of a sphere of radius $r$ in $d$ dimensions: $V_d(r) = c_d\, r^d$
Volume of the hypercube $[-1, 1]^d$: $2^d$
corner vector $(1, \dots, 1)$ has length $\sqrt{d}$, while an axis vector $(1, 0, \dots, 0)$ has length 1
hypersphere shell of thickness $\epsilon$: the fraction of volume outside the inner sphere of radius $r - \epsilon$ but inside the sphere of radius $r$ is
$$1 - \frac{V_d(r - \epsilon)}{V_d(r)} = 1 - \left(1 - \frac{\epsilon}{r}\right)^d \xrightarrow{d \to \infty} 1$$
so in high dimensions (almost) all the volume is inside the shell
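This shrinking-interior effect is easy to tabulate (shell thickness $\epsilon = 0.01$ and the dimensions are chosen arbitrarily):

```python
# Fraction of a radius-1 hypersphere's volume in a shell of thickness eps:
# 1 - (1 - eps)**d, which tends to 1 as the dimension d grows.
eps = 0.01
for d in (1, 10, 100, 1000):
    print(d, 1 - (1 - eps) ** d)
```

Already at $d = 1000$, a shell holding 1% of the radius contains over 99.99% of the volume.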
High dimensional Gaussian
By the law of large numbers, the average $\frac{1}{d}\sum_{i=1}^d x_i^2$ is concentrated around its mean $\sigma^2$.
As $d$ increases, $\frac{1}{d}\|x\|^2$ converges to $\sigma^2$. Thus the length of almost all sample vectors will be $\approx \sigma\sqrt{d}$. Thus in high dimensions a Gaussian is (effectively) a spherical shell of radius $\sigma\sqrt{d}$, and most of the probability mass is in this shell. The point of maximum density is still the mean (0).
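A quick simulation of this concentration (sample counts and dimensions chosen arbitrarily): the normalized length $\|x\|/\sqrt{d}$ of standard-Gaussian samples clusters ever more tightly around 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample standard Gaussians in growing dimension d and look at vector lengths:
# they concentrate around sqrt(d), even though the density peaks at 0.
for d in (2, 100, 10000):
    x = rng.standard_normal((1000, d))
    norms = np.linalg.norm(x, axis=1)
    print(d, norms.mean() / np.sqrt(d), norms.std() / np.sqrt(d))
```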
Curse of Dimensionality
In theory, adding new features cannot hurt: the probability of error does not increase.
- if the new features are informative the error decreases; if they are uninformative, the CCDs still overlap and the error stays the same
In practice, the quality of the CCD estimates suffers:
- density estimates in high dimensions need more data! e.g. a high-dim histogram
- on average, suppose we want 1 sample / bin: with $n$ bins per axis, that requires $n^d$ samples
In general, the desired training set size grows with $k$, the number of parameters of the model.
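The bin-count blow-up above, tabulated with an arbitrary 10 bins per axis:

```python
# Number of histogram bins (and hence samples needed at ~1 sample per bin)
# explodes exponentially: n bins per axis in d dimensions -> n**d bins total.
n = 10
for d in (1, 2, 5, 10):
    print(d, n ** d)   # d = 10 already needs ten billion samples
```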
CCD
CCD = Class Conditional Density $P(x \mid C_i)$. Ways to cope with limited data when estimating CCDs:
- reduce number of parameters (complexity of model)
- reduce number of features (dimensionality reduction) reduce # parameters
- create more data
- Bayesian formulation (virtual samples)
- data augmentation
Linear Dimensionality Reduction
- summarize correlated features with fewer features
- How? Assume the data lives in a low-dimensional subspace, and map into it with a linear operation: $z = W^\top x$, with $W \in \mathbb{R}^{d \times k}$, $k < d$
Principal Component Analysis (PCA)
Idea: if the data lives in a subspace, it will look flat in some directions.
If we fit a Gaussian, its covariance ellipse will be highly skewed (elongated along the subspace).
eigendecomposition: $\Sigma = \Phi \Lambda \Phi^\top$, $\Lambda = \text{diag}(\lambda_1, \dots, \lambda_d)$
- each eigenvector $\phi_i$ defines an axis of the ellipse
- each eigenvalue $\lambda_i$ defines the width (variance) along axis $\phi_i$
keep the directions with large eigenvalues:
select the $\phi_i$ with the largest eigenvalues (principal components) to find the subspace where the data "lives"
Recipe for PCA:
- calculate the Gaussian fit: $\mu = \frac{1}{n}\sum_i x_i$, $\Sigma = \frac{1}{n}\sum_i (x_i - \mu)(x_i - \mu)^\top$
- eigendecomposition: $\Sigma = \Phi \Lambda \Phi^\top$
- sort eigenvalues from largest to smallest
- select the top-$k$ eigenvectors: $\Phi_k = [\phi_1, \dots, \phi_k]$
- project onto them: $z = \Phi_k^\top (x - \mu)$
- use $z$ as the new feature vector, apply the BDR as usual
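The recipe above can be sketched with NumPy; the synthetic data, which lies near a hypothetical 1-D subspace of $\mathbb{R}^3$, is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data near a 1-D subspace of R^3 (signal direction + small noise).
t = rng.standard_normal(200)
direction = np.array([1.0, 2.0, 2.0]) / 3.0
X = np.outer(t, direction) + 0.05 * rng.standard_normal((200, 3))

# PCA recipe: fit Gaussian, eigendecompose, keep top-k eigenvectors, project.
mu = X.mean(axis=0)
Sigma = np.cov(X - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # sort from largest to smallest
k = 1
Phi_k = eigvecs[:, order[:k]]              # top-k principal components
Z = (X - mu) @ Phi_k                       # new k-dim feature vectors

print(eigvals[order])   # one large eigenvalue, two ~noise-sized ones
```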
Notes: this selection of $\Phi_k$
- maximizes the variance of the projected training data
- minimizes the reconstruction error of the training data
- can be implemented efficiently using the SVD
Choosing $k$: pick a $k$ that works (e.g. by validation), or pick $k$ to preserve a desired fraction of the variance of the data.
Assumption: the "noise" variance is smaller than the signal variance.
PCA is optimal for representation, but not necessarily for classification — it has no way to favor discriminative directions, since it doesn't use class information.
Linear Discriminant Analysis (Fisher Linear Discriminant)
- Find a linear projection $z = w^\top x$ that best separates the classes
| | input space ($x$) | 1-d space ($z$) |
|---|---|---|
| input | $x$ | $z = w^\top x$ |
| class mean | $\mu_i = \frac{1}{n_i}\sum_{x_j \in C_i} x_j$ | $m_i = w^\top \mu_i$ |
| class scatter | $S_i = \sum_{x_j \in C_i} (x_j - \mu_i)(x_j - \mu_i)^\top$ | $s_i^2 = \sum_{z_j \in C_i} (z_j - m_i)^2$ |
Idea: maximize the distance between the projected class means: $\max_w\, (m_1 - m_2)^2$.
Problem: $w$ is unconstrained — scaling $w$ scales the distance arbitrarily; we need a normalization.
Fisher's Idea: normalize by the within-class scatter, i.e. maximize
$$J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
(tight within each class, well separated between classes).
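A sketch of Fisher's direction, using the standard closed-form maximizer of $J(w)$, $w \propto S_w^{-1}(\mu_1 - \mu_2)$ with $S_w = S_1 + S_2$; the 2-D class means and sample counts are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 2-D classes with the same (identity) covariance.
X1 = rng.standard_normal((300, 2)) + np.array([0.0, 0.0])
X2 = rng.standard_normal((300, 2)) + np.array([3.0, 1.0])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

def scatter(X, mu):
    """Class scatter matrix sum_j (x_j - mu)(x_j - mu)^T."""
    D = X - mu
    return D.T @ D

Sw = scatter(X1, mu1) + scatter(X2, mu2)   # within-class scatter

# Fisher's direction maximizes (m1 - m2)^2 / (s1^2 + s2^2).
w = np.linalg.solve(Sw, mu1 - mu2)
w /= np.linalg.norm(w)

z1, z2 = X1 @ w, X2 @ w   # projected classes should be well separated
print(abs(z1.mean() - z2.mean()) / np.sqrt(z1.var() + z2.var()))
```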