Linear Classifiers

Generative Classifier

  1. Learn the CCDs (class-conditional densities) $p(x \mid y)$ from data
  2. Apply the BDR (Bayes decision rule) to get a decision rule

Data is used only in step 1; the decision rule is secondary.

  • Density estimation is an ill-posed (hence difficult) problem
    • which density model to use?

Vapnik's advice

When solving a given problem, try to avoid solving a more difficult problem as an intermediate step.

Discriminative approach: solve for the decision rule directly.

Linear Classifier (binary)

  • input: $x \in \mathbb{R}^{d}$
  • output: $y \in \{-1, +1\}$
  • find a linear function: $f(x) = w^{T}x + b$
    • separates the input space into 2 half-spaces
    • points with $f(x) > 0$ fall into the positive half-space
    • $f(x) = 0$ is the decision boundary

Decision Rule

$h(x) = \operatorname{sign}(f(x))$

The bias term can be included in $\tilde{w}$ by augmenting $x$ with a constant 1:

$$f(\tilde{x}) = \tilde{w}^{T}\tilde{x} = \begin{bmatrix} w \\ b \end{bmatrix}^{T} \begin{bmatrix} x \\ 1 \end{bmatrix} = w^{T}x + b$$
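A minimal numpy sketch of this augmentation (the function and variable names here are my own, for illustration):

```python
import numpy as np

def augment(X):
    """Append a constant-1 column so the bias b folds into the weight vector."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# f(x) = w^T x + b becomes f(x~) = w~^T x~ with w~ = [w; b]
w, b = np.array([2.0, -1.0]), 0.5
X = np.array([[1.0, 3.0], [0.0, -2.0]])
w_tilde = np.append(w, b)
assert np.allclose(augment(X) @ w_tilde, X @ w + b)
```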

Training Set

Given a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$:

  • correctly classified: $y_i f(x_i) > 0$
  • misclassified: $y_i f(x_i) < 0$

Ideal case: the 0-1 loss. It fails because its gradient is either 0 or undefined everywhere, so gradient-based learning cannot make progress.

Least Squares Classification (Label Regression)

Regress on the labels: fit $f(x) = w^{T}x + b$ to the targets $y_i \in \{-1, +1\}$ by minimizing $\sum_i (f(x_i) - y_i)^{2}$. FLD (Fisher's linear discriminant) is a version of LSC.
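A minimal sketch of this label regression using the augmented form above, assuming targets in $\{-1, +1\}$ (function names are my own):

```python
import numpy as np

def fit_lsc(X, y):
    """Least squares classification: regress augmented inputs onto labels in {-1, +1}."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    w_tilde, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return w_tilde

def predict(w_tilde, X):
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(Xa @ w_tilde)
```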

Perceptron

Rosenblatt 1962

criterion: only look at misclassified points

$\mathcal{M}$ = set of misclassified points = $\{\, i : y_i\, w^{T}x_i < 0 \,\}$

loss function, with larger loss for misclassified points farther from the boundary:

$$E_P(w) = -\sum_{i \in \mathcal{M}} y_i\, w^{T} x_i$$

Perceptron Algorithm:

look at one sample at a time and take a gradient step on $E_P$: for a misclassified sample $(x_i, y_i)$,

$$w \leftarrow w + \eta\, y_i x_i$$

it's now called stochastic gradient descent (SGD)
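A minimal numpy sketch of the algorithm, assuming labels in $\{-1, +1\}$ and $\eta = 1$ (see the next point; names are my own):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Perceptron with SGD updates: y in {-1, +1}, bias absorbed via augmentation."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:    # misclassified (or on the boundary)
                w += yi * xi          # rotate w toward the misclassified point
                updated = True
        if not updated:               # converged: every point correctly classified
            break
    return w
```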

How to set the learning rate $\eta$? It can simply be set to 1, since rescaling $w$ does not change the decision $\operatorname{sign}(w^{T}x)$:

  • each update rotates $w$ towards the misclassified point
  • the length of $w$ increases with each iteration, so each update has less relative effect than the previous one

Rosenblatt proved SGD converges in at most $R^{2}/\gamma^{2}$ iterations if the data is linearly separable

$\gamma = \min_i y_i\, w^{*T} x_i$ for $\|w^{*}\| = 1$ and $R = \max_i \|x_i\|$, where $w^{*}$ is the optimal unit weight vector that perfectly separates the two classes with the maximum possible margin.

  • many possible solutions, depending on initialization
  • does not converge if the data is not linearly separable.

Logistic Regression

(probabilistic approach)

Binary class: $y \in \{0, 1\}$

PS6-7: when the CCDs are Gaussian (with equal covariance), the posterior $P(y = 1 \mid x)$ is a sigmoid function $\sigma(f(x))$,

where $f(x)$ is linear

Sigmoid: $\sigma(a) = \dfrac{1}{1 + e^{-a}}$

With the BDR, the posterior $P(y \mid x)$ was determined by the CCDs. Now we directly learn the posterior.

linear function: $f(x) = w^{T}x + b$; probability: $P(y = 1 \mid x) = \sigma(f(x))$

Decision Rule: choose $y = 1$ if $P(y = 1 \mid x) > 1/2$, equivalently if $f(x) > 0$

Look at the # of parameters (for $x \in \mathbb{R}^{d}$):

  • BDR-Gauss: class means and covariances, $O(d^{2})$ parameters
  • Logistic Regression: $d + 1$, so less likely to overfit.

Learning: parameter estimation

let $\pi_i = \sigma(w^{T}x_i)$. Bernoulli likelihood: $P(y_i \mid x_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$. Data log-likelihood:

$$\ell(w) = \sum_{i=1}^{n} \big[\, y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \,\big]$$

MLE: maximize $\ell(w)$; setting the gradient $\nabla_w \ell = \sum_i (y_i - \pi_i)\, x_i$ to zero has no closed-form solution.

Find zero-crossings of the gradient with the Newton-Raphson method: $w \leftarrow w - H^{-1} \nabla_w \ell$, where $H$ is the Hessian of $\ell$.

weighted least squares: weights are $R_{ii} = \pi_i (1 - \pi_i)$, target is $z = Xw - R^{-1}(\pi - y)$; the Newton step becomes

$$w \leftarrow (X^{T} R X)^{-1} X^{T} R z$$

R: the weights depend on $w$ and are highest on unconfident predictions ($\pi_i$ near $1/2$). z: the error between prediction and target, which also depends on $w$, so both are recomputed at each iteration.

IRLS, IRWLS: iterative reweighted least squares
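A minimal numpy sketch of IRLS under the setup above; the small `eps` ridge term is my own addition for numerical stability:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(X, y, iters=20, eps=1e-8):
    """IRLS / Newton-Raphson for logistic regression; y in {0, 1}, bias absorbed."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xa.shape[1])
    for _ in range(iters):
        pi = sigmoid(Xa @ w)
        r = pi * (1.0 - pi)                     # R_ii: weights, largest near pi = 1/2
        z = Xa @ w - (pi - y) / (r + eps)       # working target z = Xw - R^{-1}(pi - y)
        A = Xa.T @ (r[:, None] * Xa) + eps * np.eye(Xa.shape[1])
        w = np.linalg.solve(A, Xa.T @ (r * z))  # w = (X^T R X)^{-1} X^T R z
    return w
```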

Comparison of Error (loss) functions

all have the form: $E = \sum_{i=1}^{n} L\big(y_i f(x_i)\big)$

“empirical risk minimization” - reduce the training error.

let $z = y\,f(x)$ (the margin): $z > 0$ iff the point is correctly classified.

Ideal 0-1 loss: $L_{0/1}(z) = \mathbf{1}[z < 0]$

LSC: $L(z) = (z - 1)^{2}$

penalizes answers that are "too correct" ($z \gg 1$) as well as misclassified ones

Perceptron: $L(z) = \max(0, -z)$

Logistic Regression: $L(z) = \log\big(1 + e^{-z}\big)$

some loss remains even for correctly classified points near the boundary: the effect is to push the boundary away from nearby points.

The losses for LSC & LR are convex approximations to the 0-1 loss.
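A short numpy sketch of these four losses as functions of the margin $z$ (scaling conventions are my own; e.g., the logistic loss upper-bounds the 0-1 loss only after dividing by $\log 2$):

```python
import numpy as np

z = np.linspace(-3, 3, 13)

loss_01   = (z < 0).astype(float)   # ideal 0-1 loss: undefined/zero gradient
loss_lsc  = (z - 1) ** 2            # LSC: also penalizes z >> 1 ("too correct")
loss_perc = np.maximum(0.0, -z)     # perceptron: zero loss on all correct points
loss_lr   = np.log1p(np.exp(-z))    # LR: small nonzero loss near the boundary

# loss_lsc and loss_lr are convex surrogates for loss_01 (up to scaling).
```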