For classification, we are interested in conditional probabilities. We teach a computer by providing examples.

$$ \mathcal{X}: \text{images} \\ \mathcal{Y}: \text{labels} \\ \mathcal{D}: \mathcal{X} \times \mathcal{Y} $$

For example, if $\mathcal{Y}$ is a boolean space, we will try to find $P(y=1 | x)$. Since $P(y=0 | x) = 1 - P(y=1 | x)$, this determines the whole conditional distribution.

Linear Algebra

We can represent images as vectors in $\mathbb{R}^d$. We do this with some feature extractor $\phi : \mathcal{X} \to \mathbb{R}^d$.
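As a minimal sketch (the specific features here are an assumption, not something fixed by these notes), $\phi$ could simply flatten an image's pixel intensities into a vector:

```python
import numpy as np

def phi(image: np.ndarray) -> np.ndarray:
    """A minimal feature extractor: flatten pixel intensities into a vector in R^d.

    `image` is assumed to be a 2-D array of grayscale values; a real feature
    extractor (edges, histograms, learned embeddings, ...) would be richer.
    """
    return image.astype(np.float64).ravel()

# Example: a 28x28 image becomes a feature vector with d = 784 entries.
x = np.random.rand(28, 28)
print(phi(x).shape)  # (784,)
```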

Linear classifiers

After extracting the features, we can classify images by finding a hyperplane in feature space that separates the two classes.

We then turn the resulting score into a probability with the sigmoid function.

$$ \sigma (s) = \frac{1}{1+ e^{-s}}. $$
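A small sketch of how the sigmoid maps the score $s = \mathbf{w}^T\phi(x) + b$ to a probability (the numeric scores below are illustrative):

```python
import numpy as np

def sigmoid(s: float) -> float:
    """sigma(s) = 1 / (1 + exp(-s)), mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

# The hyperplane w^T phi(x) + b = 0 separates the two classes;
# the sigmoid turns the signed score into a probability of class 1.
print(sigmoid(0.0))   # 0.5    -- exactly on the decision boundary
print(sigmoid(4.0))   # ~0.982 -- confidently class 1
print(sigmoid(-4.0))  # ~0.018 -- confidently class 0
```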

Each function $h$ defined as

$$ h(x; \mathbf{w}, b) = \sigma(\mathbf{w}^T\phi(x) + b) $$

is called a hypothesis. The family of all such functions is called a hypothesis class; its members are indexed by $\mathbf{w}$ and $b$, which are called parameters. To find the best hypothesis, we need to find the best $\mathbf{w}$ and $b$.
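Putting the pieces together, one hypothesis from this class can be sketched as follows (`phi`, `w`, and `b` are assumed to be given; the names are illustrative):

```python
import numpy as np

def h(x: np.ndarray, w: np.ndarray, b: float, phi) -> float:
    """h(x; w, b) = sigma(w^T phi(x) + b): estimated probability that y = 1."""
    s = float(w @ phi(x) + b)
    return 1.0 / (1.0 + np.exp(-s))

# Each choice of (w, b) picks out one hypothesis from the hypothesis class.
```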

Training

We use the training set to find the best hypothesis.

Given an example $(x,y)$ and a candidate hypothesis $h(\cdot; \mathbf{w}, b)$, the value $h(x; \mathbf{w}, b)$ is the estimated probability that the label is 1. Idea: compute the estimated probability of the true label $y$.

The likelihood of a single data point is

$$ \mathrm{li}(h(x; \mathbf{w}, b), y) = h(x; \mathbf{w}, b)^y(1- h(x; \mathbf{w}, b))^{1-y} $$
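A sketch of this Bernoulli likelihood for one labelled example under a candidate $(\mathbf{w}, b)$ (the function name and arguments are illustrative):

```python
import numpy as np

def likelihood(x: np.ndarray, y: int, w: np.ndarray, b: float, phi) -> float:
    """Bernoulli likelihood of one example (x, y):
    h(x; w, b)^y * (1 - h(x; w, b))^(1 - y).
    """
    p = 1.0 / (1.0 + np.exp(-(float(w @ phi(x) + b))))  # h(x; w, b) = P(y = 1 | x)
    return p ** y * (1.0 - p) ** (1 - y)

# y = 1 rewards a large p, y = 0 rewards a small p, so a good hypothesis
# assigns high likelihood to the true label.
```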