For classification, we are interested in conditional probabilities. We teach a computer by providing examples.
$$ \mathcal{X} : \text{images}\\ \mathcal{Y} : \text{labels}\\ \mathcal{D} : \mathcal{X} \times \mathcal{Y} $$
For example, if $\mathcal{Y}$ is a boolean space, we will try to find $P(y=1 | x)$; the other case follows as $P(y=0 | x) = 1 - P(y=1 | x)$.
We can represent each image as a vector in $\mathbb{R}^d$. We do this with a feature extractor $\phi : \mathcal{X} \to \mathbb{R}^d$.
After extracting the features, we can classify images by finding a hyperplane that splits the images into two classes.
We then convert the resulting score into a probability with the sigmoid function
$$ \sigma (s) = \frac{1}{1+ e^{-s}}. $$
Each function $h$ defined as
$$ h(x; \mathbf{w}, b) = \sigma(\mathbf{w}^T\phi(x) + b) $$
is called a hypothesis. The family of all such functions is called a hypothesis class; its members are indexed by $\mathbf{w}$ and $b$, which are called the parameters. To find the best hypothesis, we need to find the best $\mathbf{w}$ and $b$.
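As a minimal sketch, here is one such hypothesis in NumPy; the names `sigmoid`, `hypothesis`, and the toy feature extractor `phi` are illustrative assumptions, not part of the notes above.

```python
import numpy as np

def sigmoid(s):
    # Logistic sigmoid: maps any real score s to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-s))

def hypothesis(x, w, b, phi):
    # Estimated probability that the label of x is 1,
    # given parameters w in R^d, scalar b, and feature extractor phi.
    return sigmoid(w @ phi(x) + b)

# Toy feature extractor: flatten a 2x2 "image" into a vector in R^4.
phi = lambda img: np.asarray(img, dtype=float).ravel()

x = [[0.2, 0.8], [0.5, 0.1]]          # a tiny "image"
w = np.array([1.0, -2.0, 0.5, 3.0])   # parameters (d = 4)
b = -0.1
print(hypothesis(x, w, b, phi))       # a probability in (0, 1)
```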
We use the training set to find the best hypothesis.
Given a training example $(x, y)$ and a candidate hypothesis $h(\cdot\,; \mathbf{w}, b)$, the value $h(x; \mathbf{w}, b)$ is the estimated probability that the label is 1. The idea is to compute the estimated probability assigned to the true label $y$.
The likelihood of a single data point is
$$ \operatorname{lik}(h(\cdot\,; \mathbf{w}, b), (x, y)) = h(x; \mathbf{w}, b)^y (1- h(x; \mathbf{w}, b))^{1-y} $$
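As a sketch, the same likelihood computed in Python, taking the output of the hypothesis above as input (the function name `likelihood` is an illustrative choice):

```python
def likelihood(p_hat, y):
    # Likelihood of a single labeled example, where p_hat = h(x; w, b) is the
    # estimated probability that the label is 1, and y is either 0 or 1.
    # Equals p_hat when y = 1 and (1 - p_hat) when y = 0.
    return p_hat ** y * (1.0 - p_hat) ** (1 - y)

# If the hypothesis outputs 0.8, the likelihood is 0.8 for true label 1
# and 0.2 for true label 0.
print(likelihood(0.8, 1), likelihood(0.8, 0))
```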