The pinhole camera model is limited because, from a single image, we cannot tell where along a ray an object lies: depth is lost in projection.
Multiple eyes are an evolutionary trait because they help us perceive depth. Objects that are closer to us appear to move faster across the visual field than objects farther away.
Edges, because they are relatively invariant to lighting and noise. They are easy to detect because of their large gradient magnitude.
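As a minimal sketch of this idea (the image is a made-up synthetic step edge, and central differences stand in for a proper Sobel filter): the gradient magnitude is large along the edge and zero in flat regions.

```python
import numpy as np

# Synthetic 8x8 image with a vertical step edge: dark left half, bright right half.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Central-difference gradients (a simple stand-in for Sobel filtering).
gy, gx = np.gradient(img)
grad_mag = np.hypot(gx, gy)

# The gradient magnitude peaks along the edge and vanishes in flat regions.
print(grad_mag[4, 3], grad_mag[4, 0])  # large at the edge, 0.0 in the flat region
```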
The main idea is that translating a patch over a good feature should produce a large change in intensity.
Suppose we have a window $W$ that we can shift by $(u,v)$. At each patch, we have a vector $\phi_0$ which is a list of intensities. To see how much the intensity changes, we compute the distance between the initial and shifted vectors. Let $E$ be the change in appearance when the window moves by $(u,v)$.
$$
\begin{aligned}
\phi_0 &= [I(0,0), I(0,1), \dots, I(n,n)]\\
\phi_1 &= [I(u,v), I(u, 1+v), \dots, I(n+u, n+v)]\\
E(u,v) &= \|\phi_0 - \phi_1\|_2^2\\
&= \sum_{(x,y)\in W}[I(x,y) - I(x+u, y+v)]^2.
\end{aligned}
$$
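The sum above can be computed directly as a sum of squared differences between the original and shifted windows. A minimal NumPy sketch (the image, window position, and window size are made-up illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.random((32, 32))  # hypothetical grayscale image

def E(I, x0, y0, n, u, v):
    """Sum of squared differences between an n x n window at (x0, y0)
    and the same window shifted by (u, v)."""
    w0 = I[x0:x0 + n, y0:y0 + n]
    w1 = I[x0 + u:x0 + n + u, y0 + v:y0 + n + v]
    return np.sum((w0 - w1) ** 2)

print(E(I, 8, 8, 5, 0, 0))  # zero shift -> E = 0
print(E(I, 8, 8, 5, 1, 2))  # nonzero shift on a textured patch -> E > 0
```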
To approximate $I(x+u, y+v)$, we can use a Taylor series expansion.
$$ f(x) = f(x_0) + Df(x_0)(x-x_0) + \frac{1}{2}D^2f(x_0)(x-x_0)^2 + \dots $$
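Truncating after the first-order term gives $I(x+u, y+v) \approx I(x,y) + I_x u + I_y v$. A quick numerical check of this approximation, using a made-up smooth intensity function whose derivatives we know in closed form:

```python
import numpy as np

# Hypothetical smooth intensity function and its partial derivatives.
def I(x, y):
    return np.sin(0.1 * x) + np.cos(0.1 * y)

def Ix(x, y):
    return 0.1 * np.cos(0.1 * x)

def Iy(x, y):
    return -0.1 * np.sin(0.1 * y)

x, y, u, v = 3.0, 5.0, 0.2, -0.1
exact = I(x + u, y + v)
approx = I(x, y) + Ix(x, y) * u + Iy(x, y) * v
print(abs(exact - approx))  # small error for a small shift (u, v)
```

The error shrinks quadratically as $(u,v)$ shrinks, which is why the first-order approximation is reasonable for small window shifts.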