Icosian Reflections

…a tendency to systematize and a keen sense

that we live in a broken world.

CS 181 (Machine Learning)

These are my lecture notes from CS 181: Machine Learning (Canvas), taught by Ryan Adams at Harvard. I am an enrolled student, and these notes are in no way official; they're reproduced by me from Prof. Adams's lectures in Spring 2015, and are licensed Creative Commons Attribution-NonCommercial-ShareAlike 4.0, which supercedes any other site-wide licenses.
Creative Commons License

\(W_1\): Supervised Learning, Linear Regression, Maximum Likelihood (1)

Some Math

  • An input space \(\X\).
  • An output space \(\Y\).
  • An induction function \(y:\X\to\Y\).

Notation: \(y:\X\to\Y\) means "\(y\) is a function from \(\X\) to \(\Y\).

Examples, by \(\mathscr Y\)

  • Regression (predicting some scalar): \(\Y=\R\).
  • Classification (sorting into categories): \(\Y=\{y_1,y_2,\ldots\}\).
  • Ordinal regression (putting things in order): \(\mathscr Y=\N\).
  • Structured prediction (e.g. segmentation, ranking): \(\mathscr Y\) is some other combinatorially-defined space.

Sample Method: Nearest Neighbors

  • Works for regression, classification.
  • Given an unknown thing, take a "vote" over the \(k\) "nearest neighbors", i.e. the things it's most like.
  • If \(k\) is too large, you're just defaulting to the average.
  • If \(k\) is too small, you've got a bad problem with noise.
  • Maybe weighting all neighbors by 'nearness' solves this problem?
  • Nonparametric, because it's not building an explicit parametrization of \(\mathscr Y\) (maybe this is a good thing?).
  • Assumes that the desired task looks a lot like the training data.
  • this'll be an important point throughout the class)
  • sim. that you've got enough training data...
  • Slow/fat, since you need to keep all of your training data around to compute metrics on.
  • Generally good as an exploratory, thrown-against-the-wall method.

What does it mean to be "good at" the problem?

  • Usually, we'll talk about a loss function -- a quantitative concept of what it means to do badly at a prediction.
  • e.g. 0/1 loss (for binary classification) -- a 0 if you're right, and a 1 if you're not.
  • But be careful, sometimes misclassifications aren't symmetrically bad, even in the binary case.
  • (e.g. false-pos vs. false-neg in cancer diagnosis)
  • So in general, a loss function is a function \((\text{prediction},\text{actual})\to\R\).
  • e.g. for regression, squared error \(\big(y(x_0)-y_0\big)^2\) -- we like this a lot because it's differentiable at zero, the minimizer is the mean, &c.
    Read more elsewhere: Ben Kuhn | Why Squared Error?
  • Alternatively, absolute error \(\big|y(x_0)-y_0\big|\).
  • Maybe this makes sense for money?
  • In reinforcement-learning, we'll often talk about reward functions instead.

A parable: Choosing the wrong loss/reward function. In grad school, I [RA] was trying to get a robot (a trashcan with wheels) to roll down a hallway without bumping into things. So I gave it a function that rewarded it when it moved and punished it when it ran into things.

So of course, it learned to turn in tight little circles.

Linear Regression

  • Input format: \(\vec x\in\R^D\).
  • Output format: \(t\in\R\).
  • \(\ds{y(\vec x,\vec w)=w_0+\sum_i^Dw_ix_i}\).
  • Data: \(\{(x_n,t_n)\}_{n=1}^N\).
  • Game: Find a good \(\vec w\). (we'll use least-squares)
    \[E_D\lr{\vec w}\dq\frac12\sum_{n=0}^N\lr{t_n-\lr{\vec w\tr\vec x}_n}^2=\f12\n{\vec t-\vec w\tr\vec x}_2.\]

Notation: \(\vec x\tr\) means "the transpose of \(\vec x\)", i.e. flip it about the diagonal. For a vector, this means turn it sideways.

  • ...with the convention that \(\vec x_0\dq1\), for convenience.
  • So \(w^*\dq\argmin_{\vec w}E_D\lr{\vec w}\).

Notation: \(\argmin_xf(x)\) means "the \(x\) that minimizes \(f(x)\)".

But we can solve (minimize) this analytically!

  • Find the gradient \(\grad_{\!\vec w}E_D\).

Notation: \(\grad:(\R^n\to\R)\to(\R^n\to\R)\)
\[\grad f\dq\aryc{\f{\del f}{\del x_1}\\\f{\del f}{\del x_2}\\\vdots\\\f{\del f}{\del x_D}}\]
\[\grad f(\vec x)\dq\aryc{\f{\del f}{\del x_1}(\vec x)\\\f{\del f}{\del x_2}(\vec x)\\\vdots\\\f{\del f}{\del x_D}(\vec x)}\]

Basically, \(\grad\) takes a function and gives you a vector of its partial derivatives in every direction.

  • Set \(\grad_{\!\vec w}E_D(\vec w)=0\).
  • See \[\grad_{\!\vec w}E_D\lr{\vec w}=-\sum_n^N\lr{t_n-\lr{\vec w\tr\vec x}_n}x_n.\]
  • We can write this as a design matrix \(X\):
    \[X\dq\aryc{x_1\tr\\x_2\tr\\\vdots\\x_N\tr}=\aryf{(\vec x_1)_1&(\vec x_1)_2&\ldots&(\vec x_1)_4\\(\vec x_2)_1&(\vec x_2)_2&\ldots&(\vec x_2)_4\\\vdots&\vdots&\ddots&\vdots\\(\vec x_N)_1&(\vec x_N)_2&\ldots&(\vec x_N)_4}\]
    \[\vec t=\aryc{t_1\\t_2\\\vdots\\t_N}.\]
  • So then we have
    \[E_D(\vec w)=\f12\lr{X\vec w-\vec t}\tr\lr{X\vec w-\vec t}\]
    \[\grad_{\vec w}E_D\lr{\vec w}=X\tr\lr{X\vec w-\vec t}=0\]
    \[X\tr X\vec w=X\tr\vec t\]
    \[\vec w=\lr{X\tr X}^{-1}X\tr\vec t.\]

Alternative Mindset: Maximum-Likelihood

  • Goodness of param = \(P(\text{data}|\text{param is true})\).
  • ...to be continued...

Quote: We're living in a novel age: not limited by our computing power; not limited by our data resources. We're really only limited by our understanding.


Linear regression, a review of last week

Digression: Matrix calculus looks a lot more intimidating than it is -- when in doubt, you can just write out everything, properly indexed, and bash through it that way. But one thing that helps, when I'm dealing with some particularly thorny matrix calculation, is to write out rectangles, label what each axis corresponds to (or even just how long they are), and make sure everything matches up.

At other times in your undergraduate CS training, you've been doing discrete math and continuous math isolated from each other. Warning: all of a sudden, you're going to have to remember all sorts of calculus you haven't had to deal with since high school, and if you feel rusty, that's normal.

  • Data: \(\s{\lr{\vec x_n,t_n}:x_n\in\R^D}_{n=1}^N\), with \(\lr{\vec x_n}_0=c\)
  • Data Matrix: \[X\in\R^{N\x D}\dq\aryc{\vec x_1\tr\\\vec x_2\tr\\\vdots\\\vec x_N\tr}\] \[\vec t\dq\aryc{t_1\\t_2\\\vdots\\t_N}\]
  • Regression Function: \(y\lr{\vec w,\vec x}\dq\vec w\tr\vec x\)
  • Error: \(E_D\lr{\vec w}=\f12\sum_{n=1}^N\lr{t_n-y\lr{\vec w,\vec x_n}}^2=\f12\sum_{n=1}^N\lr{t_n-\vec w\tr\vec x_n}^2=\f12\lr{\vec t-X\vec w}\tr\lr{\vec t-X\vec w}\)
  • (figure 2)
  • ...

Reminder: \(\grad f\lr{\vec x}=0\mim\vec x\tis{a critical point}\), similarly to \(f'(x)=0\mim x\tis{a critical point}\).

\[\grad_{\vec w}E_D\lr{\vec w}=\f12\grad_{\vec w}\lr{\vec t\tr\vec t-2\vec w\tr X\tr\vec t+\vec w\tr X\tr X\vec w}\]
\[=\f12\grad_{\vec w}\vec t\tr\vec t-\grad_{\vec w}\vec w\tr X\vec\tr\vec t+\f12\grad_{\vec w}\vec w\tr X\tr X\vec w\]

Some identities: \[\grad_{\vec z}\vec z\tr\vec a=\grad_{\vec z}\vec a\tr\vec z=a\]
\[\grad_{\vec z}\vec z\tr A\vec z=\lr{A+A\tr}\vec z\]

[=0-X\tr\vec t+X\tr X\vec w\text{\ldots{}solve for }=0]

  • (figure 2)

Question: How do we know that we have a global minimum, and not some other criticum?

Answer: This problem is globally convex (uniformly bowl-shaped). Other, later problems are going to almost always be nonconvex

Question: What if \\(X\tr X\\) isn't invertible?

Answer: That'll happen if you have more dimensions than datapoints...which might give you problems for reasons which may be obvious.


  • \(t_n=y\lr{\vec w,\vec x_n}+\ep_n\)
  • \(\ep_n\sim N(0,\bt^{-1})\)
  • \(\bt^{-1}\) variance
  • \(\bt\) precision
  • Likelihood for \(n\)th datum: \(p\lgr{t_n}{\vec x_n,\vec w,\bt}=N\lgr{t_n}{\vec x_n\tr\vec w,\bt^{-1}}\)
  • Likelihood for all \(N\) data: \(p\lgr{\vec t}{X,\vec w,\bt}=\prod_{n=1}^NN\lgr{t_n}{\vec x_n\tr\vec w,\bt^{-1}}\)
  • So we take the log...why? Because products suck. (Specifically, in the context of floating-point arithmetic.)
  • Log-ikelihood for all \(N\) data: \(\log\lr{p\lgr{\vec t}{X,\vec w,\bt}}=\sum_{n=1}^N\log\lr{N\lgr{t_n}{\vec x_n\tr\vec w,\bt^{-1}}}\)


Bayesian Inference

(see also An Intuitive Explanation of Bayes' Theorem, and A Technical Explanation of Technical Explanation, by E. Yudkowsky)

"A hypothesis-processing machine"
\[p(\te)\cdot p\lgr{\text{DATA}}\te\propto p\lgr\te{\text{DATA}}\]
Then new data (DATA') arrive...
\[p\lgr\te{\text{DATA},\text{DATA'}}\propto p\lgr\te{\text{DATA}}\cdot p\lgr{\text{DATA'}}\te\]
\[\propto p(\te)\cdot p\lgr{\text{DATA}}\te\cdot p\lgr{\text{DATA'}}\te\]

As it turns out, "\(\te\)" / "DATA" is a terrible naming convention if you're ever going to say it out loud.

A conjugate prior has the property that, when exposed to data (with some \(p\lgr{\text{DATA}}\te\)), the posterior has the same functional form (up to, e.g. concentration). In general, these will be the exponential family distributions:
\[p\lgr{\text{DATA}}{\vec\te}\propto\Exp{\vec\te\tr\vec T(\text{DATA})}\]
\[p\lgr{\vec\te}{\text{DATA}}\propto\Exp{\lr{\vec T(\text{DATA})+\vec\psi}\tr\vec\te}\]

(example: beta-bernoulli)

Other examples:


Multivariate Gaussian in N dimensions

  • Unknown mean \(\vec\mu\sim N(\vec m_0,S_0)\) (prior on \(\mu\))
  • Known covariance \(\Sg\)
  • Data:\(\s{\vec x_n}_{n=1}^N\vec x_n\in\R^D\)
    \[N\lgr{\vec x}{\vec\mu,\Sg}=(2\pi)^{-\nf D2}\abs{\Sg}^{-/nf12}\Exp{-\f12\lr{\vec x-\vec\mu}\tr\Sg\tr\lr{\vec x-\vec\mu}}\]
    \[p\lgr{\vec\mu}{\vec m_0,S_0,\s{\vec x_n}_{n=1}^N}\propto N\lgr{\vec\mu}{m_0,S_0}\prod_nN\lgr{\vec x_n}{\vec\mu,\Sg}\]
    \[=C-\f12\lr{\vec\mu\tr S_0^{-1}\vec\mu-2\vec\mu\tr S_0^{-1}\vec m_0+2\vec\mu\tr\Sg^{-1}\lr{\sum_{n=1}^N\vec x_n}+N\vec\mu\tr\Sg^{-1}\vec\mu}\]
    and so:
    \[\vec m_N=S_N\lr{S_0^{-1}\vec m_0+N\Sg^{-1}\vec x_N}\]
    \[\log p\lgr{\vec\mu}{\vec m_0,S_0,\Sg,\s{\vec x_n}_{n=1}^N}=-\f12\lr{\vec\mu-\vec m_N}\tr S_N^{-1}\lr{\vec\mu-\vec m_N}+C\]

\(W_4\): Classification

  • Features: \(\vec x\to\vec\phi\lr{\vec x}\).
  • k classes: \(c_1,\hh,c_k\)
  • In particular cases, these may be (partially, well) ordered, but in general, they need not be.
  • Data: \(\s{(\vec x_n,t_n)}_{n=1}^N\)
  • Binary \((k=2)\) is a common case: \(t_n\in\2\), often notated \(t_n\in\s{-1,1}\)
  • Multi-class often given as \(\vec t_n\in\2^k\), with \(\vec t_n=(;0;0;1;0;0;)\tr\).

Linear Classification

Decision boundary (planar function): \(\vec w\tr\vec x+w_0\gtrless0\)

  • Note that \(\vec w\) will be normal to the plane (and that the origin is \(-\f{w_0}{\n{\vec w}}\)) from the plane).

Linear separability: the property that the data can be separated by straight lines (planes)

Caution: Least-squares-ing the output (as if classes were integers) is dumb, and dumb in the stupidest possible way -- it chokes on even-more-obvious examples.


(see also Wikipedia)

Activation function:
\[f(a)=\pcwz{+1\pif a\geq0\-1\pif a<0}\]
\[f\lr{\vec w\tr\vec\phi\lr{\vec x}}=y\lr{\vec x}\]

"Ideal" Loss:
\[E\lr{\vec w}\dq-\sum_{n=1}^Nf\lr{\vec w\tr\vec\phi\lr{\vec x_n}}t_n\]

  • Not differentiable, so we can't do a 0-gradient analysis.

Perceptron Loss:
\[E_p\lr{\vec w}\dq-\sum_{n=1}Nt_n\vec w\tr\vec\phi\lr{\vec x_n}\]

  • Guaranteed to find a perfectly separating hyperplane if one exists (but nnec a maximally-separating one).

Perceptron Learning:

initialize \(\vec w\) arbitrarily
while(\(\vec w\) is wrong) :
    for n in [1,N] :
        if predict(n) is wrong :
            \(\vec w\leftarrow\vec w+t\_n\vec\phi\lr{\vec x\_n}\)

Better algorithm: Gradient descent = steps in steepest direction; Stochastic gradient descent = GD w/ Ndist noise.

  • So long as there's a positive inner product with steepest, you'll converge the same.
  • But you can often get Ndist-noisy approximations much more cheaply than perfect gradient calculations.
  • Mini-batch size usually, like 256 -- divisible by 16, since that the number of threads on a GPU.

(some things about Fisher's Linear Discriminant Analysis)

\(M_5\): Classification

Bayes:\[\f{\Prr{\vec x}{C_1}p(C_1)}{\Prr{\vec x}{C_1}+\Prr{\vec x}{C_2}\Pr(C_2)}\]
Bishop: given \(a\dq=\log\f{\Prr{\vec x}{C_1}}\Pr(C_1)}{\Prr{\vec x}{C_2}\Pr(C_2)}\),
Fun fact:
\[\grad\sg\lr{\vec a}=\sg\lr{\vec a}\lr{1-\sg\lr{\vec a}}\]

Naive Bayes: Dimensions are uncorrelated, ex class. So
\[\Prr{\vec x}{C_k}=\prod_{n=1}^D\Prr{x_n}{C_k},\]
given Normal noise:
(much uninteresting math)

\(W_5\): Neural Networks

  • Data: \(\s{\lr{\vec x_n,t_n}}_{n=1}^N\sbs P\lr{\R^D\x\R}\)
  • Vanilla LinReg: \(y\lr{\vec x,\vec w}=\vec w\vtr\vec x\)
  • Basis functions: \(\phi_j:X\to\R\im y\lr{\vec x_i,\vec w}=\vec w\vtr\vec\phi\lr{\vec x}\) -- but how to pick the \(\phi_j\)
  • Neural net="deep learning"=adaptive bsis function regression
  • Add parameters to \(\s{\phi_j}\), learned from the data
  • Typically, we'll use \(\phi_j\lr{\vec x}=\sg\lr{\vec w_j\vtr\vec x}\) -- e.g. \(\sg\) can be tanh, logistic, re(ctified) l(inear) u(nit): "hard" \(\sg(z)=\max\s{0,z}\), "soft" \(\sg(z)=\log\lr{1+e^z}\)

Q: At a high level, I'm having difficulty understanding why I'd want to choose one over the other.

A: Me too, man. ReLUs are neat for vision, because you'd like classifications to be invariant under "make everything brighter", but opinions in other cases are very, very mixed.

Q: Can you fit the activation function?

A: Sure, that's some of my research...but it gets kinda hairy.

Q: Why do you use activation functions at all?

A: Well, if you only used linear (or identity) activations, you'd only recover LinReg...

  • Input layer \(\to\) hidden layer \(\to\) output layer
  • In general, it's a particular connectivity DAG

Q: Why would you ever need more layers?

A: Well, the whole schtick of deep learning is that the early layers perform adaptive feature extraction, later layers reason about elementary conjunctions of those features, and later-still layers then do abstract reasoning on that.

Q: Are there heuristics about how many layers to use?

A: No. Or maybe: three layers, and then make them as big as you can fit in your GPU.

So, we need a gradient... We could

  • Do calculus
  • Symbolic differentiation
  • Finite differences