Tip of the Day

Do not go where the path may lead, go instead where there is no path and leave a trail.

What are kernels in machine learning and SVMs, and why do we need them?

A lot of awe and mysticism is associated with kernels, but we think they are not that big a deal. Kernels are a combination of two good ideas; they have one important property and are subject to one major limitation. It is also unfortunate that support vector machines and kernels are tied so tightly together. Kernels are the idea of summing functions that imitate similarity (induce a positive-definite encoding of nearness), and support vector machines are the idea of solving a clever dual problem to maximize a quantity called margin. Each is a useful tool even without the other.
The two good ideas are related and, unfortunately, often treated as if they were the same. They are:
  1. The “kernel trick.” Adding new features/variables that are functions of your other input variables can change linearly inseparable problems into linearly separable problems. For example, if our points are encoded not as u(i) = (x(i),y(i)) but as u(i) = (x(i),y(i),x(i)*x(i),y(i)*y(i),x(i)*y(i)), then concepts that are quadratic in the original coordinates become linear in the expanded coordinates.
  2. Often you don’t need the coordinates of u(i). You are only interested in functions of distances ||u(i)-u(j)||^2, and in many cases you can get at these through inner products and relations like ||u(i)-u(j)||^2 = <u(i),u(i)> + <u(j),u(j)> – 2<u(i),u(j)>.
We will expand on these issues later.
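To make both ideas concrete, here is a minimal numpy sketch (the quadratic feature map and the sample points are our own, purely for illustration):

  import numpy as np

  # Idea 1: expand (x, y) into quadratic features; a circular boundary
  # x^2 + y^2 = r^2 in the original coordinates becomes a linear boundary
  # in the expanded coordinates.
  def quad_features(p):
      x, y = p
      return np.array([x, y, x * x, y * y, x * y])

  # Idea 2: squared distances can be recovered from inner products alone.
  u = np.array([1.0, 2.0])
  v = np.array([3.0, -1.0])
  lhs = np.sum((u - v) ** 2)            # ||u - v||^2
  rhs = u @ u + v @ v - 2 * (u @ v)     # <u,u> + <v,v> - 2<u,v>
  assert np.isclose(lhs, rhs)
  print(quad_features(u))               # [1. 2. 1. 4. 2.]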
The important property is that kernels look like inner products in a transformed space. The definition of a kernel is: there exists a magic function phi() such that for all u,v:
k(u,v) = < phi(u), phi(v) > .
This means that k(.,.) is behaving like an inner product in some (possibly unknown) space. The important consequence is positive semi-definiteness, which implies k(u,u) ≥ 0 for all u (and this just follows from the fact about inner products over the real numbers that <z,z> ≥ 0 for all z). This is why optimization problems that use the kernel as their encoding are well formed (such as the optimization problem of maximizing margin, which is how support vector machines work). You can “regularize” optimization problems with a kernel penalty because it behaves a lot like a norm. Without the positive semi-definite property all of these optimization problems would be able to “run to negative infinity” or use negative terms (which are not possible from a kernel) to hide high error rates. The limits of the kernel functions (not being able to turn distance penalties into bonuses) help ensure that the result of optimization is actually useful (and not just a flaw in our problem encoding).
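As a quick sanity check of the positive semi-definite property, here is a small numpy sketch (the Gaussian kernel and the random points are just illustrative choices): the Gram matrix built from a genuine kernel on any finite point set should have no significantly negative eigenvalues.

  import numpy as np

  # Gram matrix sanity check: for a real kernel, K[i, j] = k(u_i, u_j)
  # is symmetric positive semi-definite, so its eigenvalues are >= 0
  # (up to floating point noise).
  def gaussian_kernel(u, v, c=1.0):
      return np.exp(-c * np.sum((u - v) ** 2))

  rng = np.random.default_rng(0)
  pts = rng.normal(size=(20, 2))
  K = np.array([[gaussian_kernel(a, b) for b in pts] for a in pts])
  print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True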
And this brings us to the major limitation of kernels. The phi() transform can be arbitrarily magic, except that when transforming one vector it doesn’t know what the other vector is, and phi() doesn’t even know which side of the inner product it is encoding. That is, kernels are not as powerful as any of the following forms:
  • snooping (knowing the other vector): k(u,v) = < phi(u,v), phi(v,u) >
  • positional (knowing which side of the inner product it is mapping to): k(u,v) = < phi1(u), phi2(v) >
  • fully general: k(u,v) = f(u,v) for an arbitrary function f(.,.)
Not everything is a kernel.
Some non-kernels are:
  1. k(u,v) = -c for any c>0
  2. k(u,v) = ||u-v||
    This can be shown as follows. Suppose k(u,v) is a kernel with k(u,u) = 0 for all u (which is necessary to match ||u-v||, since ||u-u|| = 0 for all u). By the definition
    of kernels, k(u,u) = < phi(u), phi(u) > for some real vector valued function phi(.). But, by the properties of the real inner product <.,.>, this means phi(u) is
    the zero vector for all u. So k(u,v) = 0 for all u,v, and does not match ||u-v|| for any u,v such that u ≠ v.
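A quick numeric illustration of the same fact (the three sample points are ours): the Gram matrix of ||u-v|| has a negative eigenvalue, so no phi(.) can produce it.

  import numpy as np

  # k(u, v) = ||u - v|| yields Gram matrices with negative eigenvalues,
  # so it cannot be an inner product < phi(u), phi(v) > in any space.
  pts = np.array([[0.0], [1.0], [2.0]])
  K = np.array([[np.linalg.norm(a - b) for b in pts] for a in pts])
  print(np.linalg.eigvalsh(K))   # the smallest eigenvalue is negative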
There are some obvious kernels:
  1. k(u,v) = c for any c ≥ 0 (non-negative constant kernels)
  2. k(u,v) = < u , v > (the identity kernel)
  3. k(u,v) = f(u) f(v) for any real valued function f(.)
  4. k(u,v) = < f(u) , f(v) > for any real vector valued function f(.) (again, the definition of a kernel)
  5. k(u,v) = transpose(u) B v where B is any symmetric positive semi-definite matrix
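As a small illustration of form 5 (the matrix A and the test vectors below are our own example), note that any B = transpose(A) A is symmetric positive semi-definite, and then transpose(u) B v = < A u, A v >, so the feature map is simply phi(z) = A z:

  import numpy as np

  # Form 5: k(u, v) = transpose(u) B v with B symmetric positive semi-definite.
  rng = np.random.default_rng(1)
  A = rng.normal(size=(3, 2))
  B = A.T @ A                      # guaranteed symmetric PSD

  u = np.array([1.0, -2.0])
  v = np.array([0.5, 3.0])
  assert np.isclose(u @ B @ v, (A @ u) @ (A @ v))   # phi(z) = A z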
And there are several subtle ways to build new kernels from old. If q(.,.) and r(.,.) are kernels then so are:
  1. k(u,v) = q(u,v) + r(u,v)
  2. k(u,v) = c q(u,v) for any c ≥ 0
  3. k(u,v) = q(u,v) r(u,v)
  4. k(u,v) = q(f(u),f(v)) for any real vector valued function f(.)
  5. k(u,v) = lim_{n -> infinity} q_n(u,v) where q_n(u,v) is a sequence of kernels and the (pointwise) limit exists.
  6. k(u,v) = p(q(u,v)) where p(.) is any polynomial with all non-negative terms.
  7. k(u,v) = f(q(u,v)) where f(.) is any function with an absolutely convergent Taylor series with all non-negative terms.
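One nice thing about the first three closure rules is that they can be written directly as higher-order functions wrapping existing kernel routines; here is a minimal Python sketch (the example base kernels are our own choice):

  import numpy as np

  # Closure rules as higher-order functions: given routines q and r that
  # compute kernels, build the sum, scaled, and product kernels.
  def k_sum(q, r):
      return lambda u, v: q(u, v) + r(u, v)

  def k_scale(q, c):
      # only valid for c >= 0
      return lambda u, v: c * q(u, v)

  def k_prod(q, r):
      return lambda u, v: q(u, v) * r(u, v)

  def linear(u, v):
      return float(np.dot(u, v))

  def gaussian(u, v, c=1.0):
      return float(np.exp(-c * np.sum((u - v) ** 2)))

  # A new kernel assembled from the rules above.
  k = k_sum(k_scale(linear, 2.0), k_prod(linear, gaussian))
  print(k(np.array([1.0, 0.5]), np.array([0.25, 1.0])))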
Most of these facts are taken from the excellent book: John Shawe-Taylor and Nello Cristianini’s “Kernel Methods for Pattern Analysis”, Cambridge 2004. We are allowing
kernels of the form k(u,v) = < phi(u), phi(v) > where phi(.) is mapping into an infinite dimensional vector space (like a series). Most of these facts can be checked by
imagining how to alter the phi(.) function. For example, to add two kernels just build a larger vector with enough slots for all of the coordinates of the phi(.) vectors of the two kernels
you are trying to add (i.e., concatenate the two phi(.) vectors). To scale a kernel by c, multiply all coordinates of phi() by sqrt(c).
Multiplying two kernels is the trickiest. Without loss of generality assume q(u,v) = < f(u), f(v) > and r(u,v) = < g(u), g(v) >, where f(.) and g(.) both map into the same finite dimensional vector space R^m (any other situation can be simulated or approximated by padding with zeros and/or taking limits). Imagine a new vector function p(.) that maps into R^{m*m} such that p(z)_{m*(i-1) + j} = f(z)_i g(z)_j. It is easy to check that k(u,v) = < p(u), p(v) > is a kernel and that k(u,v) = q(u,v) r(u,v). (The reference proof uses tensor notation and the Schur product, but these are big hammers mathematicians use when they don’t want to mess around with re-encoding indices.)
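Here is a small numpy check of the product construction (the particular f(.), g(.) and test points are our own, just to exercise the index bookkeeping; as noted below you never actually need to build p(.) in practice):

  import numpy as np

  # Product construction check: with q(u,v) = < f(u), f(v) > and
  # r(u,v) = < g(u), g(v) >, the outer-product feature map
  # p(z)_{m*(i-1) + j} = f(z)_i g(z)_j satisfies < p(u), p(v) > = q(u,v) r(u,v).
  def f(z):
      return np.array([z[0], z[1], z[0] * z[1]])

  def g(z):
      return np.array([1.0, z[0] - z[1], z[0] ** 2])

  def p(z):
      return np.outer(f(z), g(z)).ravel()   # flatten R^{m x m} into R^{m*m}

  u = np.array([1.0, 2.0])
  v = np.array([-0.5, 3.0])
  assert np.isclose(p(u) @ p(v), (f(u) @ f(v)) * (g(u) @ g(v)))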
Note that one of the attractions of kernel methods is that you never have to actually implement any of the above constructions. What you do is think in terms of sub-routines (easy for computer scientists, unpleasant for mathematicians). For instance: if you had access to two functions q(.,.) and r(.,.) that claim to be kernels and you wanted the product kernel, then when asked to evaluate the kernel you would just get the results from the two sub-kernels and multiply them (so you never need to see the space phi(.) is implicitly working in).
This implicitness can be important (though far too much is made of it). For example the Gaussians we have been using throughout are kernels, but kernels of infinite dimension, so we can not directly represent the space they live in. To see that the Gaussian is a kernel notice the following:

exp(-c ||u-v||^2) = exp(2 c < u , v >) exp(-c < u , u >) exp(-c < v , v >)

And for all c ≥ 0 each of the three terms on the right is a kernel (the first because the Taylor series of exp() is absolutely convergent with non-negative coefficients, and the second two are of the f(u) f(v) form we listed as an obvious kernel). The trick looks like magic, but the idea is just that Euclidean squared distance breaks nicely into dot products ( ||u-v||^2 = < u , u > + < v , v > – 2 < u , v > ) and exp() converts addition into multiplication. It is rather remarkable that kernels (which are a generalization of inner products, and inner products induce half-spaces) can encode bounded concepts (more on this later when we discuss the VC dimension of the Gaussian).
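A quick numeric check of this factorization (the value of c and the test vectors are arbitrary):

  import numpy as np

  # Check: exp(-c ||u-v||^2) = exp(2c <u,v>) * exp(-c <u,u>) * exp(-c <v,v>)
  c = 0.7
  u = np.array([1.0, -2.0, 0.5])
  v = np.array([0.3, 0.8, -1.0])
  lhs = np.exp(-c * np.sum((u - v) ** 2))
  rhs = np.exp(2 * c * (u @ v)) * np.exp(-c * (u @ u)) * np.exp(-c * (v @ v))
  assert np.isclose(lhs, rhs)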
The great benefit of the support vector machine is that with access only to the data labels and the kernel function (in fact only the kernel function evaluated at pairs of training datums) the support vector machine can quickly solve for the optimal margin and data weights achieving this margin.
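For example, scikit-learn’s SVC will accept the kernel evaluated at pairs of training points (the Gram matrix) directly via kernel="precomputed"; the synthetic data and the gaussian_gram helper below are our own illustrative choices:

  import numpy as np
  from sklearn.svm import SVC

  # The SVM never needs raw coordinates, only the Gram matrix of the kernel
  # evaluated at pairs of training points.
  rng = np.random.default_rng(2)
  X = rng.normal(size=(40, 2))
  y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)   # a toy circular concept

  def gaussian_gram(A, B, c=1.0):
      # Gram[i, j] = exp(-c * ||A_i - B_j||^2)
      d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
      return np.exp(-c * d2)

  model = SVC(kernel="precomputed").fit(gaussian_gram(X, X), y)
  print(len(model.support_))   # number of support vectors found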
For fun let’s plug in a new kernel, called the cosine kernel:

k(u,v) = ( < u , v > + c ) / sqrt( ( < u , u > + c ) ( < v , v > + c ) )    (c ≥ 0).
This kernel can be thought of as having a phi(.) function that takes a vector z and adds an extra coordinate of sqrt(c) and then projects the resulting vector onto the unit sphere. This kernel induces concepts that look like parabolas, as seen in figure 1.
Figure 1: Cosine example concept
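Here is a quick numpy check that a phi(.) of this form (append sqrt(c), then project onto the unit sphere) matches the normalized inner-product formula above (c and the test points are arbitrary choices):

  import numpy as np

  # The cosine kernel's phi(.): append sqrt(c) as an extra coordinate,
  # then project the result onto the unit sphere.
  c = 2.0

  def phi(z):
      w = np.append(z, np.sqrt(c))
      return w / np.linalg.norm(w)

  def cosine_kernel(u, v):
      return (u @ v + c) / np.sqrt((u @ u + c) * (v @ v + c))

  u = np.array([1.0, -2.0])
  v = np.array([0.5, 3.0])
  assert np.isclose(phi(u) @ phi(v), cosine_kernel(u, v))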
Figure 2 shows the sum-concept model (add all discount functions with the same weight). In this case it is far too broad: averaging the cosine concepts (which, unlike the Gaussians, are not bounded concepts) creates an overly wide model. The model not only generalizes poorly (fails to match the shape of the truth diagram), it also gets some of its own training data wrong.
Figure 2: Cosine kernel sum model
And figure 3 shows the excellent fit returned by the support vector machine. It is in fact an excellent fit determined by 4 support vectors (indicated by larger labels). Also notice that the support vector machine, using unbounded concepts, generated a very good unbounded model (unbounded in the sense that both the blue and red regions are infinite). By changing the kernel or discount functions/concepts we changed the inductive bias. So any knowledge of what sort of model we want (one class bounded or not) should greatly influence our choice of kernel functions (since the support vector machine can only pick weights for the kernel functions, not tune their shapes or bandwidths).
Figure 3: Cosine Support Vector model
Another family of kernels (which I consider afflictions) are the “something for nothing” kernels. These are kernels of the form:

k(u,v) = ( 1 + < u , v > )^2

or higher powers or other finishing functions. The idea is that if you were to look at the expansion of these kernels you would see lots of higher order terms (powers of u_i and v_j), and if these terms were available to the support vector machine it could use them. It is further claimed that in addition to getting new features for free you don’t use up degrees of freedom exploiting them (so you cross-validate as well as you would for simpler models). Both these claims are fallacious: you can’t fully use the higher order terms because they are entangled with other terms that are not orthogonal to the outcome, and the complexity of a kernel is not quite as simple as degrees of freedom (proofs about support vector machines are stated in terms of margin, not in terms of degrees of freedom or even in terms of VC dimension). The optimizer in the SVM does try hard to make a good hypothesis using the higher order terms, but due to the forced relations among the terms you get best fits like figure 4.
Figure 4: Squared magic kernel support vector model
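To see the entanglement concretely, here is a small sketch for the two-dimensional squared kernel (the explicit feature map is just the algebraic expansion; the test points are ours): the higher order terms only appear with hard-wired sqrt(2) weightings relative to each other, not as freely weighted coordinates.

  import numpy as np

  # The squared "magic" kernel in two dimensions is the inner product of a
  # fixed feature map: the cross and square terms come with forced weights.
  def magic_kernel(u, v):
      return (1.0 + u @ v) ** 2

  def phi(z):
      x, y = z
      return np.array([1.0,
                       np.sqrt(2) * x, np.sqrt(2) * y,
                       x * x, y * y,
                       np.sqrt(2) * x * y])

  u = np.array([1.0, -2.0])
  v = np.array([0.5, 3.0])
  assert np.isclose(magic_kernel(u, v), phi(u) @ phi(v))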
If you want higher order terms, I feel you are much better off performing a primal transform so the terms are available in their most general form. That is, re-encode the vector u = (x,y) as the larger vector (x, y, x*y, x*x, y*y). You have to be honest: if you are trying to fit more complicated functions you are searching a larger hypothesis space, so you need more data to falsify the bad hypotheses. You can be efficient (don’t add terms you don’t think you can use, as they make generalization harder) but you can’t get something for nothing (even with kernel methods).
credit: John Mount