Tip of the Day

Do not go where the path may lead, go instead where there is no path and leave a trail.

Question on Quora: How can overfitting be avoided in neural networks?

Early stopping
A number of techniques have been developed to further improve ANN generalization capabilities, including different variants of cross-validation (Haykin, 1999), noise injection (Holmstrom and Koistinen, 1992), error regularization, weight decay (Poggio and Girosi, 1990; Haykin, 1999) and the optimized approximation algorithm (Liu et al., 2008).
A number of cross-validation variants exist; some, such as multifold cross-validation or leave-one-out, deserve special attention when data are very scarce (Haykin, 1999). But probably the most popular in practical applications (Liu et al., 2008) is so-called early stopping. To use the early stopping approach, apart from the training data set (i.e. the data used during optimization) and the testing set (not presented to the model during optimization), a validation set is required to define the stopping criteria of the optimization algorithm. ANN learning terminates when the error computed on the validation data begins to increase, even though it often continues to decrease on the training data set. When the validation error increases while the training error decreases, the network is considered to be fitting the noise present in the data instead of the signal; in other words, overfitting. However, it is not easy to decide exactly when to stop training. Stopping immediately, the first time the validation error increases during the optimization process, usually results in under-fitting of the ANN. Prechelt (1998) proposed three classes of stopping criteria to avoid both under-fitting and the subjectivity of deciding exactly when to stop training.
Early stopping is one of the methods used in the present paper, owing to its popularity and to the large amount of available data, which makes application of the other cross-validation-based methods less relevant. The experience of the authors (Rowinski and Piotrowski, 2008; Piotrowski and Napiorkowski, 2011) suggests adopting the simplest, so-called Generalization Loss class proposed by Prechelt (1998) and terminating training when the validation error exceeds its previously noted minimum value by 20%. The previously noted solution with the lowest objective function value on the validation set is then taken as the optimal one.
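As an illustration, here is a minimal, framework-agnostic Python sketch of the Generalization Loss rule described above; `train_step`, `val_error_fn` and the `get_weights()`/`set_weights()` methods are hypothetical placeholders for the reader's own training and evaluation code.

```python
import copy

def train_with_early_stopping(model, train_step, val_error_fn,
                              max_epochs=1000, gl_threshold=0.20):
    """Generalization Loss early stopping: halt when the validation error
    exceeds its running minimum by more than `gl_threshold` (20% here)."""
    best_val_error = float("inf")
    best_weights = None

    for epoch in range(max_epochs):
        train_step(model)                      # one pass over the training set
        val_error = val_error_fn(model)        # error on the validation set

        if val_error < best_val_error:         # remember the best solution so far
            best_val_error = val_error
            best_weights = copy.deepcopy(model.get_weights())

        # Generalization loss: relative growth of the validation error
        # above its minimum observed value.
        generalization_loss = val_error / best_val_error - 1.0
        if generalization_loss > gl_threshold:
            break                              # grew by more than 20% -> stop

    model.set_weights(best_weights)            # restore the best validation solution
    return model
```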
Noise injection
Different techniques for injecting noise into the data during ANN optimization have been discussed in a number of papers (Holmstrom and Koistinen, 1992; Grandvalet et al., 1997; Skurichina et al., 2000; Brown et al., 2003; Seghouane et al., 2004). In practical applications, the methodological variants of adding Gaussian noise to the input data proposed by Holmstrom and Koistinen (1992) have become the most popular. Noise injection was found to improve ANN generalization ability (Sietsma and Dow, 1991; An, 1996), especially in the case of classification problems with small data samples (Hua et al., 2006; Zur et al., 2009). The close similarities between noise injection and other methods designed to improve the generalization properties of ANNs, including error regularization, were studied theoretically by Reed et al. (1995) and confirmed empirically by Zur et al. (2009). As a result, regularization methods are not considered in the present comparative study.
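To make the idea concrete, the following NumPy sketch adds fresh zero-mean Gaussian noise to the inputs of every training batch; the noise level `sigma` and the batching scheme are illustrative assumptions, not part of any of the cited methods.

```python
import numpy as np

def noisy_batches(X, y, batch_size=32, sigma=0.05, rng=None):
    """Yield shuffled training batches with zero-mean Gaussian noise added to the inputs.

    `sigma` (the noise standard deviation) is a tunable hyper-parameter; new noise
    is drawn at every epoch, so the network never sees exactly the same inputs twice.
    """
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        X_noisy = X[batch] + rng.normal(0.0, sigma, size=X[batch].shape)
        yield X_noisy, y[batch]

# Usage sketch: at each epoch, train on the perturbed batches instead of the raw data.
# for X_b, y_b in noisy_batches(X_train, y_train, sigma=0.05):
#     train_step(X_b, y_b)
```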
Optimized approximation algorithm
Optimized approximation algorithm (OAA) (Liu et al., 2008) is a recently proposed method, conceptually quite different from the ones described above. The OAA was developed to stop ANN training without using a validation set or disturbing the measured data in any way. The stopping criterion is based on the relation between an easily computable coefficient called the signal-to-noise-ratio figure (SNRF), determined from the modelling errors at each iteration, and an SNRF threshold value determined by the sample size N only (SNRF_N). The method was introduced for one-dimensional approximation of continuous functions and then extended to the more practical multidimensional case.
Dropout Regularization
Dropout layers provide a simple way to avoid overfitting. https://www.cs.toronto.edu/~hint... The primary idea is to randomly drop components of the neural network (layer outputs) during training. Because any neuron may be dropped on a given pass, the remaining neurons in each layer are forced to learn more robust, redundant representations of the data rather than relying on specific co-adapted units.
The true strength of dropout comes when we have multiple layers and many neurons in each layer. For a simple case, if a network has 2 layers with 4 neurons in each layer and we drop half the neurons in every layer, then the training process is in effect making 4C2 x 4C2 = 36 different sub-models learn the same relation, and prediction amounts to averaging over those 36 models. The strength comes from the fact that with h hidden layers of N neurons each we end up with (NC(N/2))^h sub-models learning the relation between data and target, which has the effect of taking an ensemble over (NC(N/2))^h models. For a 2-layer model with 100 neurons in each layer, this is (100C50)^2, on the order of 10^58 possible sub-models, far more than could ever be trained as an explicit ensemble.
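A minimal PyTorch sketch of the idea (assuming PyTorch; the layer sizes and dropout rate are arbitrary): `nn.Dropout` randomly zeroes activations during training and is switched off automatically in evaluation mode, which approximates averaging over the implicit ensemble of sub-networks.

```python
import torch
import torch.nn as nn

class DropoutMLP(nn.Module):
    """Two hidden layers of 100 units each, with 50% dropout after every hidden layer."""
    def __init__(self, n_inputs, n_outputs, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 100), nn.ReLU(), nn.Dropout(p),
            nn.Linear(100, 100), nn.ReLU(), nn.Dropout(p),
            nn.Linear(100, n_outputs),
        )

    def forward(self, x):
        return self.net(x)

model = DropoutMLP(n_inputs=20, n_outputs=2)
model.train()   # dropout active: each forward pass samples a different sub-network
model.eval()    # dropout off: the full network approximates the ensemble average
```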
Classic way
The "classic" way to avoid overfitting is to divide your data sets into three groups -- a training set, a test set, and a validation set. You find the coefficients using the training set; you find the best form of the equation using the test set, test for over-fitting using the validation set. Be careful not to use the validation set until after you have picked the best form of fit. See: http://en.wikipedia.org/wiki/Test_set
Regularization (Normal)
Regularization modifies the objective function that we minimize by adding terms that penalize large weights. In other words, we change the objective function so that it becomes Error + λf(θ), where f(θ) grows larger as the components of θ grow larger and λ is the regularization strength (a hyper-parameter of the learning algorithm). The value we choose for λ determines how much we want to protect against overfitting. Setting λ = 0 means we take no measures against the possibility of overfitting. If λ is too large, our model will prioritize keeping θ as small as possible over finding parameter values that perform well on the training set. Choosing λ is therefore an important task and can require some trial and error.
The most common type of regularization is L2 regularization. It can be implemented by augmenting the error function with the squared magnitude of all weights in the neural network: for every weight w in the network, we add 1/2 λw^2 to the error function. L2 regularization has the intuitive interpretation of heavily penalizing "peaky" weight vectors and preferring diffuse weight vectors. This has the appealing property of encouraging the network to use all of its inputs a little rather than using only some of its inputs a lot. Of particular note is that during the gradient descent update, L2 regularization ultimately means that every weight is decayed linearly toward zero; because of this, L2 regularization is also commonly referred to as weight decay. Another common type of regularization is L1 regularization. Here, we add the term λ|w| for every weight w in the neural network. L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with L1 regularization end up using only a small subset of their most important inputs and become quite resistant to noise in the inputs.
In comparison, weight vectors from L2 regularization are usually diffuse, small numbers. L1 regularization is very useful when you want to understand exactly which features are contributing to a decision. If this level of feature analysis isn't necessary, we prefer L2 regularization because it empirically performs better.
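As a sketch of how the penalty terms enter the objective (written in PyTorch; the lambda values are illustrative only):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def regularized_loss(model, base_loss, l2_lambda=1e-4, l1_lambda=0.0):
    """Augment a task loss with L2 (1/2 * lambda * w^2) and/or L1 (lambda * |w|) penalties
    summed over all trainable parameters."""
    l2_term = sum((w ** 2).sum() for w in model.parameters())
    l1_term = sum(w.abs().sum() for w in model.parameters())
    return base_loss + 0.5 * l2_lambda * l2_term + l1_lambda * l1_term

# Example: penalizing the weights of a small model on a dummy regression batch.
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = regularized_loss(model, F.mse_loss(model(x), y), l2_lambda=1e-4)
loss.backward()
```

In practice the pure L2 case is usually delegated to the optimizer itself, e.g. the `weight_decay` argument of `torch.optim.SGD`, which implements the same linear decay toward zero described above.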
Max norm constraints
Max norm constraints have a similar goal of keeping θ from becoming too large, but they do this more directly. Max norm constraints enforce an absolute upper bound on the magnitude of the incoming weight vector of every neuron and use projected gradient descent to enforce the constraint. In other words, any time a gradient descent step moves the incoming weight vector such that ||w||_2 > c, we project the vector back onto the ball (centered at the origin) of radius c. One of the nice properties is that the parameter vector cannot grow out of control (even if the learning rate is too high), because the updates to the weights are always bounded.
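A hedged PyTorch sketch of the projection step, applied after each optimizer update (the bound c = 3.0 is an arbitrary choice); `Tensor.renorm` clips the L2 norm of each row of a `Linear` layer's weight matrix, i.e. each neuron's incoming weight vector:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def apply_max_norm(model, c=3.0):
    """Project each neuron's incoming weight vector back onto the L2 ball of radius c."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Rows of `weight` are the incoming weight vectors; clip each row's L2 norm at c.
            module.weight.data = module.weight.data.renorm(p=2, dim=0, maxnorm=c)

# Typical use inside a training loop (sketch):
#   loss.backward()
#   optimizer.step()
#   apply_max_norm(model, c=3.0)
```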
Note:
  1. With early stopping, the choice of the validation set is also important: the validation set should be representative of all points in the training set. When you use Bayesian regularization, it is important to train the network until it reaches convergence; the sum-squared error, the sum-squared weights, and the effective number of parameters should all reach constant values once the network has converged.
  2. With both early stopping and regularization, it is a good idea to train the network starting from several different initial conditions. It is possible for either method to fail in certain circumstances. By testing several different initial conditions, you can verify robust network performance.
  3. When the data set is small and you are training function approximation networks, Bayesian regularization provides better generalization performance than early stopping. This is because Bayesian regularization does not require that a validation data set be separate from the training data set; it uses all the data.

Himanshu Rai
