Tips and tricks for Neural Networks

Training neural networks is a complex procedure. Many variables interact with each other, and it is often unclear beforehand what will work.

An overview of the techniques. The list is available on Notion here.

The following selection of tips aims to make things easier for you. It’s not a must-do list but should be seen as an inspiration. You know the task at hand and can thus best select from the following techniques. They cover a wide area: from augmentation to selecting hyperparameters; many topics are touched upon. Use this selection as a starting point for future research.

Overfit a single batch

Use this technique to test your network’s capacity. First, take a single data batch, and make sure that it is labelled correctly (if labels are used). Then, repeatedly fit this single batch until the loss converges. If you do not reach perfect accuracy (or similar metrics), you should look at your data. Simply using a larger network is usually not the solution.
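A minimal sketch of this check in Keras could look as follows (model and train_ds are placeholders for your own compiled model and tf.data dataset):

```python
import tensorflow as tf

# Placeholder objects: replace `model` and `train_ds` with your own
# compiled model and dataset.
# single_batch repeats the very first batch over and over again.
single_batch = train_ds.take(1).cache().repeat()

# With steps_per_epoch=1, every "epoch" trains on exactly the same batch.
history = model.fit(single_batch, epochs=300, steps_per_epoch=1, verbose=0)

# The loss should approach zero; if it does not, inspect the data and labels.
print(history.history["loss"][-1])
```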

Increase the number of epochs by a significant factor

Often, you benefit from running your algorithm for a larger number of steps. If you can afford to train longer, scale the number of epochs from, e.g., 100 to, say, 500. If you observe a benefit from the longer training time, you can then settle on a more sensible value in between.

Set seeds

To ensure reproducibility, set the seeds of any random-number generating operation. For example, if you work with TensorFlow, you can utilize the following snippet:
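One possible version of such a snippet seeds Python, NumPy, and TensorFlow; the exact set of seeds you need depends on the libraries in your stack:

```python
import os
import random

import numpy as np
import tensorflow as tf


def set_seeds(seed: int = 42):
    # Python, NumPy, and TensorFlow each maintain their own random state.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)


set_seeds(42)
```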

Rebalance the dataset

An imbalanced dataset has one or more major classes that make up a large part of the dataset. Conversely, one or more minor classes contribute only a few samples. If you are working on data with similar characteristics, consider rebalancing your dataset. Recommended techniques are oversampling the minority classes, undersampling the majority classes, collecting additional samples (if possible), and generating artificial data with augmentation.
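As a sketch of oversampling with tf.data (the two per-class datasets below are toy placeholders; sample_from_datasets requires a recent TensorFlow version, older releases expose it under tf.data.experimental):

```python
import tensorflow as tf

# Two placeholder datasets, one per class; in practice you would filter
# your full dataset by label or build them from separate files.
majority_ds = tf.data.Dataset.from_tensor_slices(([[0.1]] * 900, [0] * 900))
minority_ds = tf.data.Dataset.from_tensor_slices(([[0.9]] * 100, [1] * 100))

# Oversample the minority class by drawing from both datasets with equal
# probability.
balanced_ds = tf.data.Dataset.sample_from_datasets(
    [majority_ds.repeat(), minority_ds.repeat()], weights=[0.5, 0.5]
).batch(32)
```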

Use a neutral class

Consider the following case: You have a dataset of two classes, “healthy” and “not healthy.” The samples are hand-labelled by domain experts. If one of them is unsure about the appropriate label, they might assign no label at all or one with little confidence. In this case, it’s a good idea to introduce a third, neutral class. This additional class represents the “I am not sure” label. During training, you can exclude this data. Afterwards, you can let the network pre-label these vague samples and show them to the domain experts.

Set the bias of the output layer

For unbalanced datasets, the initial guesses of the network inevitably fall short. Even though the network learns to account for this, setting a better bias at model creation can reduce training times. For a sigmoid output layer (with two classes), the bias can be calculated from the number of positive and negative samples:

bias = log(pos / neg)

When you create the model, you set this value as the initial bias.
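A minimal sketch in Keras, assuming hypothetical class counts pos and neg:

```python
import numpy as np
import tensorflow as tf

# Hypothetical class counts; replace with the statistics of your dataset.
pos, neg = 100, 900
initial_bias = np.log(pos / neg)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    # The sigmoid output neuron starts with a bias that matches the
    # class ratio instead of assuming a 50/50 split.
    tf.keras.layers.Dense(
        1,
        activation="sigmoid",
        bias_initializer=tf.keras.initializers.Constant(initial_bias),
    ),
])
```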

Overfit for physical simulations

To simulate the movement of fluids, one often uses special software. In complicated interactions (e.g., water flowing over uneven ground), it can take a long time to see results. Neural networks can be of help here. Since the simulations follow the laws of physics, there is zero chance that anything magical will happen; it only takes effort to calculate the outcome.

The network can learn this physical simulation. Because the laws are well-defined, we “only” require the network to overfit. We don’t expect any unseen test samples because they have to follow the same rules. In this case, it’s thus helpful to overfit the training data; often, no test data is even required. Once the network is trained, we use it to replace the slow simulator.

Tune the learning rate

If there is one hyperparameter worth tuning, then primarily focus on the learning rate. The following plot shows the effect of a learning rate set too high:

With a learning rate set too high, the model does not converge.

In contrast, using a different, smaller learning rate, the development is as desired:

With a better-suited learning rate, the training converges.
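A simple way to find a suitable value is a small sweep over candidate learning rates; in the sketch below, build_model, train_ds, and val_ds are placeholders for your own model factory and datasets:

```python
import tensorflow as tf

# Try a few learning rates spanning several orders of magnitude and compare
# the resulting validation losses.
for lr in [1e-1, 1e-2, 1e-3, 1e-4]:
    model = build_model()  # placeholder: returns a fresh, uncompiled model
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    history = model.fit(train_ds, validation_data=val_ds, epochs=5, verbose=0)
    print(lr, history.history["val_loss"][-1])
```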

Use fast data pipelines

For small projects, I often use a custom generator. When I work on larger projects, I usually replace it with a dedicated dataset mechanism. In the case of TensorFlow, this is the tf.data API. It includes all required methods like shuffling, batching, and prefetching. Relying on code written by many experts, as opposed to a custom solution, gives me time for the actual task.
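A minimal tf.data pipeline with shuffling, batching, and prefetching might look like this (the in-memory toy tensors stand in for your real data source):

```python
import tensorflow as tf

# Toy in-memory data; in practice the source is often TFRecord files or
# a generator.
features = tf.random.normal((1000, 32))
labels = tf.random.uniform((1000,), maxval=10, dtype=tf.int32)

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1000)      # randomize the sample order each epoch
    .batch(64)                      # group samples into batches
    .prefetch(tf.data.AUTOTUNE)     # overlap data preparation and training
)
```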

Use data augmentation

Augment your training data to create a robust network, increase the dataset size, or oversample minor classes. These benefits come at the cost of increased training times, especially if the augmentation is done on the CPU. If you can move it to the GPU, you will see results faster.
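One way to run augmentation on the GPU is to express it as Keras preprocessing layers inside the model (available under tf.keras.layers in recent TensorFlow versions); a sketch:

```python
import tensorflow as tf

# Augmentation expressed as Keras layers; as part of the model, it runs on
# the same device as the rest of the forward pass (typically the GPU).
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    data_augmentation,  # active only during training
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```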

Use AutoEncoders to extract embeddings

If your labelled dataset is relatively small, you can still employ some tricks. One of them is training a separate AutoEncoder. The background is that it is easier to collect additional unlabelled data than to label it. You then train the AutoEncoder with a sufficiently sized latent space (e.g., 300 to 600 entries) to achieve a reasonably low reconstruction loss. To obtain embeddings for your actual data, you discard the decoder network. You can then use the remaining encoder network to generate embeddings. It’s your decision whether you add this encoder to your primary network or only use it to extract embeddings.
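A minimal sketch of this idea with a dense AutoEncoder (the input size, latent size, and placeholder datasets are hypothetical):

```python
import tensorflow as tf

latent_dim = 300  # size of the embedding space; an example choice

# Encoder and decoder for flat feature vectors of length 784.
inputs = tf.keras.layers.Input(shape=(784,))
encoded = tf.keras.layers.Dense(latent_dim, activation="relu")(inputs)
decoded = tf.keras.layers.Dense(784, activation="sigmoid")(encoded)

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(unlabelled_data, unlabelled_data, epochs=10)  # placeholder data

# After training, keep only the encoder to turn samples into embeddings.
encoder = tf.keras.Model(inputs, encoded)
# embeddings = encoder.predict(labelled_data)  # placeholder data
```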

Use embeddings from other models

Rather than learning embeddings for your data from scratch, you can use embeddings learned by other models. This approach is related to the technique proposed above. For text data, it’s common to download pre-trained embeddings. For images, you can use large networks that were trained on ImageNet. Choose a suitable layer, cut off everything after it, and use its output as the embedding.
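For images, a sketch of this could use a pre-trained ResNet50 without its classification head (the input batch below is a random placeholder):

```python
import tensorflow as tf

# A ResNet50 trained on ImageNet, without its classification head; the
# pooled output of the last convolutional block serves as the embedding.
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", pooling="avg"
)
base.trainable = False

images = tf.random.uniform((4, 224, 224, 3), maxval=255.0)  # placeholder batch
embeddings = base(tf.keras.applications.resnet50.preprocess_input(images))
print(embeddings.shape)  # (4, 2048)
```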

Use embeddings to shrink data

Let’s assume our data points all have a categorical feature. In the beginning, it can take two possible values, so a one-hot encoding has two indices. But once this grows to 1000 or more possible values, a sparse one-hot encoding is no longer efficient. Because they can represent such data in a lower-dimensional space, embeddings are useful here. The embedding layer takes the categorical value (ranging from 0 to 999 in our case) and outputs a vector of floating-point numbers, the embedding. This representation is learned during training and serves as input to subsequent network layers.
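A minimal sketch of such an embedding layer in Keras (the surrounding model and the embedding size are hypothetical):

```python
import tensorflow as tf

num_categories = 1000  # number of distinct categorical values
embedding_dim = 16     # size of the learned representation

model = tf.keras.Sequential([
    # Maps each integer index in [0, 1000) to a dense 16-dimensional vector.
    tf.keras.layers.Embedding(input_dim=num_categories, output_dim=embedding_dim),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

example = tf.constant([[42], [7]])  # two samples, one categorical feature each
print(model(example).shape)         # (2, 1)
```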

Use checkpoints

Nothing is more frustrating than running an expensive training algorithm for countless hours and then seeing it crash. Sometimes, it might be a hardware failure, but often it’s a code issue that you only see at the training’s end. While you can never expect only perfect runs, you can nonetheless prepare by saving checkpoints. In their basic form, these checkpoints store the model’s weights every k steps. You can also expand them to keep the optimizer state, the current epoch, and any other crucial information. Then, at the start of a training run, you check for any artefacts of a failed run and restore all necessary settings. This works very well in combination with custom training loops.
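A sketch of such checkpoints with TensorFlow’s tf.train.Checkpoint and CheckpointManager (model, optimizer, and the tracked epoch counter are placeholders):

```python
import tensorflow as tf

# Placeholder model and optimizer; the checkpoint can track any trackable
# object, including counters such as the current epoch.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()
epoch = tf.Variable(0)

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer, epoch=epoch)
manager = tf.train.CheckpointManager(ckpt, directory="./checkpoints", max_to_keep=3)

# At startup: restore the latest checkpoint if a previous run left one behind.
ckpt.restore(manager.latest_checkpoint)

# Inside the training loop, save every k steps/epochs:
manager.save()
```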

Write custom training loops

In most cases, using the default training routines, such as model.fit(…) in TensorFlow, is sufficient. However, what I have often noticed is limited flexibility. Some minor changes might be easy to incorporate, but major modifications are hard to implement.

That’s why I generally propose writing custom algorithms. At first, it might sound daunting, but extensive tutorials are available to help you get started. The first few times you follow this approach, you might be temporarily slowed down. But once you have experience, you are rewarded with greater flexibility and understanding. Furthermore, this knowledge allows you to modify your algorithm quickly, integrating your newest ideas.
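As a rough sketch, a custom training step in TensorFlow typically wraps the forward pass in a GradientTape (model and train_ds are placeholders for your own objects):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()


@tf.function
def train_step(model, x, y):
    # Record the forward pass to compute gradients afterwards.
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_fn(y, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss


# for epoch in range(epochs):
#     for x, y in train_ds:
#         loss = train_step(model, x, y)
```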

If you are looking for more information and code related to custom training loops in TensorFlow, then have a look at my post here.

Set the hyperparameters appropriately

Modern GPUs excel at matrix operations, which is why they are widely used to train large networks. By choosing hyperparameters appropriately, you can further improve the algorithms’ efficiency. For Nvidia GPUs (which are the prominent accelerators in use today), you can use the following guidelines:

  • Choose the batch size to be divisible by 4 or greater multiples of 2
  • For dense layers, set the input size (from the previous layer) and output size to be divisible by 64 or more
  • For convolutional layers, set the input and output channels to be divisible by 4 or greater multiples of 2
  • For convolutional layers, choose the input and output sizes to be divisible by 64 or more
  • Pad image inputs from 3 (RGB) to 4 channels
  • Use a batch size x height x width x channels layout
  • For recurrent layers, set the batch size and hidden size to be divisible by at least 4, ideally by any of 64, 128, or 256
  • For recurrent layers, choose large batch sizes

These suggestions follow the idea of distributing the data more evenly. Mainly, you achieve this by choosing the values to be multiples of 2. The larger you set this number, the more efficiently your hardware runs. You can find more information here, here, and here.
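For illustration, a hypothetical Keras configuration that follows these guidelines, with layer widths and batch size chosen as multiples of 64:

```python
import tensorflow as tf

# Hypothetical configuration: all layer widths and the batch size are
# multiples of 64, following the guidelines above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(256,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(64, activation="softmax"),
])

batch_size = 256  # divisible by large powers of two
```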

Use EarlyStopping

The question “when do I stop the training?” is challenging to answer. A phenomenon that can occur is the deep double descent: your metrics start to worsen after a steady improvement. Then, after some updates, the scores improve once again, getting even better than before. To avoid stopping the run in between, you can use a validation dataset. This separate dataset is used to measure the performance of your algorithm on new, unseen data. If the performance does not improve for a given number of epochs (the patience parameter), the training is automatically stopped. If you choose the patience well, you can overcome temporary plateaus. A good starting value is 5 to 20 epochs.
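In Keras, this amounts to adding the EarlyStopping callback; a sketch (model and datasets are placeholders):

```python
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation metric
    patience=10,                # stop after 10 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch
)

# model.fit(train_ds, validation_data=val_ds, epochs=500,
#           callbacks=[early_stopping])  # placeholder model and datasets
```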

Use transfer-learning

The idea behind transfer learning is to utilize models that practitioners trained on massive datasets and apply them to your problems. Ideally, the network you use has been trained on the same data type (images, text, audio) and for tasks similar to yours (classification, translation, detection). There are two related approaches:

Fine-tuning

Fine-tuning is the task of taking an already trained model and updating the weights for your specific problem. Commonly, you freeze the first couple of layers since they were trained to recognize basic features. The remaining layers are then fine-tuned on your dataset.

Feature extraction

In contrast to fine-tuning, feature extraction describes an approach where you use the trained network to extract features. On top of the pre-trained model, you add your own classifier, and only this part of the network is updated; the base layers are frozen. You follow this approach because the original top was trained for a specific problem, but your task might differ. By learning a custom top from scratch, you make sure to focus on your dataset — while maintaining the benefits of a large base model.
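A sketch of feature extraction in Keras, with a commented-out hint at the fine-tuning variant (the base network, input size, and class count are example choices):

```python
import tensorflow as tf

# Feature extraction: a frozen ImageNet base with a trainable classifier on top.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the base layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # hypothetical 5 classes
])

# For fine-tuning, unfreeze (parts of) the base afterwards and train with a
# low learning rate:
# base.trainable = True
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
#               loss="sparse_categorical_crossentropy")
```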

Use data-parallel multi-GPU training

If you have access to more than one accelerator, you can speed up your training by running the algorithms on multiple GPUs. Typically, this is done in a data-parallel fashion: the network is replicated on the different devices, and each batch is split and distributed among them. The gradients are then averaged and applied to each network copy. In TensorFlow, you have multiple options for distributed training. The easiest option is to use the MirroredStrategy, but there are more strategies. For example, you can follow these tutorials if you write custom training loops (as proposed above). I noticed the largest speed-up when going from one to two and from two to three GPUs. For large datasets, this is a quick approach to minimizing training times.
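A minimal sketch with MirroredStrategy (the model is a toy example; the dataset is a placeholder):

```python
import tensorflow as tf

# Replicates the model on all visible GPUs; gradients are averaged across
# the replicas automatically.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model and optimizer must be created inside the strategy's scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# model.fit(train_ds, epochs=10)  # train_ds is a placeholder dataset
```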

Use sigmoid for multi-label settings

In cases where a sample can have more than one label, you can use the sigmoid activation function. Unlike the softmax function, the sigmoid is applied to each neuron individually, which means that multiple neurons can fire. The output values are bounded between 0 and 1, making them easy to interpret. This property is helpful, e.g., for classifying a sample into multiple classes or detecting various objects.
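A minimal sketch of a multi-label head (the input size and number of labels are hypothetical):

```python
import tensorflow as tf

num_labels = 5  # hypothetical number of independent labels per sample

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    # One sigmoid per label: each output is an independent probability.
    tf.keras.layers.Dense(num_labels, activation="sigmoid"),
])

# Binary cross-entropy treats every label as its own yes/no decision.
model.compile(optimizer="adam", loss="binary_crossentropy")
```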

Use one-hot encoding for categorical data

As we require a numerical representation, categorical data has to be encoded as numbers. For example, we cannot directly feed “bank,” but have to use an alternative representation. A tempting option is to enumerate all possible values. This approach, however, implies an ordering between “bank,” encoded as 1, and “tree,” encoded as 2. Such an ordering is rarely present, which is why we rely on one-hot vectors to encode the data. This method ensures that the encoded values are independent of each other.
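A sketch using Keras’ StringLookup layer, which can emit one-hot vectors directly in recent TensorFlow versions (the vocabulary is a toy example):

```python
import tensorflow as tf

vocabulary = ["bank", "tree", "house"]

# StringLookup maps each string to an index and, with output_mode="one_hot",
# directly to a one-hot vector (index 0 is reserved for unknown values).
encoder = tf.keras.layers.StringLookup(vocabulary=vocabulary, output_mode="one_hot")
print(encoder(tf.constant(["bank", "tree"])))
```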

Use one-hot encoding for indices

Assume you are trying to predict the weather and index the days: 1 for Monday, 2 for Tuesday, and so on. However, because such an index is arbitrary, it is better to use one-hot encoding. Similar to the previous tip, this representation implies no relationship between the indices.
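A minimal sketch with tf.one_hot (here using zero-based day indices):

```python
import tensorflow as tf

days = tf.constant([0, 1, 6])        # Monday, Tuesday, Sunday as indices
one_hot_days = tf.one_hot(days, depth=7)
print(one_hot_days)                  # shape (3, 7), no implied ordering
```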

(Re-)Scale numerical values

Networks are trained by updating the weights, and the optimizers are responsible for that. Often, they are tuned to perform best if the values lie within [-1, 1]. Why is that? Let’s think of a hilly landscape in which we search for the lowest point. The hillier the area is, the more time we have to spend searching for the global minimum. However, what if we could modify the shape of the landscape so that we find a solution faster?

That is what we do by rescaling the numerical values. When we scale the values to [-1, 1], we make the curvature more spherical (rounder, more even). If we train our model with data from this range, we converge faster.

Why is that? The size of the features (i.e., the value) influences the size of the gradient. And larger features produce larger gradients, which leads to large weight updates. These updates require more steps to converge, which slows training.

For more information, have a look at TensorFlow’s tutorial here.
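A sketch of such rescaling with Keras’ Normalization layer, which standardizes inputs based on statistics learned from the data (the raw values below are made up):

```python
import numpy as np
import tensorflow as tf

# Hypothetical raw feature values with a large spread.
raw_features = np.array([[120.0], [3500.0], [987.0], [42.0]], dtype="float32")

# The Normalization layer learns mean and variance from the data and then
# rescales inputs to roughly zero mean and unit variance.
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(raw_features)

model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
```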

Use knowledge-distillation

You have surely heard of the BERT model, haven’t you? This Transformer has a couple of hundred million parameters, but we might not be able to train it on our GPUs. That is where the process of distilling knowledge becomes useful. We train a second model to produce the output of the larger model. The input is still the original dataset, but the label is the output of the reference model, called soft output. The goal of this technique is to replicate the larger model with the help of the small model.

The question is: Why not train the small model directly? First, training smaller models, especially in the NLP domain, is more complicated than training the larger models. Secondly, the large model might be overkill for our problem: it is powerful enough to learn what we need, but it could learn even more. On the other hand, the smaller model is hard to train on its own but sufficient to store the required knowledge. We therefore distil the knowledge of the large reference model into our small secondary model. We then benefit from reduced complexity while still coming close to the original quality.
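A rough sketch of the distillation loss, matching softened teacher and student outputs with a KL divergence (the temperature and the commented training-step outline are illustrative):

```python
import tensorflow as tf

temperature = 2.0  # hypothetical softening temperature


def distillation_loss(teacher_logits, student_logits):
    # Soften both distributions and compare them with a KL divergence; the
    # student learns to match the teacher's "soft" outputs.
    teacher_probs = tf.nn.softmax(teacher_logits / temperature)
    student_probs = tf.nn.softmax(student_logits / temperature)
    return tf.keras.losses.KLDivergence()(teacher_probs, student_probs)


# Inside the training step (teacher frozen, student trainable):
# teacher_logits = teacher(x, training=False)
# with tf.GradientTape() as tape:
#     student_logits = student(x, training=True)
#     loss = distillation_loss(teacher_logits, student_logits)
```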

If you want to learn more, you can look at Hugging Face’s more detailed description here.