Neural networks are great. Though both shallow and deep neural networks are, in principle, capable of approximating any continuous function given enough hidden units, the ML community has favored deeper networks for a couple of reasons. To clear up definitions, a neural network is considered ‘deep’ if it’s made up of 2 or more hidden layers. Of course, this doesn’t say much when comparing the performance of a 152-layer convolutional neural network versus a 16-layer network (both are deep, but will perform differently).

It should go without proving that shallow networks, which contain one hidden layer, generally require more hidden units in that layer than a deeper network needs in any one of its layers to represent the same function. But having that many units does not mean they generalize well.

You see, training a shallow network *works*. But with only one hidden layer, there is not much room to capture complexity in stages. When a shallow network back-propagates, the weights tend to adjust to memorize the outputs directly. When data flows through deeper networks, features and abstractions are uncovered layer by layer, and learned.

Alright, cool. So why not make a neural network that’s a gazillion units wide and a gazillion layers deep? Well, think about the consequences complexity-wise. Regardless of your data size, training becomes a lengthy task. The number of weights, or parameters, goes up. Your relatively small dataset gets spread thinly across a ton of parameters, and you end up with a noise-filled, overfitted model.
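To put a number on "the number of weights goes up", here is a minimal sketch that counts the parameters of a fully connected network. The function name `count_parameters` and the layer sizes are made up for illustration, and it assumes plain dense layers with one bias per unit.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the units per layer, input first and output last,
    e.g. [784, 100, 10] for one hidden layer of 100 units.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between the two layers
        total += n_out         # one bias per unit in the receiving layer
    return total


if __name__ == "__main__":
    # Hypothetical architectures on a 784-pixel input with 10 classes.
    for sizes in ([784, 100, 10], [784, 100, 100, 100, 10], [784, 1000, 1000, 10]):
        print(sizes, count_parameters(sizes))
```

Every extra layer or wider layer adds parameters that the same fixed dataset has to pin down.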

If we use the wrong activation function, we might fall into the vanishing gradient trap if the network is deep enough. For example, I used scikit-learn to load up the built-in MNIST dataset. Then I trained neural networks which varied in number of layers and units. I calculated the cross-validation accuracy for each and plotted the resulting matrix below.
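An experiment along these lines could look roughly like the sketch below. This is not the original code: it assumes scikit-learn's `load_digits` (the small 8x8 digits set that ships with the library), `MLPClassifier` with the logistic activation, and a hypothetical helper named `cv_accuracy_grid`. The grid values, `cv=3`, and `max_iter=300` are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cv_accuracy_grid(layer_counts, unit_counts, cv=3, max_iter=300):
    """Return {(n_layers, n_units): mean CV accuracy} for sigmoid MLPs."""
    X, y = load_digits(return_X_y=True)  # assumption: the 8x8 digits dataset
    results = {}
    for n_layers in layer_counts:
        for n_units in unit_counts:
            model = make_pipeline(
                StandardScaler(),
                MLPClassifier(
                    hidden_layer_sizes=(n_units,) * n_layers,
                    activation="logistic",  # sigmoid, to expose vanishing gradients
                    max_iter=max_iter,
                    random_state=0,
                ),
            )
            scores = cross_val_score(model, X, y, cv=cv)
            results[(n_layers, n_units)] = scores.mean()
    return results


if __name__ == "__main__":
    grid = cv_accuracy_grid(layer_counts=[1, 3, 6], unit_counts=[32])
    for (n_layers, n_units), acc in sorted(grid.items()):
        print(f"{n_layers} layers x {n_units} units: {acc:.3f}")
```

The exact accuracies will depend on the dataset, the solver settings, and the random seed, so treat any numbers it prints as illustrative only.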

Notice how at the 6-layer network, we’re getting only ~15-30% CV accuracy, as opposed to the relatively decent accuracy for shallower networks. The derivative of the logistic function has a maximum value of 1/4. Now if we’re backpropagating through layers, multiplying gradients as we go, the gradients dwindle as we reach the first few layers, so the weights there barely update. Going forward from the start of the network, the input data then gets mangled by those poorly trained early weights before it reaches the later layers. There’s also something called the **exploding gradient problem**, where the per-layer gradient factors are greater than 1, which causes the gradients (and with them the weight updates) to get really big when the networks are really deep.
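The arithmetic behind both problems is just repeated multiplication. The sketch below checks the 1/4 bound on the logistic derivative and shows how a per-layer factor compounds with depth. It is a simplification: it treats each layer as contributing one scalar factor, whereas a real network multiplies by weight matrices as well. The function names are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at x = 0, where it equals 1/4


def gradient_scale(per_layer_factor, n_layers):
    """Scale applied to a gradient after passing back through n_layers,
    assuming every layer multiplies it by the same scalar factor."""
    return per_layer_factor ** n_layers


if __name__ == "__main__":
    best_case = sigmoid_derivative(0.0)
    for depth in (1, 2, 6, 20):
        # Factor below 1: the gradient shrinks geometrically (vanishing).
        print("sigmoid best case, depth", depth, gradient_scale(best_case, depth))
        # Hypothetical factor above 1: the gradient grows geometrically (exploding).
        print("factor 1.5, depth", depth, gradient_scale(1.5, depth))
```

Even in the best case for the logistic function, six layers shrink the gradient by a factor of 4^6 = 4096 before any weights are taken into account.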

So what is the consensus? Keep your networks small, just enough so that they perform well on test data. Do not make them unnecessarily big.

Anyways, this was a relatively qualitative post. Uh, use at least two hidden layers for your network … 😀