History and applications of Deep Learning

March 14, 2016

by Francesco Gadaleta

Produced by: worldofpiggy.com

What is deep learning?

For the impatient: deep learning is the result of training many layers of non-linear processing units for feature extraction and data transformation, e.g. from pixels, to edges, to shapes, to object classification, to scene description, captioning, etc.
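Why insist on non-linear units? The sketch below is a toy illustration, not anything from the episode: the weights `W1` and `W2` are hand-picked, hypothetical values. It shows that two stacked linear layers collapse into a single linear map, while putting a ReLU between them lets the same weights compute something no single linear layer can, namely the absolute difference of the two inputs.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, vi) for vi in v]

# Hypothetical weights, chosen by hand for illustration only.
W1 = [[1.0, -1.0], [-1.0, 1.0]]   # first layer: 2 inputs -> 2 hidden units
W2 = [[1.0, 1.0]]                 # second layer: 2 hidden units -> 1 output

def linear_stack(x):
    """Two layers with no non-linearity in between."""
    return matvec(W2, matvec(W1, x))

def nonlinear_stack(x):
    """The same two layers with a ReLU in between."""
    return matvec(W2, relu(matvec(W1, x)))

# Without the non-linearity the two layers are equivalent to one matrix.
W_collapsed = matmul(W2, W1)

x = [3.0, 1.0]
print(linear_stack(x), matvec(W_collapsed, x))  # identical outputs
print(nonlinear_stack(x))                       # |3 - 1|
```

With these particular weights the collapsed matrix is all zeros, so the purely linear stack ignores its input entirely, whereas the ReLU version returns `|a - b|` for any input `[a, b]`.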


It is as old as the 80s! Then why was this approach abandoned for a while? The answer lies in the lack of large training datasets and computing power in the early days. However, five major events occurred over the years, and all of them contributed to defining what we call deep learning today and to making it possible.

  • Fukushima’s Neocognitron introduced convolutional neural networks partially trained by unsupervised learning with human-directed features in the neural plane.

  • Backpropagation: Yann LeCun et al. (1989) (* check Errata) applied supervised backpropagation to such architectures. Weng et al. (1992) published Cresceptron, a convolutional neural network for 3-D object recognition from images of cluttered scenes and for segmentation of such objects from images.

  • Max-pooling (1992) appears to have been first proposed in Cresceptron to enable the network to tolerate small-to-large deformations in a hierarchical way while using convolution. Max-pooling helps, but does not guarantee, shift-invariance at the pixel level.
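To see what "helps, but does not guarantee" means for max-pooling, here is a minimal sketch. It assumes non-overlapping 2x2 windows on a tiny 4x4 image; the helper names `max_pool_2x2` and `single_pixel` are made up for this example and do not come from Cresceptron.

```python
def max_pool_2x2(image):
    """Non-overlapping 2x2 max-pooling over a list-of-rows image.

    Assumes the image has even height and width.
    """
    h, w = len(image), len(image[0])
    return [
        [max(image[r][c], image[r][c + 1],
             image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]

def single_pixel(row, col, size=4):
    """Return a size x size image of zeros with one active pixel."""
    img = [[0] * size for _ in range(size)]
    img[row][col] = 1
    return img

a = max_pool_2x2(single_pixel(0, 0))
b = max_pool_2x2(single_pixel(0, 1))  # shifted by one pixel, same window
c = max_pool_2x2(single_pixel(0, 2))  # shifted across a window boundary

print(a == b)  # the small shift is absorbed by the pooling window
print(a == c)  # the larger shift changes the pooled output
```

A shift that stays inside one pooling window leaves the output unchanged, but a shift that crosses a window boundary does not, which is why pooling only gives partial shift-invariance.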

People tried to train deep networks and mostly failed. Why? Sepp Hochreiter's diploma thesis of 1991 formally identified the reason for this failure as the vanishing gradient problem, which affects many-layered feedforward networks and recurrent neural networks.

Pre-training (Geoffrey Hinton)

Other methods use unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. Then the network is trained further by supervised back-propagation to classify labeled data. The deep model of Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine (Smolensky, 1986) to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log-likelihood of the data, thus improving the model, if trained properly.

Looks like a tongue-twister, right? Well, it basically says that, if trained well, a network can generate data similar to the training data it was fed. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an “ancestral pass”) from the top-level feature activations. Hinton reports that his models are effective feature extractors over high-dimensional, structured data.

What are vanishing gradients?

Recurrent networks are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. As errors propagate from layer to layer, they shrink exponentially with the number of layers, impeding the tuning of neuron weights, which is based on those errors. LSTMs were proposed as a solution in 1997.
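The exponential shrinkage can be checked with a few lines of arithmetic. The sketch below is a deliberately simplified assumption: a scalar chain h_{t+1} = sigmoid(w * h_t) with one shared weight `w`, standing in for an unfolded recurrent network. By the chain rule, each layer multiplies the gradient by w * sigmoid'(z), and since sigmoid'(z) is at most 0.25, with w = 1 the gradient after T layers is at most 0.25^T. The function name and default values are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_chain(depth, w=1.0, h0=0.5):
    """Return d h_depth / d h_0 for the scalar chain h_{t+1} = sigmoid(w * h_t)."""
    h, grad = h0, 1.0
    for _ in range(depth):
        h = sigmoid(w * h)
        # Chain rule: d sigmoid(w * h_prev) / d h_prev = w * s * (1 - s),
        # where s is the sigmoid output just computed.
        grad *= w * h * (1.0 - h)
    return grad

grads = {depth: gradient_through_chain(depth) for depth in (1, 5, 20, 50)}
for depth, g in grads.items():
    print(f"depth {depth:2d}: gradient {g:.3e}")
```

Running it shows the gradient dropping by many orders of magnitude as the depth grows, so the earliest layers (or time steps) receive almost no learning signal.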

And now, the power of deep learning

One of the promises of deep learning is replacing handcrafted features with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction.

For supervised learning tasks, deep learning methods obviate feature engineering by translating the data into compact intermediate representations akin to principal components, and derive layered structures that remove redundancy in representation. Unlike PCA, however, which is a linear method and ignores non-linearities in the data, deep networks can capture non-linear structure. Many deep learning algorithms, e.g. autoencoders, are applied to unsupervised learning tasks. This is an important benefit because unlabeled data are usually more abundant than labeled data.

Build your deep learning machine

Google Photos: search images by text

Automatically color black-and-white images

Teradeep: real-time object classifier, like in Terminator (1984)


Errata

As Paul P. suggested, Rumelhart, Hinton, and Williams should be credited with discovering back-propagation, not LeCun.

References are from Hinton Backprop

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986) Learning representations by back-propagating errors. Nature, 323, 533–536.

Hinton, G. E. (1986) Learning distributed representations of concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Amherst, Mass. Reprinted in Morris, R. G. M., editor, Parallel Distributed Processing: Implications for Psychology and Neurobiology, Oxford University Press, Oxford, UK.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986) Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press, Cambridge, MA.