History and applications of Deep Learning: A New Podcast Episode

What is deep learning?

For the impatient: deep learning is the result of training many layers of
non-linear processing units for feature extraction and data transformation,
e.g. from pixels to edges, to shapes, to object classification, to scene description, captioning, etc.
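
To make "many layers of non-linear processing units" a little more concrete, here is a minimal sketch in plain Python. The weights, the layer sizes and the choice of `tanh` are assumptions picked purely for illustration; a real network would learn its weights from data.

```python
import math

def dense(inputs, weights, biases):
    """One layer: a weighted sum per unit, followed by a tanh non-linearity."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the input through every layer, keeping each intermediate representation."""
    activations = [inputs]
    for weights, biases in layers:
        activations.append(dense(activations[-1], weights, biases))
    return activations

# Hypothetical hand-picked weights: 4 "pixels" -> 3 units -> 2 units -> 1 unit.
layers = [
    ([[0.5, -0.5, 0.0, 0.0],
      [0.0, 0.5, -0.5, 0.0],
      [0.0, 0.0, 0.5, -0.5]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 0.0],
      [0.0, 1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.1]),
]

# Print the representation produced at each depth of the stack.
for depth, representation in enumerate(forward([0.0, 1.0, 1.0, 0.0], layers)):
    print(depth, [round(value, 3) for value in representation])
```

Each printed row is the representation one level further from the raw input, which is the pixel-to-edges-to-shapes idea in miniature.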


The idea is as old as the 80s! So why was this approach abandoned for a while?
The answer lies in the lack of large training datasets and computing power in the early days.
However, five major developments each helped define deep learning and make what we call by that name today possible.

  1. Fukushima’s Neocognitron introduced convolutional neural networks partially trained by unsupervised learning with human-directed features in the neural plane.
  2. Backpropagation. Yann LeCun et al. (1989) applied supervised backpropagation to such architectures. Weng et al. (1992) published the Cresceptron, a convolutional neural network for 3-D object recognition from images of cluttered scenes and for segmentation of such objects from images.
  3. Max-pooling (1992) appears to have been first proposed in the Cresceptron to enable the network to tolerate small-to-large deformations in a hierarchical way while using convolution. Max-pooling helps, but does not guarantee, shift-invariance at the pixel level.
  4. People tried to train deep networks and mostly failed. Why?
    Sepp Hochreiter’s diploma thesis of 1991 formally identified the reason for this failure as the vanishing gradient problem, which affects many-layered feedforward networks and recurrent neural networks.

    What are vanishing gradients?
    Recurrent networks are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. As errors propagate from layer to layer, they shrink exponentially with the number of layers, which impedes the tuning of the neuron weights that depends on those errors. LSTMs were proposed as a solution in 1997.


  5. Pre-training (Geoffrey Hinton)
    Other methods use unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. The network is then trained further by supervised backpropagation to classify labeled data. The deep model of Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine (Smolensky, 1986) to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log likelihood of the data, thus improving the model, if trained properly.
    Looks like a tongue-twister, right? Well, it basically says that, if trained well, a network can generate data similar to the training data it was fed.
    Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an “ancestral pass”) from the top-level feature activations. Hinton reports that his models are effective feature extractors over high-dimensional, structured data.
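
The vanishing gradient effect from item 4 can be illustrated with a few lines of plain Python. This is only a sketch under simplifying assumptions: one sigmoid unit per layer, every weight fixed to 1.0, and an arbitrary input of 0.5. It is not how any of the cited networks were actually built, but it shows the exponential shrinking described above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(x, weights):
    """Return d(output)/d(input) for a chain of scalar sigmoid units.

    Each layer computes h = sigmoid(w * h_prev). By the chain rule, the
    gradient is the product of w * sigmoid'(z) over all layers.
    """
    h = x
    grad = 1.0
    for w in weights:
        z = w * h
        h = sigmoid(z)
        grad *= w * h * (1.0 - h)  # sigmoid'(z) = h * (1 - h), never above 0.25
    return grad

# Assumed setup: all weights equal to 1.0, input 0.5, increasing depth.
for depth in (1, 5, 10, 20):
    g = chain_gradient(0.5, [1.0] * depth)
    print(f"depth={depth:2d}  |gradient|={abs(g):.3e}")
```

Because each layer multiplies the gradient by a factor of at most 0.25 in this setup, the signal reaching the first layer shrinks at least as fast as 0.25 raised to the number of layers, so the earliest weights barely move during training.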


And now, the power of deep learning.

One of the promises of deep learning is replacing handcrafted features with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction.
For supervised learning tasks, deep learning methods obviate feature engineering by translating the data into compact intermediate representations akin to principal components, and they derive layered structures that remove redundancy in the representation. Unlike deep networks, however, PCA is a linear method that ignores non-linearities in the data. Many deep learning algorithms, e.g. autoencoders, are applied to unsupervised learning tasks. This is an important benefit because unlabeled data are usually more abundant than labeled data.
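
As a rough illustration of learning from unlabeled data, the sketch below trains a tiny autoencoder in plain Python. Everything in it is an assumption made for the example: the toy 2-D data, the single `tanh` code unit, the starting weights, the learning rate and the number of steps. It compresses each point to one number and learns to reconstruct the point from that number, using no labels at all.

```python
import math

# Unlabeled 2-D points lying on a line (hypothetical toy data, no targets).
data = [(t / 10.0, 2.0 * t / 10.0) for t in range(-5, 6)]

def reconstruct(x, w, v):
    """Encode x into a single number, then decode it back into two numbers."""
    code = math.tanh(w[0] * x[0] + w[1] * x[1])  # non-linear encoder
    return code, (code * v[0], code * v[1])       # linear decoder

def mean_loss(w, v):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(x, w, v)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

def train(steps=5000, lr=0.1):
    w, v = [0.1, 0.1], [0.1, 0.1]  # small deterministic starting weights
    losses = [mean_loss(w, v)]
    for _ in range(steps):
        gw, gv = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            code, x_hat = reconstruct(x, w, v)
            r = (x_hat[0] - x[0], x_hat[1] - x[1])  # reconstruction error
            # Gradient of the squared error with respect to the encoder's pre-activation.
            d_pre = 2.0 * (r[0] * v[0] + r[1] * v[1]) * (1.0 - code ** 2)
            for i in range(2):
                gv[i] += 2.0 * code * r[i] / len(data)
                gw[i] += d_pre * x[i] / len(data)
        for i in range(2):  # full-batch gradient descent step
            w[i] -= lr * gw[i]
            v[i] -= lr * gv[i]
        losses.append(mean_loss(w, v))
    return w, v, losses

w, v, losses = train()
print(f"reconstruction error: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The only training signal is the input itself, which is why this family of models can take advantage of abundant unlabeled data.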


