Ian Goodfellow - Deep Learning - 2016

Created: January 1, 2017 / Updated: May 14, 2017 / Status: finished / 73 min read (~14450 words)

$\usepackage{amsmath}$
$\newcommand{\vector}{\boldsymbol}$

  • In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers - problems that can be described by a list of formal, mathematical rules
  • The true challenge to artificial intelligence proved to be ... problems that we solve intuitively, that feel automatic, ...
  • The book's solution to this problem is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts
  • If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers
  • Abstract and formal tasks that are among the most difficult mental undertakings for a human being are among the easiest for a computer
  • One of the key challenges in artificial intelligence is how to get this informal knowledge into a computer
  • The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data (known as machine learning)
  • The performance of (these) simple machine learning algorithms depends heavily on the representation of the data they are given
  • Each piece of information included in the representation ... is known as a feature
  • Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm
  • Representation learning: To discover not only the mapping from representation to output but also the representation itself
  • Learned representations often result in much better performance than can be obtained with hand-designed representations
  • An autoencoder is the combination of an encoder function that converts the input data into a different representation, and a decoder function that converts the new representation back into the original format
  • When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us
  • Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations
  • Deep learning allows the computer to build complex concepts out of simpler concepts
  • Depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer's memory after executing another set of instructions in parallel
  • Networks with greater depth can execute more instructions in sequence
  • According to this view of deep learning, not all the information in a layer's activations necessarily encodes factors of variation that explain the input. The representation also stores state information that helps to execute a program that can make sense of the input
  • Three waves of development of deep learning
    • 1940-1960 Cybernetics
    • 1980-1990 Connectionism
    • 2006-now Deep learning
  • The neural perspective on deep learning is motivated by two main ideas:
    • The brain provides a proof by example that intelligent behavior is possible, and a conceptually straightforward path to building intelligence is to reverse engineer the computational principles behind the brain and duplicate its functionality
    • It would be deeply interesting to understand the brain and the principles that underlie human intelligence, so machine learning models that shed light on these basic scientific questions are useful apart from their ability to solve engineering applications
  • Neuroscience has given us a reason to hope that a single deep learning algorithm can solve many different tasks
  • We are able to draw some rough guidelines from neuroscience
    • The basic idea of having many computational units that become intelligent only via their interactions with each other is inspired by the brain
    • Greater neural realism has not yet led to an improvement in machine learning performance
    • We do not yet know enough about biological learning for neuroscience to offer much guidance for the learning algorithms we use to train neural network architectures
  • One should not view deep learning as an attempt to simulate the brain
  • The central idea in connectionism is that a large number of simple computational units can achieve intelligent behavior when networked together
  • Distributed representation: Each input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs
  • The amount of skill required (to get good performances from a deep learning algorithm) reduces as the amount of training data increases
  • A rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples
  • One of the main insights of connectionism is that animals become intelligent when their neurons work together
  • Unless new technologies allow faster scaling, artificial neural networks will not have the same number of neurons as the human brain until at least the 2050s

~

  • There are three possible sources of uncertainty:
    • Inherent stochasticity in the system being modeled
    • Incomplete observability
    • Incomplete modeling
  • The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For discrete variables this can be computed with a summation

    $$ \mathbb{E}_{\text{x} \sim P}[f(x)] = \sum_x P(x)f(x)$$

  • For continuous variables, it is computed with an integral

    $$ \mathbb{E}_{\text{x} \sim p}[f(x)] = \int p(x)f(x)\,dx$$

  • The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred
  • The self-information of an event $\text{x} = x$ is

    $$ I(x) = -\log P(x)$$

  • We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

    $$ H(\text{x}) = \mathbb{E}_{\text{x} \sim P}[I(x)] = - \mathbb{E}_{\text{x} \sim P}[\log P(x)]$$
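
A minimal NumPy sketch (illustration only) of self-information and Shannon entropy for a discrete distribution, using the natural log so the results are in nats:

```python
import numpy as np

def self_information(p):
    # I(x) = -log P(x), in nats (use np.log2 instead for bits)
    return -np.log(p)

def shannon_entropy(probs):
    # H(x) = E[I(x)] = -sum_x P(x) log P(x); terms with P(x) = 0 contribute nothing
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(np.sum(nonzero * -np.log(nonzero)))

# A fair coin has the maximum entropy of any two-outcome distribution: log 2 nats
print(shannon_entropy([0.5, 0.5]))   # ~0.693
print(shannon_entropy([0.9, 0.1]))   # lower: the outcome is less uncertain
```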

  • Ascending an objective function of discrete parameters is called hill climbing
  • A Lipschitz continuous function is a function f whose rate of change is bounded by a Lipschitz constant $\mathcal{L}$

    $$ \forall \vector{x}, \forall \vector{y}, |f(\vector{x}) - f(\vector{y})| \le \mathcal{L}||\vector{x}-\vector{y}||_2$$

  • This property is useful because it allows us to quantify our assumption that a small change in the input made by an algorithm such as gradient descent will produce a small change in the output

  • A computer is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E
  • The process of learning itself is not the task
  • Learning is our means of attaining the ability to perform the task
  • Machine learning tasks are usually described in terms of how the machine learning system should process an example
  • An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process
  • Common machine learning tasks:
    • Classification
    • Classification with missing inputs
    • Regression
    • Transcription
    • Machine translation
    • Structured output
    • Anomaly detection
    • Synthesis and sampling
    • Imputation of missing values
    • Denoising
    • Density estimation or probability mass function estimation
  • Unsupervised learning involves observing several examples of a random vector $\textbf{x}$, and attempting to implicitly or explicitly learn the probability distribution $p(\textbf{x})$, or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector $\textbf{x}$ and an associated value or vector $\textbf{y}$, and learning to predict $\textbf{y}$ from $\textbf{x}$, usually by estimating $p(\textbf{y}\ |\ \textbf{x})$
  • The ability to perform well on previously unobserved inputs is called generalization
  • What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well as the training error
  • The generalization error is defined as the expected value of the error on a new input
  • The i.i.d. assumptions
    • The examples in each dataset are independent from each other
    • The training and test sets are identically distributed, that is, drawn from the same probability distribution
  • Two central challenges in machine learning: underfitting and overfitting
    • Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set
    • Overfitting occurs when the gap between the training error and the test error is too large
  • We can control whether a model is more likely to overfit or underfit by altering its capacity
  • A model's capacity is its ability to fit a wide variety of functions
  • One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution
  • Expected generalization error can never increase as the number of training examples increases
  • (The no free lunch theorem) means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the "real world" that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about
  • Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error
  • Training set:
    • 80% training examples
    • 20% validation examples (to optimize hyperparameters)
  • k-fold cross-validation: a partition of the dataset is formed by splitting it into k non-overlapping subsets. The test error may then be estimated by taking the average test error across k trials
  • We often estimate the generalization error by computing the sample mean of the error on the test set
  • Bias and variance measure two different sources of error in an estimator
    • Bias measures the expected deviation from the true value of the function or parameter
    • Variance provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause
  • Under appropriate conditions, the maximum likelihood estimator has the property of consistency, meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are:
    • The true distribution $p_{data}$ must lie within the model family $p_{model}(\cdot; \vector{\theta})$. Otherwise, no estimator can recover $p_{data}$
    • The true distribution $p_{data}$ must correspond to exactly one value of $\vector{\theta}$. Otherwise, maximum likelihood can recover the correct $p_{data}$, but will not be able to determine which value of $\vector{\theta}$ was used by the data generating process
  • Consistent estimators can differ in their statistical efficiency, meaning that one consistent estimator may obtain lower generalization error for a fixed number of samples m, or equivalently, may require fewer examples to obtain a fixed level of generalization error
  • When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited
  • The support vector machine (SVM) is similar to logistic regression in that it is driven by a linear function $\vector{w^\top x} + b$
  • One key innovation associated with SVM is the kernel trick
    • The kernel trick consists of observing that many machine learning algorithms can be written exclusively in terms of dot products between examples
  • The kernel-based function is exactly equivalent to preprocessing the data by applying $\phi(\vector{x})$ to all inputs, then learning a linear model in the new transformed space
  • The kernel trick is powerful for two reasons:
    • It allows us to learn models that are nonlinear as a function of $\vector{x}$ using convex optimization techniques that are guaranteed to converge efficiently
    • The kernel function k often admits an implementation that is significantly more computationally efficient than naively constructing two $\phi(\vector{x})$ vectors and explicitly taking their dot products
  • The most commonly used kernel is the Gaussian kernel, also known as the radial basis function (RBF) kernel
  • A major drawback to kernel machines is that the cost of evaluating the decision function is linear in the number of training examples, because the i-th example contributes a term to the decision function
    • SVMs are able to mitigate this by learning an $\vector{\alpha}$ vector that contains mostly zeros (a minimal sketch of such a decision function follows this list)
  • k-nearest neighbors is a family of techniques that can be used for classification or regression
  • One weakness of k-nearest neighbors is that it cannot learn that one feature is more discriminative than another
  • Decision trees as they are typically used, with axis-aligned splits and constant outputs within each node, struggle to solve some problems that are easy even for logistic regression (e.g., a two class problem where $x_2 > x_1$, the decision boundary is not axis-aligned and the decision tree will thus need to approximate the decision boundary with many nodes, implementing a step function that constantly walks back and forth across the true decision function with axis-aligned steps)
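
Relating to the kernel-machine bullets above, a minimal sketch of a decision function of the form $f(\vector{x}) = b + \sum_i \alpha_i k(\vector{x}, \vector{x}^{(i)})$ with a Gaussian (RBF) kernel; the coefficients here are made up for illustration (an SVM would learn $\vector{\alpha}$ and $b$), and skipping zero coefficients shows why a mostly-zero $\vector{\alpha}$ keeps evaluation cheap:

```python
import numpy as np

def rbf_kernel(x, x_i, gamma=1.0):
    # Gaussian / RBF kernel: k(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - x_i) ** 2))

def decision_function(x, X_train, alpha, b, gamma=1.0):
    # f(x) = b + sum_i alpha_i * k(x, x_i); the sum runs over training examples,
    # which is why evaluation is linear in training-set size unless alpha is sparse
    return b + sum(a * rbf_kernel(x, x_i, gamma)
                   for a, x_i in zip(alpha, X_train) if a != 0.0)

# Toy usage with made-up coefficients
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alpha = np.array([0.7, -0.4, 0.0])   # the zero entry is skipped entirely
print(decision_function(np.array([0.5, 0.5]), X_train, alpha, b=0.1))
```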

  • There are multiple ways of defining a simpler representation
    • Lower dimensional representations: attempt to compress as much information about x as possible in a smaller representation
    • Sparse representations: embed the dataset into a representation whose entries are mostly zeroes for most inputs
    • Independent representations: attempt to disentangle the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent
  • PCA learns a representation that has lower dimensionality than the original input
  • It also learns a representation whose elements have no linear correlation with each other
  • The k-means algorithm works by initializing k different centroids to different values, then alternating between two different steps until convergence
    • In one step, each training example is assigned to cluster i, where i is the index of the nearest centroid $\vector{\mu}^{(i)}$
    • In the other step, each centroid $\vector{\mu}^{(i)}$ is updated to the mean of all training examples $\vector{x}^{(j)}$ assigned to cluster i
  • Clustering may end up generating different clusters depending on the features that are used to create the clusters (red cars/trucks vs gray cars/trucks, red/gray cars vs red/gray trucks)
  • The k-means clustering algorithm divides the training set into k different clusters of examples that are near each other
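
A minimal NumPy sketch of the two alternating k-means steps described above; initialization and the convergence test are deliberately simplified:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the k centroids to k distinct training examples
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each example to the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of the examples assigned to it
        for i in range(k):
            if np.any(assign == i):
                centroids[i] = X[assign == i].mean(axis=0)
    return centroids, assign

# Toy usage: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, assign = kmeans(X, k=2)
print(centroids)
```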

  • The insight to stochastic gradient descent is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples

  • Nearly all deep learning algorithms can be described as a particular instance of a fairly simple recipe:
    • a specification of a dataset
    • a cost function
    • an optimization procedure
    • a model

  • The core idea in deep learning is that we assume that the data was generated by the composition of factors or features, potentially at multiple levels in a hierarchy
  • Manifold learning algorithms assume that most of $\mathbb{R}^n$ consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another
  • Observations in favor of the manifold hypothesis
    • The probability distribution over images, text strings, and sounds that occur in real life is highly concentrated
    • We can imagine neighborhoods and transformations of the manifold

  • It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization
  • The general principle of improving models by learning features extends beyond the feedforward networks. It is a recurring theme of deep learning that applies to all kinds of models
  • Multiple decisions must be made when designing a feedforward network:
    • Activation functions
    • Network architecture
      • How many layers?
      • How should the layers be connected?
      • How many units should be in each layer?
  • In modern neural networks, the default recommendation is to use the rectified linear unit (ReLU), defined by the activation function $g(z) = \max\{0, z\}$

  • For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values
  • Most modern neural networks are trained using maximum likelihood
  • The choice of cost function is tightly coupled with the choice of output unit
  • Objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish
  • In particular, squared error is a poor loss function for softmax units, and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions
  • From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it: the softmax outputs always sum to 1 so an increase in the value of one unit necessarily corresponds to a decrease in the value of others
  • Softmax => soft argmax

  • The rectified linear unit uses the activation function $g(z) = \max\{0, z\}$
  • One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero
  • Three generalizations of rectified linear units are based on using a non-zero slope $\alpha_i$ when $z_i < 0$: $h_i = g(\vector{z}, \vector{\alpha})_i = \max(0, z_i) + \alpha_i \min(0, z_i)$ (sketched after this list)
    • Absolute value rectification, $\alpha_i = -1$
    • Leaky ReLU, $\alpha_i \approx 0.01$
    • Parametric ReLU (PReLU), $\alpha_i$ as a learnable parameter
  • With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity
  • Maxout units typically need more regularization than rectified linear units
  • Maxout units can work well without regularization if the training set is large and the number of pieces per unit is kept low
  • ReLU and all of these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear
  • The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged
  • The use of sigmoidal units as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer
  • When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid
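
A minimal NumPy sketch of the generalized rectifier $h_i = \max(0, z_i) + \alpha_i \min(0, z_i)$ from the list above, showing how the choice of $\alpha$ recovers the ReLU, absolute value rectification, and leaky ReLU variants:

```python
import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(generalized_relu(z, alpha=0.0))    # plain ReLU: max{0, z}
print(generalized_relu(z, alpha=-1.0))   # absolute value rectification: |z|
print(generalized_relu(z, alpha=0.01))   # leaky ReLU: small fixed slope for z < 0
# In a PReLU, alpha would instead be a learnable parameter.
```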

  • The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other
  • Most neural networks are organized into groups of units called layers
  • Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it
  • The main considerations are to choose the depth of the network and the width of each layer
  • The universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units
  • Borel measurability: any continuous function on a closed and bounded subset of $\mathbb{R}^n$ is Borel measurable
  • The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent that function
    • However, we are not guaranteed that the training algorithm will be able to learn that function
  • Learning can fail for two reasons:
    • The optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function
    • The training algorithm might choose the wrong function due to overfitting

  • Krylov methods are a set of iterative techniques for performing various operations like approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products

  • 17th century: Chain rule (Leibniz, L'Hôpital)
  • 19th century: Gradient descent as a technique for iteratively approximating the solution to optimization problems (Cauchy)
  • 1940s: Machine learning models such as the perceptron
  • Most of the improvement in neural network performance from 1986 to 2015 can be attributed to two factors:
    • Larger datasets have reduced the degree to which statistical generalization is a challenge for neural networks
    • Neural networks have become much larger, due to more powerful computers, and better software infrastructure
  • A small number of algorithmic changes have improved the performance of neural networks noticeably
    • The replacement of mean squared error with the cross-entropy family of loss functions
    • The replacement of the sigmoid hidden units with piecewise linear hidden units, such as the rectified linear units

  • Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error
    • These strategies are known collectively as regularization
  • Regularization of an estimator works by trading increased bias for reduced variance

$$ \tilde{J}(\vector{\theta}; \vector{X}, \vector{y}) = J(\vector{\theta}; \vector{X}, \vector{y}) + \alpha \Omega(\vector{\theta})$$

  • We want to decrease both the original objective J on the training data and some measure of the size of the parameters $\vector{\theta}$
  • For neural networks, we typically choose to use a parameter norm penalty $\Omega$ that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized
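
A minimal sketch of the regularized objective $\tilde{J} = J + \alpha\Omega(\vector{\theta})$ with an $L^2$ penalty applied only to the weights, not the biases; the squared-error linear model here is just a stand-in for whatever objective $J$ is being trained:

```python
import numpy as np

def regularized_objective(w, b, X, y, alpha):
    preds = X @ w + b
    data_loss = np.mean((preds - y) ** 2)   # original objective J
    penalty = 0.5 * np.sum(w ** 2)          # Omega(theta): L2 norm of the weights only
    return data_loss + alpha * penalty      # the bias b is left unregularized

X = np.random.randn(32, 4)
y = np.random.randn(32)
w, b = np.random.randn(4), 0.0
print(regularized_objective(w, b, X, y, alpha=0.01))
```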

  • The best way to make a machine learning model generalize better is to train it on more data
  • Injecting noise in the input to a neural network can be seen as a form of data augmentation

  • A very common form of multi-task learning is one where different supervised tasks (predicting $\textbf{y}^{(i)}$ given $\textbf{x}$) share the same input $\textbf{x}$, as well as some intermediate-level representation $\vector{h}^{(shared)}$ capturing a common pool of factors. The model can generally be divided into two kinds of parts and associated parameters:
    • Task-specific parameters
    • Generic parameters, shared across all the tasks

  • Every time the error on the validation set improves, we store a copy of the model parameters
  • When the training algorithm terminates, we return these parameters, rather than the latest parameters
  • The algorithm terminates when no parameters have improved over the best recorded validation error for some pre-specified number of iterations
  • Early stopping has the advantage over weight decay that early stopping automatically determines the correct amount of regularization while weight decay requires many training experiments with different values of its hyperparameter
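
A minimal sketch of the early-stopping loop described above; `train_one_epoch` and `validation_error` are hypothetical placeholders for the training and evaluation routines actually in use:

```python
import copy

def early_stopping(model, train_one_epoch, validation_error,
                   patience=10, max_epochs=1000):
    best_error = float("inf")
    best_params = copy.deepcopy(model)       # copy of the best parameters seen so far
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_error:                 # validation error improved: store a copy
            best_error = err
            best_params = copy.deepcopy(model)
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:
            break                            # no improvement for `patience` epochs: stop
    return best_params, best_error           # return the best parameters, not the latest
```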

  • While a parameter norm penalty is one way to regularize parameters to be close to one another, the more popular way is to use constraints: to force sets of parameters to be equal

  • Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models
  • The idea is to train several different models separately, then have all of the models vote on the output for test examples
  • This is a general strategy in machine learning called model averaging
  • Techniques employing this strategy are known as ensemble methods

  • Dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network
  • One advantage of dropout is that it is very computationally cheap
  • Another advantage of dropout is that it does not significantly limit the type of model or training procedure that can be used
  • Dropout is less effective when very few labeled training examples are available (<5000)
  • One of the key insights of dropout is that training a network with stochastic behavior and making predictions by averaging over multiple stochastic decisions implements a form of bagging with parameter sharing
  • Each hidden unit must be able to perform well regardless of which other hidden units are in the model
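
A minimal sketch of dropout at training time, using the common "inverted dropout" formulation (activations are rescaled by the keep probability during training so nothing needs to change at test time); this is an illustration rather than the book's exact presentation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, keep_prob=0.5, training=True):
    if not training:
        return h                              # at test time, use all units
    mask = rng.random(h.shape) < keep_prob    # keep each unit independently with prob keep_prob
    return h * mask / keep_prob               # rescale so the expected activation is unchanged

h = np.ones((2, 8))
print(dropout(h))   # roughly half the entries zeroed, the kept ones scaled to 2.0
```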

  • Training on adversarially perturbed examples from the training set (taking examples, modifying them slightly such that the current network will return a different label than the expected one)
  • The primary causes of these adversarial examples is excessive linearity
  • Adversarial training discourages highly sensitive locally linear behavior by encouraging the network to be locally constant in the neighborhood of the training data

  • Optimization algorithms that use the entire training set are called batch or deterministic gradient methods
  • Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online methods

  • Ill-conditioning of the Hessian can cause SGD to get "stuck", in the sense that even very small steps increase the cost function
  • The model identifiability problem: A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model's parameters
  • Local minima can be problematic if they have high cost in comparison to the global minimum
  • A test that can rule out local minima as the problem is to plot the norm of the gradient over time. If the norm of the gradient does not shrink to insignificant size, the problem is neither local minima nor any other kind of critical point
  • We can think of a saddle point as being a local minimum along one cross-section of the cost function and a local maximum along another cross-section
  • For a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the expected ratio of the number of saddle points to local minima grows exponentially with $n$
  • The cliff can be dangerous whether we approach it from above or below, but fortunately its most serious consequences can be avoided using the gradient clipping heuristic
  • The basic idea is that the gradient does not specify the optimal step size, but only the optimal direction within an infinitesimal region
  • Cliff structures are most common in the cost functions for recurrent neural networks, because such models involve a multiplication of many factors, with one factor for each time step
  • The vanishing and exploding gradient problem refers to the fact that gradients propagated through a computational graph that contains a path consisting of repeatedly multiplying by a matrix $\vector{W}$ are scaled according to $\text{diag}(\vector{\lambda})^t$, where $\vector{\lambda}$ is the vector of eigenvalues of $\vector{W}$
  • Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function
  • Exploding gradients can make learning unstable

  • $\epsilon_0$: Initial learning rate
  • $\tau$: the number of iterations until $\epsilon_\tau$ is reached
  • $\epsilon_\tau$: final learning rate (generally about 1% of $\epsilon_0$)
  • Typically, the optimal initial learning rate, in terms of total training time and the final cost value, is higher than the learning rate that yields the best performance after the first 100 iterations or so
  • The most important property of SGD is that computation time per update does not grow when the number of training examples becomes very large
  • The method of momentum is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients
  • The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction
  • Momentum aims primarily to solve two problems:
    • Poor conditioning of the Hessian matrix
    • Variance in the stochastic gradient
  • Nesterov momentum is a variant of the momentum algorithm
  • The difference is where the gradient is evaluated
    • Nesterov: the gradient is evaluated after the current velocity is applied
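
A minimal sketch of SGD with momentum combined with the linear learning-rate decay from $\epsilon_0$ to $\epsilon_\tau$ over $\tau$ iterations described above; `grad` is a placeholder for a minibatch gradient function:

```python
import numpy as np

def sgd_momentum(theta, grad, eps0=0.1, eps_tau=0.001, tau=1000,
                 momentum=0.9, n_steps=2000):
    v = np.zeros_like(theta)                   # velocity: decaying average of past gradients
    for k in range(n_steps):
        a = min(k / tau, 1.0)
        eps = (1 - a) * eps0 + a * eps_tau     # linear decay until iteration tau, then constant
        g = grad(theta)
        v = momentum * v - eps * g             # accumulate gradients in the velocity
        theta = theta + v                      # move in the direction of the velocity
    return theta

# Toy usage: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta
print(sgd_momentum(np.array([5.0, -3.0]), grad=lambda t: t))
```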

  • Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations
  • The initial parameters need to "break symmetry" between different units
    • If two hidden units have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way

  • The learning rate is reliably one of the hyperparameters that is the most difficult to set because it has significant impact on model performance
  • The delta-bar-delta algorithm is an early heuristic approach to adapting individual learning rates for model parameters during training
    • The approach is based on a simple idea:
      • If the partial derivative of the loss, with respect to a given model parameter, remains the same sign, then the learning rate should increase
      • If the partial derivative with respect to that parameter changes sign, then the learning rate should decrease
  • AdaGrad adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all their historical squared values
  • Empirically it has been found that the accumulation of squared gradients from the beginning of training can result in premature and excessive decrease in the effective learning rate
  • RMSProp modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average
  • Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks
  • The name "Adam" derives from the phrase "adaptive moments"
  • It is perhaps best seen as a variant on the combination of RMSProp and momentum with a few important distinctions:
    • Momentum is incorporated directly as an estimate of the first order moment (with exponential weighting) of the gradient
    • Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin
  • Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default
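
A minimal sketch of the Adam update with the bias corrections mentioned above; the hyperparameter values are the commonly suggested defaults, and `grad` is again a placeholder for a minibatch gradient function:

```python
import numpy as np

def adam(theta, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=5000):
    m = np.zeros_like(theta)   # first-moment (momentum-like) estimate
    v = np.zeros_like(theta)   # second-moment (uncentered) estimate
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # exponentially weighted first moment
        v = beta2 * v + (1 - beta2) * g * g    # exponentially weighted second moment
        m_hat = m / (1 - beta1 ** t)           # bias correction for initialization at zero
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimize f(theta) = ||theta||^2 / 2
print(adam(np.array([5.0, -3.0]), grad=lambda t: t))
```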

  • Batch normalization: normalize (subtract the mean and divide by the standard deviation) every entry of a matrix $\vector{H}$, which is a minibatch of activations of the layer to normalize (a minimal sketch appears at the end of this list)
  • In order to maintain the expressive power of the (neural) network, it is common to replace the batch of hidden unit activations $\vector{H}$ with $\vector{\gamma} \vector{H}' + \vector{\beta}$ rather than the simply normalized $\vector{H}'$
    • It's easier to learn $\vector{\gamma}$ and $\vector{\beta}$ than the complicated interaction between the parameters in the layers below $\vector{H}$
  • Coordinate descent: we optimize one coordinate at a time
  • Block coordinate descent: Minimizing with respect to a subset of the variables simultaneously
  • Coordinate descent is not a very good strategy when the value of one variable strongly influences the optimal value of another variable
  • Polyak averaging consists of averaging together several points in the trajectory through parameter space visited by an optimization algorithm
  • The basic idea is that the optimization algorithm may leap back and forth across a valley several times without ever visiting a point near the bottom of the valley. The average of all of the locations on either side should be close to the bottom of the valley though
  • Strategies that involve training simple models on simple tasks before confronting the challenge of training the desired model to perform the desired task are collectively known as pretraining
  • To improve optimization, the best strategy is not always to improve the optimization algorithm
  • Most of the advances in neural network learning over the past 30 years (1986-2016) have been obtained by changing the model family rather than changing the optimization procedure
  • Modern neural nets have been designed so that their local gradient information corresponds reasonably well to moving toward a distant solution
  • Many of the challenges in optimization arise from the global structure of the cost function and cannot be resolved merely by making better estimates of local update directions
  • The predominant strategy for overcoming this problem is to attempt to initialize the parameters in a region that is connected to the solution by a short path through parameter space that local descent can discover
  • Continuation methods are a family of strategies that can make optimization easier by choosing initial points to ensure that local optimization spends most of its time in well-behaved regions of space
  • The idea behind continuation methods is to construct a series of objective functions over the same parameters
  • To solve complex cost functions, continuation methods would construct easier cost functions by "blurring" the original cost functions
  • This blurring preserves enough information about the location of a global minimum that we can find the global minimum by solving progressively less blurred versions of the problem
  • This approach can break down in three different ways
    • It might successfully define a series of cost functions where the first is convex and the optimum tracks from one function to the next arriving at the global minimum, but it might require so many incremental cost functions that the cost of the entire procedure remains high
    • The function might not become convex, no matter how much it is blurred
    • The function may become convex as a result of blurring, but the minimum of this blurred function may track to a local rather than a global minimum of the original cost function
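
A minimal sketch of the batch-normalization step described near the top of this list: normalize each unit over the minibatch, then rescale and shift with the learnable $\vector{\gamma}$ and $\vector{\beta}$:

```python
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    # H: minibatch of activations, one row per example, one column per unit
    mu = H.mean(axis=0)                    # per-unit mean over the minibatch
    sigma = H.std(axis=0)                  # per-unit standard deviation
    H_norm = (H - mu) / (sigma + eps)      # normalized activations H'
    return gamma * H_norm + beta           # restore expressive power: gamma * H' + beta

H = np.random.randn(16, 4) * 3.0 + 10.0
gamma, beta = np.ones(4), np.zeros(4)
out = batch_norm(H, gamma, beta)
print(out.mean(axis=0), out.std(axis=0))   # ~0 mean and ~1 std per unit
```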

  • Convolutional networks are a specialized kind of neural networks for processing data that has a known, grid-like topology
  • Employs a mathematical operation called convolution
    • Convolution is a specialized kind of linear operation

  • Continuous convolution

    $$ s(t) = (x * w)(t) = \int x(a)w(t - a)\,da$$

  • x is often referred as the input and w as the kernel
  • The output is sometimes referred to as the feature map
  • Discrete convolution

    $$ s(t) = (x * w)(t) = \sum_{a = -\infty}^{\infty} x(a)w(t - a)$$

  • Cross-correlation is also known as convolution in many machine learning libraries
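
A minimal NumPy illustration of the difference: discrete convolution flips the kernel, while cross-correlation (what most deep learning libraries implement and call "convolution") does not:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input
w = np.array([1.0, 0.0, -1.0])            # kernel

# Discrete convolution: s(t) = sum_a x(a) w(t - a); the kernel is flipped
print(np.convolve(x, w, mode="valid"))    # [2. 2. 2.]

# Cross-correlation: the same sliding dot product without flipping the kernel
print(np.correlate(x, w, mode="valid"))   # [-2. -2. -2.]
```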

  • Convolution leverages three important ideas that can help improve a machine learning system
    • Sparse interactions: Not every output unit interacts with every input unit
    • Parameter sharing: The same parameter is used by more than one function in a model
    • Equivariant representations: If the input changes, the output changes in the same way

  • A typical layer of a convolutional network consists of three stages:
    • The layer performs several convolutions in parallel to produce a set of linear activations
    • Each linear activation is run through a nonlinear activation function, such as the rectified linear activation function (also known as the detector stage)
    • A pooling function is used to modify the output of the layer further
  • Replaces the output of the net at a certain location with a summary statistic of the nearby outputs
  • Pooling helps to make the representation become approximately invariant to small translations of the input
  • For many tasks, pooling is essential for handling inputs of varying size

  • An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data gives to those values
  • We can think of the use of convolution as introducing an infinitely strong prior probability distribution over the parameters (weights) of a layer
  • Convolution and pooling can cause underfitting

  • Unshared convolution (locally connected layer): a similar operation to discrete convolution with a small kernel, but without sharing parameters across locations

  • Tiled convolution: a compromise between a convolutional layer and a locally connected layer

  • Using convolution to process variable-sized inputs only makes sense for inputs that have variable size because they contain varying amounts of observations of the same kind
  • Convolution does not make sense if the input has variable size because it can optionally include different kinds of observations

  • Typically, the most expensive part of convolutional network training is learning the features
  • One way to reduce the cost of convolutional network training is to use features that are not trained in a supervised fashion
  • Three strategies for obtaining convolutional kernels without supervised training:
    • Initialize them randomly
    • Design them by hand
    • Learn the kernels with an unsupervised criterion (e.g., k-means clustering)

  • Recurrent neural networks are a family of neural networks for processing sequential data
  • Parameter sharing makes it possible to extend and apply the model to examples of different forms (different length) and generalize across them
  • Each member of the output is a function of the previous members of the output
  • Each member of the output is produced using the same update rule applied to the previous outputs

  • Unfolding is the application of a recurrent function to itself in order to obtain a new function that does not involve recurrence
  • Unfolding introduces two major advantages:
    • Regardless of sequence length, the learned model always has the same input size, because it is specified in terms of transition from one state to another state, rather than specified in terms of variable-length history of states
    • It is possible to use the same transition function f with the same parameters at every time step
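
A minimal NumPy sketch of an unfolded recurrent state update of the form $\vector{h}^{(t)} = \tanh(\vector{W}\vector{h}^{(t-1)} + \vector{U}\vector{x}^{(t)} + \vector{b})$ (a common parameterization, used here only for illustration), applying the same transition function with the same parameters at every time step:

```python
import numpy as np

def rnn_forward(x_seq, W, U, b, h0):
    # x_seq: (T, input_dim); the same W, U, b are reused at every time step
    h = h0
    states = []
    for x_t in x_seq:
        h = np.tanh(W @ h + U @ x_t + b)   # transition from one state to the next
        states.append(h)
    return np.stack(states)                # (T, hidden_dim)

hidden, inputs, T = 8, 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden, hidden)) * 0.1
U = rng.normal(size=(hidden, inputs)) * 0.1
b, h0 = np.zeros(hidden), np.zeros(hidden)
print(rnn_forward(rng.normal(size=(T, inputs)), W, U, b, h0).shape)   # (5, 8)
```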

  • Important design patterns for recurrent neural networks:
    • Recurrent networks that produce an output at each time step and have recurrent connections between units
    • Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step
    • Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output
  • The advantage of eliminating hidden-to-hidden recurrence is that all the time steps are decoupled for any loss function based on comparing the prediction at time t to the training target at time t
    • Training can thus be parallelized, with the gradient for each step t computed in isolation
  • Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing
  • Teacher forcing is a procedure in which during training the model receives the ground truth output $\vector{y}^{(t)}$ as input at time t+1
  • To compute the gradient through a recurrent neural network, simply apply the generalized back-propagation algorithm to the unrolled computational graph
  • It is difficult to predict missing values in the middle of a sequence
  • The price recurrent networks pay for their reduced number of parameters is that optimizing the parameters may be difficult
  • It relies on the assumption that the same parameters can be used for different time steps
  • The assumption is that the conditional probability distribution over the variables at time t+1 given the variables at time t is stationary

  • In many applications we want to output a prediction $\vector{y}^{(t)}$ which may depend on the whole input sequence
  • Bidirectional RNNs combine an RNN that moves forward through time beginning from the start of the sequence with another RNN that moves backward through time beginning from the end of the sequence
  • Compared to a convolutional network, RNNs applied to images are typically more expensive but allow for long-range lateral interactions between features in the same feature map

  • We often call the input to the RNN the "context"
  • An encoder or reader or input RNN processes the input sequence
  • The encoder emits the context C, usually a simple function of its final hidden state
  • A decoder or writer or output RNN is conditioned on that fixed-length vector to generate the output sequence
  • There is no constraint that the encoder must have the same size of hidden layer as the decoder
  • One clear limitation of this architecture is when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence

  • Recursive neural networks represent a generalization of recurrent networks with a different kind of computational graph, which is structured as a deep tree rather than the chain-like structure of RNNs
  • One clear advantage of recursive nets over recurrent nets is that for a sequence of the same length $\tau$, the depth can be drastically reduced from $\tau$ to $O(\log \tau)$

  • The basic problem is that gradients propagated over many stages tend to either vanish or explode
  • The difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions compared to short-term ones
  • In order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish
  • The problem of learning long-term dependencies remains one of the main challenges in deep learning

  • Gated RNNs, such as the LSTM and the GRU, are based on the idea of creating paths through time that have derivatives that neither vanish nor explode
  • The LSTM is composed of an input, an input gate, a forget gate, an output gate and a state (the LSTM "cell")
  • The main difference between the GRU and the LSTM is that in the GRU a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit
  • A crucial ingredient is the forget gate
  • Adding a bias of 1 to the LSTM forget gate makes the LSTM as strong as the best of the explored architectural variants

  • Neural networks excel at storing implicit knowledge. However, they struggle to memorize facts
  • Memory networks include a set of memory cells that can be accessed via an addressing mechanism
  • Memory networks originally required a supervision signal instructing them how to use their memory cell
  • The neural Turing machine is able to learn to read from and to write arbitrary content to memory cells without explicit supervision about which actions to undertake, and allowed end-to-end training without this supervision signal, via the use of a content-based soft attention mechanism
  • Each memory cell can be thought of as an extension of the memory cells in LSTMs or GRUs. The difference is that the network outputs an internal state that chooses which cell to read from or write to, just as memory accesses in a digital computer read from or write to a specific address
  • It is difficult to optimize functions that produce exact, integer addresses. To alleviate this problem, NTMs actually read from or write to many memory cells simultaneously
    • To read, they take a weighted average of many cells
    • To write, they modify multiple cells by different amounts
  • There are two reasons to increase the size of the memory cell:
    • We have increased the cost of accessing a memory cell (accessing implies reading a number of cells)
    • Vector-valued memory cells allow for content-based addressing, where the weight used to read or to write from a cell is a function of that cell

  • Determine your goals - what error metric to use, and your target value for this error metric
  • Establish a working end-to-end pipeline as soon as possible, including the estimation of the appropriate performance metrics
  • Instrument the system well to determine bottlenecks in performance. Diagnose which components are performing worse than expected and whether it is due to overfitting, underfitting, or a defect in the data or software
  • Repeatedly make incremental changes such as gathering new data, adjusting hyperparameters, or changing algorithms, based on specific findings from your instrumentation

  • The Bayes error defines the minimum error rate that you can hope to achieve, even if you have infinite training data and can recover the true probability distribution. This is because your input features may not contain complete information about the output variable, or because the system might be intrinsically stochastic
  • Precision: the fraction of detections reported by the model that were correct
  • Recall: the fraction of true events that were detected
  • PR curve: precision on the y-axis and recall on the x-axis
  • F-score (precision p, recall r)

    $$ F = \frac{2pr}{p+r}$$

  • Coverage: the fraction of examples for which the machine learning system is able to produce a response
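
A minimal sketch computing precision, recall, and the F-score from detection counts:

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    p = true_positives / (true_positives + false_positives)   # fraction of detections that were correct
    r = true_positives / (true_positives + false_negatives)   # fraction of true events that were detected
    f = 2 * p * r / (p + r)                                    # F = 2pr / (p + r)
    return p, r, f

print(precision_recall_f(true_positives=80, false_positives=20, false_negatives=40))
# (0.8, 0.666..., 0.727...)
```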

  • First, choose the general category of model based on the structure of your data
  • A reasonable choice of optimization algorithm is SGD with momentum with a decaying learning rate
    • Popular decay schemes include decaying linearly until reaching a fixed minimum learning rate, decaying exponentially, or decreasing the learning rate by a factor of 2-10 each time validation error plateaus
  • Another very reasonable alternative is Adam
  • Batch normalization can have a dramatic effect on optimization performance, especially for convolutional networks and networks with sigmoidal nonlinearities
  • Unless your training data set contains tens of millions of examples or more, you should include some mild forms of regularization from the start
  • Early stopping should be used almost universally
  • Dropout is an excellent regularizer that is easy to implement and compatible with many models and training algorithms
  • Batch normalization also sometimes reduces generalization error and allows dropout to be omitted, due to the noise in the estimate of the statistics used to normalize each variable

  • It is often much better to gather more data than to improve the learning algorithm
  • First, determine whether the performance on the training set is acceptable
    • If performance on the training set is poor, the learning algorithm is not using the training data that is already available, so there is no reason to gather more data
    • Try increasing the size of the model by adding more layers or adding more hidden units to each layer
    • Try improving the learning rate of the algorithm, for example by tuning the learning rate hyperparameters
    • If large models and carefully tuned optimization algorithms do not work well, then the problem might be the quality of the training data
      • The data may be too noisy or may not include the right inputs needed to predict the desired outputs
  • Then measure the performance on a test set
    • If the performance on the test set is also acceptable, then there is nothing left to be done
    • If test set performance is much worse than training set performance, then gathering more data is one of the most effective solutions
    • A simple alternative to gathering more data is to reduce the size of the model or improve regularization, by adjusting hyperparameters such as weight decay coefficients, or by adding regularization strategies such as dropout
    • When deciding whether to gather more data, it is also necessary to decide how much to gather. It is helpful to plot curves showing the relationship between training set size and generalization error. By extrapolating such curves, one can predict how much additional training data would be needed to achieve a certain level of performance
    • If gathering much more data is not feasible, the only other way to improve generalization error is to improve the learning algorithm itself

  • There are two basic approaches to choosing hyperparameters:
    • Choosing them manually: requires understanding what the hyperparameters do and how machine learning models achieve good generalization
    • Choosing them automatically: greatly reduce the need to understand these ideas, but they are often much more computationally costly
  • The goal of manual hyperparameter search is usually to find the lowest generalization error subject to some runtime and memory budget
  • The primary goal of manual hyperparameter search is to adjust the effective capacity of the model to match the complexity of the task
  • Effective complexity is constrained by three factors:
    • the representational capacity of the model
    • the ability of the learning algorithm to successfully minimize the cost function used to train the model
    • the degree to which the cost function and training procedure regularize the model
  • The generalization error typically follows a U-shaped curve when plotted as a function of one of the hyperparameters
  • Low capacity, high generalization error because training error is high -> underfitting
  • High capacity, high generalization error because the gap between training and test error is high -> overfitting
  • Somewhere in the middle lies the optimal model capacity, which achieves the lowest possible generalization error
  • Not every hyperparameter will be able to explore the entire U-shaped curve
    • Some are discrete, some are binary (thus only a few points along the curve are visited)
  • The learning rate is perhaps the most important hyperparameter
  • If your error on the training set is higher than your target error rate, you have no choice but to increase capacity
  • If your error on the test set is higher than your target error rate, you can take two kinds of actions
    • The goal is to reduce the gap between training error and test error without increasing training error faster than the gap decreases
    • To reduce the gap, change regularization hyperparameters to reduce effective model capacity, such as by adding dropout or weight decay
  • We are trying to find a value of the hyperparameters that optimizes an objective function, such as validation error, sometimes under constraints (such as a budget for training time, memory or recognition time)
  • Hyperparameter optimization algorithms often have their own hyperparameters, such as the range of values that should be explored for each of the learning algorithm's hyperparameters
  • The grid search algorithm trains a model for every joint specification of hyperparameter values in the Cartesian product of the set of values for each individual hyperparameter
  • The experiment that yields the best validation set error is then chosen as having found the best hyperparameters
  • The obvious problem with grid search is that its computational cost grows exponentially with the number of hyperparameters
    • If there are m hyperparameters, each taking at most n values, then the number of training and evaluation trials required grows as $O(n^m)$
  • Random search is more convenient to use and converges much faster to good values of the hyperparameters than grid search
  • First we define a marginal distribution for each hyperparameter, then sample based on the given distribution
  • A random search can be exponentially more efficient than a grid search, when there are several hyperparameters that do not strongly affect the performance measure
  • We can build a model of the validation set error, then propose new hyperparameter guesses by performing optimization within this model
  • Contemporary approaches to hyperparameter optimization include Spearmint, TPE and SMAC
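
A minimal sketch of random hyperparameter search as described above: define a marginal distribution for each hyperparameter (log-uniform for the learning rate and weight decay here), sample configurations independently, and keep the one with the lowest validation error; `train_and_validate` is a hypothetical placeholder:

```python
import numpy as np

def random_search(train_and_validate, n_trials=20, seed=0):
    rng = np.random.default_rng(seed)
    best_error, best_config = float("inf"), None
    for _ in range(n_trials):
        config = {
            # Sample each hyperparameter from its own marginal distribution
            "learning_rate": 10 ** rng.uniform(-5, -1),   # log-uniform
            "weight_decay": 10 ** rng.uniform(-6, -2),    # log-uniform
            "n_hidden": int(rng.integers(64, 1025)),
        }
        err = train_and_validate(config)   # validation error for this configuration
        if err < best_error:
            best_error, best_config = err, config
    return best_config, best_error

# Toy usage with a fake objective that prefers a learning rate near 1e-3
fake = lambda c: abs(np.log10(c["learning_rate"]) + 3)
print(random_search(fake))
```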

  • We do not know a priori what the intended behavior of the algorithm is
  • Most machine learning models have multiple parts that are each adaptive
  • Most debugging strategies for neural nets are designed to get around one or both of these two difficulties
    • Either we design a case that is so simple that the correct behavior actually can be predicted, or we design a test that exercises one part of the neural network implementation in isolation
  • Some important debugging tests:
    • Visualize the model in action
    • Visualize the worst mistakes
    • Reasoning about software using train and test error
    • Fit a tiny dataset
    • Compare back-propagated derivatives to numerical derivatives
    • Monitor histograms of activations and gradient

  • A large population of neurons or features acting together can exhibit intelligent behavior
  • One of the key factors responsible for the improvement in neural networks' accuracy and in the complexity of the tasks they can solve between the 1980s and today is the dramatic increase in the size of the networks we use
  • Asynchronous stochastic gradient descent
    • Several processor cores share the memory representing the parameters
    • Each core reads parameters without a lock, then computes a gradient, then increments the parameters without a lock
  • Distributed asynchronous gradient descent remains the primary strategy for training large deep networks and is used by most major deep learning groups in industry
  • A key strategy for reducing the cost of inference is model compression
  • The basic idea of model compression is to replace the original, expensive model with a smaller model that requires less memory and runtime to store and evaluate
  • Model compression is applicable when the size of the original model is driven primarily by a need to prevent overfitting
  • We can generate a training set containing infinitely many examples, simply by applying f to randomly sampled points $\vector{x}$
  • We train the new, smaller, model to match $f(\vector{x})$ on these points. In order to most efficiently use the capacity of the new, small model, it is best to sample the new $\vector{x}$ points from a distribution resembling the actual test inputs that will be supplied to the model later
  • One strategy for accelerating data processing systems in general is to build systems that have dynamic structure in the graph describing the computation needed to process an input
  • Data processing systems can dynamically determine which subset of many neural networks should be run on a given input
  • This form of dynamic structure inside neural networks is sometimes called conditional computation
  • A venerable strategy for accelerating inference in a classifier is to use a cascade of classifiers
  • A simple way to accomplish the union of deep learning and dynamic structure is to train a decision tree in which each node uses a neural network to make the splitting decision
  • One can use a neural network, called the gater to select which one out of several expert networks will be used to compute the output, given the current input
  • The first version of this idea is called the mixture of experts
  • One major obstacle to using dynamically structured systems is the decreased degree of parallelism that results from the system following different code branches for different inputs

  • Formatting images to have the same scale is the only kind of preprocessing that is strictly necessary
  • Many computer vision architectures require images of a standard size, so images must be cropped or scaled to fit that size
  • One of the most obvious sources of variation that can be safely removed from many tasks is the amount of contrast in the image
  • Contrast refers to the magnitude of the difference between the bright and the dark pixels in an image
  • Global contrast normalization (GCN) aims to prevent images from having varying amounts of contrast by subtracting the mean from each image, then rescaling it so that the standard deviation across its pixels is equal to some constant $s$
  • Local contrast normalization ensures that the contrast is normalized across each small window, rather than over the image as a whole
  • One modifies each pixel by subtracting a mean of nearby pixels and dividing by a standard deviation of nearby pixels
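  • A minimal sketch of global contrast normalization as described above: subtract the per-image mean, then rescale so the per-image standard deviation equals a constant $s$ (a small epsilon guards near-constant images)

```python
# Global contrast normalization (GCN) sketch.
import numpy as np

def global_contrast_normalize(images, s=1.0, eps=1e-8):
    # images: array of shape (n_images, height, width) or (n_images, n_pixels)
    flat = images.reshape(len(images), -1).astype(float)
    flat = flat - flat.mean(axis=1, keepdims=True)        # zero mean per image
    std = flat.std(axis=1, keepdims=True)
    flat = s * flat / np.maximum(std, eps)                # constant contrast s
    return flat.reshape(images.shape)

rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(2, 8, 8))              # two toy 8x8 "images"
normalized = global_contrast_normalize(batch)
print(normalized.std(axis=(1, 2)))                        # ~[1. 1.]
```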

  • The task of speech recognition is to map an acoustic signal containing a spoken natural language utterance into the corresponding sequence of words intended by the speaker

  • An n-gram is a sequence of n tokens
  • Models based on n-grams define the conditional probability of the n-th token given the preceding n - 1 tokens
  • A fundamental limitation of maximum likelihood for n-gram models is that $P_n$ as estimated from training set counts is very likely to be zero in many cases
  • Two catastrophic outcomes
    • $P_{n-1}$ is zero, the ratio is undefined
    • $P_{n-1}$ is non-zero but $P_n$ is zero, the test log-likelihood is $-\infty$
  • To avoid such catastrophic outcomes, most n-gram models employ some form of smoothing
    • One basic technique consists of adding non-zero probability mass to all the possible next symbol values
    • Another idea is to form a mixture model containing higher-order and lower-order n-gram models, with the higher-order models providing more capacity and the lower-order models being more likely to avoid counts of zero
  • One way to view a classical n-gram model is that it is performing nearest-neighbor lookup
  • Class-based language models introduce the notion of word categories and then share statistical strength between words that are in the same category
  • Neural language models are a class of language model designed to overcome the curse of dimensionality for modeling natural language sequences by using a distributed representation of words
  • These word representations are sometimes called word embeddings
  • We view the raw symbols as points in a space of dimension equal to the vocabulary size. The word representations embed those points in a feature space of lower dimension
  • In the original space, every word is represented by a one-hot vector, so every pair of words is at Euclidean distance $\sqrt{2}$ from each other
  • In the embedding space, words that frequently appear in similar contexts are close to each other
  • Because the vocabulary size is generally very large, computing the softmax that defines the probability distribution over the next word is very expensive (the output matrix multiplication over the whole vocabulary is costly)
  • Early neural language models dealt with the high computational cost by limiting the vocabulary size to 10,000 or 20,000 words
  • The vocabulary $\mathbb{V}$ is split into two lists: a shortlist $\mathbb{L}$ of most frequent words and a tail $\mathbb{T} = \mathbb{V} \setminus \mathbb{L}$ of more rare words
  • A disadvantage of the shortlist approach is that the potential generalization advantage of neural language models is limited to the most frequent words, where, arguably, it is the least useful
  • One can think of this hierarchy as building categories of words, then categories of categories of words, etc.
  • The probability of choosing a word is given by the product of the probabilities of choosing the branch leading to that word at every node on a path from the root of the tree to the leaf containing the word
  • An important advantage of the hierarchical softmax is that it brings computational benefits both at training time and at test time, if at test time we want to compute the probability of specific words
  • Another important operation is selecting the most likely word in a given context. Unfortunately the tree structure does not provide an efficient and exact solution to this problem
  • A disadvantage is that in practice the hierarchical softmax tends to give worse test results than sampling-based methods
  • Instead of sampling from the model, one can sample from another distribution, called the proposal distribution, and use appropriate weights to correct for the bias introduced by sampling from the wrong distribution
  • If we use hash tables or trees to access the count (of an n-gram), the computation used for n-grams is almost independent of capacity
  • In comparison, doubling a neural network's number of parameters typically also roughly doubles its computation time
  • One easy way to add capacity is to combine both approaches in an ensemble consisting of a neural language model and an n-gram language model
  • Machine translation is the task of reading a sentence in one natural language and emitting a sentence with the equivalent meaning in another language
  • We can think of an attention-based system as having three components:
    • A process that "reads" raw data (such as source words in a source sentence), and converts them into distributed representations, with one feature vector associated with each word position
    • A list of feature vectors storing the output of the reader. This can be understood as a "memory" containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them
    • A process that "exploits" the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight)

  • Two major types of applications: online advertising and item recommendations
  • Both rely on predicting the association between a user and an item, either to predict the probability of some action or the expected gain if an ad is shown or a recommendation is made regarding that product to that user
  • Often, this association problem is handled like a supervised learning problem: given some information about the item and about the user, predict the proxy of interest
  • This often ends up being either a regression problem or a probabilistic classification problem
  • Early work on recommender systems relied on minimal information as inputs for these predictions: the user ID and the item ID
  • There is a basic limitation of collaborative filtering systems: when a new item or a new user is introduced, its lack of rating history means that there is no way to evaluate its similarity with other items or users, or the degree of association between, say, that new user and existing items. This is called the problem of cold-start recommendations
  • A general way of solving the cold-start recommendation problem is to introduce extra information about the individual user and items
  • Systems that use such information are called content-based recommender systems
  • Many recommendation problems are most accurately described theoretically as contextual bandits
  • The term contextual bandits refers to the case where the action is taken in the context of some input variable that can inform the decision (e.g., the user identity)
  • In mathematics, a binary relation is a set of ordered pairs of objects
  • In the context of AI, we think of a relation as a sentence in a syntactically simple and highly structured language
    • (subject, verb, object)
  • We can define an attribute, a concept analogous to a relation, but taking only one argument
    • (subject, attribute)
  • Early work used vectors for entities and matrices for relations, with the idea that a relation acts like an operator on entities
  • Alternatively, relations can be treated like any other entity, allowing us to make statements about relations, with more flexibility placed in the machinery that combines entities and relations to model their joint distribution
  • A practical short-term application of such models is link prediction: predicting missing arcs in the knowledge graph (a toy scoring sketch follows this list)
  • Knowledge of relations combined with a reasoning process and understanding of natural language could allow us to build a general question answering system
  • A general question answering system must be able to process input information and remember important facts, organized in a way that enables it to retrieve and reason about them later
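  • A hypothetical sketch of link prediction with the "vectors for entities, matrices for relations" idea: score a triple $(s, r, o)$ as $\vector{s}^\top \vector{R} \vector{o}$ and rank candidate objects; the entities, relation and embeddings below are made-up placeholders rather than trained values

```python
# Toy knowledge-graph link prediction with bilinear scoring. A real system
# would learn the embeddings from the known facts; here they are random.
import numpy as np

rng = np.random.default_rng(0)
entities = ["paris", "france", "tokyo", "japan"]
relations = ["capital_of"]
d = 8

entity_vec = {e: rng.standard_normal(d) for e in entities}
relation_mat = {r: rng.standard_normal((d, d)) for r in relations}

def score(subj, rel, obj):
    # The relation acts like an operator on the entity embeddings.
    return entity_vec[subj] @ relation_mat[rel] @ entity_vec[obj]

# Link prediction: which object best completes ("paris", "capital_of", ?)
candidates = [e for e in entities if e != "paris"]
ranked = sorted(candidates, key=lambda o: score("paris", "capital_of", o), reverse=True)
print(ranked)   # with trained embeddings, "france" should rank first
```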

  • A linear factor model is defined by the use of a stochastic, linear decoder function that generates $\vector{x}$ by adding noise to a linear transformation of $\vector{h}$
  • These models are interesting because they allow us to discover explanatory factors that have a simple joint distribution
  • First, we sample the explanatory factors $\vector{h}$ from a distribution
  • Next we sample the real valued observable variables given these factors

    $$ \vector{x} = \vector{W}\vector{h} + \vector{b} + \text{noise}$$

  • PCA (principal component analysis), factor analysis and other linear factor models are special cases of this framework and differ only in the choices made for the noise distribution and for the model's prior over the latent variables $\vector{h}$ before observing $\vector{x}$
  • In factor analysis, the latent variable prior is just the unit variance Gaussian

    $$ \vector{h} \sim \mathcal{N}(\vector{h}; \vector{0}, \vector{I})$$

  • The role of the latent variables is to capture the dependencies between the different observed variables
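  • A minimal sketch of this generative process with a factor-analysis-style prior: sample $\vector{h} \sim \mathcal{N}(\vector{0}, \vector{I})$, then emit $\vector{x} = \vector{W}\vector{h} + \vector{b} + \text{noise}$ (all parameter values below are arbitrary)

```python
# Linear factor model sampling sketch: latent factors -> noisy linear emission.
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x = 2, 5

W = rng.standard_normal((d_x, d_h))       # factor loadings
b = rng.standard_normal(d_x)              # bias
noise_std = 0.1 * np.ones(d_x)            # per-dimension noise scale

def sample_x(n):
    h = rng.standard_normal((n, d_h))                     # h ~ N(0, I)
    noise = noise_std * rng.standard_normal((n, d_x))
    return h @ W.T + b + noise                            # x = W h + b + noise

print(sample_x(3).round(2))
```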

  • Independent component analysis (ICA) is an approach to modeling linear factors that seeks to separate an observed signal into many underlying signals that are scaled and added together to form the observed data
  • These signals are intended to be fully independent, rather than merely decorrelated from each other

  • Slow Feature Analysis (SFA) is a linear factor model that uses information from time signals to learn invariant features
  • The slowness principle: The important characteristics of scenes change very slowly compared to the individual measurements that make up a description of a scene
  • SFA seems to be a reasonably biologically plausible model
  • A major advantage of SFA is that it is possible to theoretically predict which features SFA will learn

  • Training sparse coding with maximum likelihood is intractable
  • The training alternates between encoding the data and training the decoder to better reconstruct the data given the encoding

  • An autoencoder is a neural network that is trained to attempt to copy its input to its output
  • Internally, it has a hidden layer $\vector{h}$ that describes a code used to represent the input
  • The network may be viewed as consisting of two parts
    • An encoder function $\vector{h} = f(\vector{x})$
    • A decoder that produces a reconstruction $\vector{r} = g(\vector{h})$
  • Autoencoders are designed to be unable to learn to copy perfectly
  • Autoencoders may be thought of as being a special case of feedforward networks, and may be trained with all the same techniques, typically minibatch gradient descent following gradients computed by back-propagation
  • Autoencoders may also be trained using recirculation, a learning algorithm based on comparing the activations of the network on the original input to the activation on the reconstructed input
  • Recirculation is regarded as more biologically plausible than back-propagation, but is rarely used for machine learning applications

  • We hope that training the autoencoder to perform the input copying task will result in $\vector{h}$ taking on useful properties
  • One way to obtain useful features from the autoencoder is to constrain $\vector{h}$ to have smaller dimension than $\vector{x}$
  • An autoencoder whose code dimension is less than the input dimension is called undercomplete
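  • A sketch of an undercomplete autoencoder under a simplifying assumption: with a linear encoder/decoder and squared error, the optimal solution spans the top principal components, so the example below uses that closed form instead of gradient descent

```python
# Undercomplete linear autoencoder sketch: the code h has lower dimension than
# the input, so the model cannot copy x exactly and keeps only the directions
# that matter. The encoder/decoder are taken from the PCA closed form.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 2                               # examples, input dim, code dim

X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))   # ~2-dim data
X += 0.05 * rng.standard_normal((n, d))
mean = X.mean(axis=0)

_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:k].T                                       # (d, k) projection

encode = lambda x: (x - mean) @ W                  # h = f(x), k-dimensional code
decode = lambda h: h @ W.T + mean                  # r = g(h), reconstruction

recon = decode(encode(X))
print("mean squared reconstruction error:", np.mean((X - recon) ** 2))
```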

  • Rather than limiting the model capacity by keeping the encoder and decoder shallow and the code size small, regularized autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output
  • These other properties include
    • Sparsity of the representation
    • Smallness of the derivative of the representation
    • Robustness to noise or to missing inputs
  • A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution even if the model capacity is great enough to learn a trivial identity function
  • Nearly any generative model with latent variables and equipped with an inference procedure (for computing latent representations given input) may be viewed as a particular form of autoencoder
  • A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty $\Omega(\vector{h})$ on the code layer $\vector{h}$, in addition to the reconstruction error
  • Sparse autoencoders are typically used to learn features for another task such as classification

  • Rather than adding a penalty $\Omega$ to the cost function, we can obtain an autoencoder that learns something useful by changing the reconstruction error term of the cost function
  • A denoising autoencoder (DAE) instead minimizes the reconstruction loss below, where $\vector{\tilde{x}}$ is a copy of $\vector{x}$ that has been corrupted by some form of noise (a corruption sketch follows this list)

    $$ L(\vector{x}, g(f(\vector{\tilde{x}})))$$

  • Denoising autoencoders must undo this corruption rather than simply copying their input
  • Another strategy for regularizing an autoencoder is to use a penalty $\Omega$ as in sparse autoencoders, but with a different form of $\Omega$
  • This forces the model to learn a function that does not change much when $\vector{x}$ changes slightly
  • An autoencoder regularized in this way is called a contractive autoencoder (CAE)
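  • A minimal sketch of the denoising-autoencoder training signal referenced above: corrupt the input, encode and decode the corrupted copy, and measure the loss against the clean input (the encoder/decoder here are untrained random maps, only to make the quantities concrete)

```python
# Denoising-autoencoder loss sketch: L(x, g(f(x_tilde))) with Gaussian corruption.
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 4
W_enc = 0.1 * rng.standard_normal((d, k))
W_dec = 0.1 * rng.standard_normal((k, d))

f = lambda x: np.tanh(x @ W_enc)                       # encoder h = f(x_tilde)
g = lambda h: h @ W_dec                                # decoder r = g(h)

x = rng.standard_normal((32, d))                       # clean minibatch
x_tilde = x + 0.3 * rng.standard_normal(x.shape)       # corrupted copy of x
loss = np.mean((x - g(f(x_tilde))) ** 2)               # loss against the clean x
print(loss)
```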

  • One major advantage of non-trivial depth (in autoencoders) is that the universal approximation theorem guarantees that a feedforward neural network with at least one hidden layer can represent an approximation of any function to an arbitrary degree of accuracy, provided that it has enough hidden units
  • Depth can exponentially reduce the computational cost of representing some functions
  • Depth can also exponentially decrease the amount of training data needed to learn some functions
  • A common strategy for training a deep autoencoder is to greedily pretrain the deep architecture by training a stack of shallow autoencoders
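  • A sketch of that greedy strategy: pretrain a shallow autoencoder on the data, pretrain a second one on the first one's codes, then compose the two; for brevity each shallow stage below is a closed-form linear autoencoder, whereas in practice each stage would be a nonlinear autoencoder trained by gradient descent, optionally followed by joint fine-tuning of the whole stack

```python
# Greedy layer-wise pretraining sketch built from shallow (linear) autoencoders.
import numpy as np

def fit_linear_autoencoder(X, k):
    # Closed-form shallow autoencoder: project onto the top-k principal directions.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k].T
    encode = lambda x: (x - mean) @ W
    decode = lambda h: h @ W.T + mean
    return encode, decode

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 20))  # toy data

enc1, dec1 = fit_linear_autoencoder(X, k=8)        # stage 1: trained on the data
H1 = enc1(X)                                       # stage-1 codes
enc2, dec2 = fit_linear_autoencoder(H1, k=4)       # stage 2: trained on the codes

deep_encode = lambda x: enc2(enc1(x))              # composed deep encoder
deep_decode = lambda h: dec1(dec2(h))              # composed deep decoder
print(np.mean((X - deep_decode(deep_encode(X))) ** 2))
```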

  • Score matching is an alternative to maximum likelihood
  • It provides a consistent estimator of probability distributions based on encouraging the model to have the same score as the distribution at every training point $\vector{x}$
  • A very important property of DAEs is that their training criterion makes the autoencoder learn a vector field $(g(f(\vector{x})) - \vector{x})$ that estimates the score of the data distribution
  • Denoising training of a specific kind of autoencoder (sigmoidal hidden units, linear reconstruction units) using Gaussian noise and mean squared error as the reconstruction cost is equivalent to training a specific kind of undirected probabilistic model called an RBM with Gaussian visible units
  • One may want to use the autoencoder as a generative model and draw samples from this distribution

  • An important characteristic of a manifold is the set of its tangent planes
  • An autoencoder can afford to represent only the variations that are needed to reconstruct training examples
  • What is most commonly learned to characterize a manifold is a representation of the data points on (or near) the manifold. Such a representation for a particular example is also called its embedding
  • There is a fundamental difficulty with local non-parametric approaches to manifold learning: if the manifolds are not very smooth (they have many peaks and troughs and twists), one may need a very large number of training examples to cover each one of these variations, with no chance to generalize to unseen variations

  • The contractive autoencoder introduces an explicit regularizer on the code $\vector{h} = f(\vector{x})$, encouraging the derivatives of $f$ to be as small as possible
  • Denoising autoencoders make the reconstruction function resist small but finite-sized perturbations of the input, while contractive autoencoders make the feature extraction function resist infinitesimal perturbations of the input
  • The name contractive arises from the way that the CAE warps space

  • Predictive sparse decomposition (PSD) is a model that is a hybrid of sparse coding and parametric autoencoders
  • A parametric encoder is trained to predict the output of iterative inference
  • The model consists of an encoder $f(\vector{x})$ and a decoder $g(\vector{h})$ that are both parametric

  • Many information processing tasks can be very easy or very difficult depending on how the information is presented
  • A good representation is one that makes a subsequent learning task easier

  • Greedy layer-wise unsupervised pretraining relies on a single-layer representation learning algorithm such as an RBM, a single-layer autoencoder, a sparse coding model, or another model that learns latent representations
  • Each layer is pretrained using unsupervised learning, taking the output of the previous layer and producing as output a new representation of the data, whose distribution is hopefully simpler
  • It is a greedy algorithm, meaning that it optimizes each piece of the solution independently, one piece at a time, rather than jointly optimizing all pieces
  • It is called layer-wise because these independent pieces are the layers of the network
  • It is called unsupervised because each layer is trained with an unsupervised representation learning algorithm
  • Unsupervised pretraining combines two different ideas
    • It makes use of the idea that the choice of initial parameters for a deep neural network can have a significant regularizing effect on the model
    • It makes use of the more general idea that learning about the input distribution can help to learn about the mapping from inputs to outputs
  • Unsupervised pretraining is likely to be most useful when the function to be learned is extremely complicated
  • Neural networks that receive unsupervised pretraining consistently halt in the same region of function space, while neural networks without pretraining consistently halt in another region
  • Compared to other forms of unsupervised learning, unsupervised pretraining has the disadvantage that it operates with two separate training phases
  • Unsupervised pretraining does not offer a clear way to adjust the strength of the regularization arising from the unsupervised stage
  • Another disadvantage of having two separate training phases is that each phase has its own hyperparameters
  • The performance of the second phase usually cannot be predicted during the first phase, so there is a long delay between proposing hyperparameters for the first phase and being able to update them using feedback from the second phase
  • Today, unsupervised pretraining has been largely abandoned, except in the field of natural language processing, where the natural representation of words as one-hot vectors conveys no similarity information and where very large unlabeled sets are available
  • In that case, the advantage of pretraining is that one can pretrain once on a huge unlabeled set, learn a good representation, and then use this representation or fine-tune it for a supervised task for which the training set contains substantially fewer examples

  • Transfer learning and domain adaptation refer to the situation where what has been learned in one setting is exploited to improve generalization in another setting
  • In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in $P_1$ are relevant to the variations that need to be captured for learning $P_2$
  • In domain adaptation, the task (and the optimal input-to-output mapping) remains the same between each setting, but the input distribution is slightly different
  • A related problem is that of concept drift, which we can view as a form of transfer learning due to gradual changes in the data distribution over time
  • Both concept drift and transfer learning can be viewed as particular forms of multi-task learning
  • The core idea of representation learning is that the same representation may be useful in two different settings
  • Two extreme forms of transfer learning are one-shot learning and zero-shot learning (zero-data learning)
    • Only one labeled example of the transfer task is given for one-shot learning, while no labeled examples are given at all for the zero-shot learning task
  • One-shot learning is possible because the representation learns to cleanly separate the underlying classes
  • Zero-shot learning is similar to reading a large collection of text and then solving an object recognition problem: it may be possible to recognize a specific object class even without having seen an image of that object, if the text describes the object well enough

  • An important question about representation learning is "what makes one representation better than another?"
  • One hypothesis is that an ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another
  • In other approaches to representation learning, we have often been concerned with a representation that is easy to model - for example, one whose entries are sparse, or independent from each other

  • Distributed representations of concepts - representations composed of many elements that can be set separately from each other
  • Distributed representations are powerful because they can use $n$ features with $k$ values to describe $k^n$ different concepts
  • An important related concept that distinguishes a distributed representation from a symbolic one is that generalization arises due to shared attributes between different concepts
  • Some traditional non-distributed learning algorithms generalize only due to the smoothness assumption, which states that if $u \approx v$, then the target function $f$ to be learned has the property that $f(u) \approx f(v)$, in general
  • This assumption is clearly very useful, but it suffers from the curse of dimensionality: in order to learn a target function that increases and decreases many times in many different regions, we may need a number of examples that is at least as large as the number of distinguishable regions
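  • A tiny enumeration of the counting argument above, using four hypothetical binary features

```python
# n features with k values can distinguish k**n concepts; a purely symbolic
# (one-hot) representation would need one dedicated symbol per concept.
from itertools import product

features = {
    "has_wheels": [0, 1],
    "is_alive":   [0, 1],
    "can_fly":    [0, 1],
    "is_large":   [0, 1],
}

concepts = list(product(*features.values()))
print(len(concepts))      # 2**4 = 16 distinct concepts from only 4 binary features
print(concepts[:3])       # (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0)
```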

  • It has been proven in many different settings that organizing computation through the composition of many nonlinearities and a hierarchy of reused features can give an exponential boost to statistical efficiency, on top of the exponential boost given by using a distributed representation

  • A structured probabilistic model is a way of describing a probability distribution, using a graph to describe which random variables in the probability distribution interact with each other directly
  • Because the structure of the model is defined by a graph, these models are often also referred to as graphical models

  • Tasks of probabilistic models
    • Density estimation
    • Denoising
    • Missing value imputation
    • Sampling
  • Structured probabilistic models provide a formal framework for modeling only direct interactions between random variables. This allows the models to have significantly fewer parameters and therefore be estimated reliably from less data

  • Structured probabilistic models use graphs to represent interactions between random variables
    • Each node represents a random variable
    • Each edge represents a direct interaction
  • There are two categories of graphical models:
    • Models based on directed acyclic graphs
    • Models based on undirected graphs
  • Directed graphical models are also known as belief networks or Bayesian networks
  • The direction of the arrow indicates which variable's probability distribution is defined in terms of the other's
  • So long as each variable has few parents in the graph, the distribution can be represented with very few parameters
  • Undirected graphical models are also known as Markov random fields (MRFs) or Markov networks
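  • As a small worked example of why few parents means few parameters, consider three binary variables in a chain $a \rightarrow b \rightarrow c$, whose directed model factorizes as

    $$ p(a, b, c) = p(a)\,p(b \mid a)\,p(c \mid b)$$

  • A full joint table over three binary variables needs $2^3 - 1 = 7$ parameters, while the chain factorization needs only $1 + 2 + 2 = 5$; the saving grows exponentially as more variables are added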

  • One advantage of directed graphical models is that a simple and efficient procedure called ancestral sampling can produce a sample from the joint distribution represented by the model
  • The basic idea is to sort the variables $x_i$ in the graph into a topological ordering, so that for all $i$ and $j$, $j$ is greater than $i$ if $x_i$ is a parent of $x_j$. The variables can then be sampled in this order
  • One drawback from ancestral sampling is that it only applies to directed graphical models
  • Another drawback is that it does not support every conditional sampling operation
  • The conceptually simplest approach (to sampling undirected graphical models) is Gibbs sampling
  • Suppose we have a graphical model over an n-dimensional vector of random variables $\vector{x}$. We iteratively visit each variable $x_i$ and draw a sample conditioned on all of the other variables, from $p(x_i\ |\ x_{-i})$. Unfortunately, after we have made one pass through the graphical model and sampled all $n$ variables, we still do not have a fair sample from $p(\vector{x})$. Instead, we must repeat the process and resample all $n$ variables using the updated values of their neighbors
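  • A minimal sketch of Gibbs sampling on a toy model where the conditionals are known exactly: a bivariate Gaussian with correlation $\rho$, for which each conditional $p(x_i\ |\ x_{-i})$ is a one-dimensional Gaussian

```python
# Gibbs sampling sketch for a standard bivariate Gaussian with correlation rho:
# x1 | x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for x2 | x1.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
x1, x2 = 0.0, 0.0                      # arbitrary starting state
samples = []

for step in range(5000):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # resample x1 given x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # resample x2 given x1
    if step >= 500:                    # discard burn-in before collecting samples
        samples.append((x1, x2))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])    # should be close to rho = 0.8
```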

  • Structured probabilistic models allow us to dramatically reduce the cost of representing probability distributions, as well as the cost of learning and inference
  • Sampling is also accelerated in the case of directed models
  • The primary mechanism that allows all of these operations to use less runtime and memory is choosing not to model certain interactions

  • The RBM is also known as a harmonium
  • The RBM is not itself a deep model
  • It has a single layer of latent variables that may be used to learn a representation for the input
  • The canonical RBM is an energy-based model with binary visible and hidden units
  • Its energy function is (where $\vector{b}$, $\vector{c}$, and $\vector{W}$ are unconstrained, real-valued, learnable parameters)

    $$ E(\vector{v}, \vector{h}) = -\vector{b}^\top\vector{v} - \vector{c}^\top\vector{h} - \vector{v}^\top\vector{W}\vector{h}$$
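  • A small sketch built on this energy function: evaluate $E(\vector{v}, \vector{h})$ and the factorial conditional $P(h_j = 1\ |\ \vector{v}) = \sigma(c_j + \vector{v}^\top \vector{W}_{:,j})$ that it implies (the parameters below are random placeholders, not trained values)

```python
# Canonical binary RBM sketch: energy of a (v, h) pair and p(h | v).
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
b = rng.standard_normal(n_visible)          # visible biases
c = rng.standard_normal(n_hidden)           # hidden biases
W = rng.standard_normal((n_visible, n_hidden))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def energy(v, h):
    return -b @ v - c @ h - v @ W @ h

def p_h_given_v(v):
    return sigmoid(c + v @ W)               # factorial conditional over hidden units

v = rng.integers(0, 2, size=n_visible)
h = (p_h_given_v(v) > rng.random(n_hidden)).astype(int)   # one Gibbs half-step
print(energy(v, h), p_h_given_v(v).round(2))
```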

  • Randomized algorithms fall into two rough categories: Las Vegas algorithms and Monte Carlo algorithms
  • Las Vegas algorithms always return precisely the correct answer (or report that they failed)
  • These algorithms consume a random amount of resources, usually memory or time
  • Monte Carlo algorithms return answers with a random amount of error. The amount of error can typically be reduced by expending more resources
  • For any fixed computational budget, a Monte Carlo algorithm can provide an approximate answer

  • Sampling provides a flexible way to approximate many sums and integrals at reduced cost
  • Sometimes we use this to provide a significant speedup to a costly but tractable sum
  • In other cases, our learning algorithm requires us to approximate an intractable sum or integral
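  • A minimal sketch of such a Monte Carlo approximation: estimate $\mathbb{E}[x^2]$ for $x \sim \mathcal{N}(0, 1)$ (true value 1) by averaging over random samples

```python
# Monte Carlo estimate of an expectation: the error shrinks as samples increase.
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 1000, 100000):
    x = rng.standard_normal(n)
    print(n, np.mean(x ** 2))      # converges toward the true value 1.0
```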

  • The core idea of a Markov chain is to have a state $\vector{x}$ that begins as an arbitrary value. Over time, we randomly update $\vector{x}$ repeatedly. Eventually $\vector{x}$ becomes (very nearly) a fair sample from $p(\vector{x})$

  • Intractable inference problems in deep learning usually arise from interactions between latent variables in a structured graphical model

  • Many approaches to confronting the problem of difficult inference make use of the observation that exact inference can be described as an optimization problem

  • We define the Boltzmann machine over a d-dimensional binary random vector $\vector{x} \in \{0, 1\}^d$
  • The Boltzmann machine is an energy-based model, meaning we define the joint probability distribution using an energy function

    $$ P(\vector{x}) = \frac{\exp(-E(\vector{x}))}{Z}$$

  • where $E(\vector{x})$ is the energy function and $Z$ is the partition function that ensures that $\sum_\vector{x} P(\vector{x}) = 1$
  • The energy function of the Boltzmann machine is given by

    $$ E(\vector{x}) = -\vector{x}^\top\vector{U}\vector{x} - \vector{b}^\top\vector{x}$$

  • where $\vector{U}$ is the "weight" matrix of model parameters and $\vector{b}$ is the vector of bias parameters
  • A Boltzmann machine with hidden units is no longer limited to modeling linear relationships between variables
  • The Boltzmann machine becomes a universal approximator of probability mass functions over discrete variables

  • RBMs are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables

  • Deep belief networks are generative models with several layers of latent variables
  • The latent variables are typically binary, while the visible units may be binary or real
  • There are no intralayer connections
  • Usually, every unit in each layer is connected to every unit in each neighboring layer
  • The connections between the top two layers are undirected
  • The connections between all other layers are directed, with the arrows pointed toward the layer that is closest to the data
  • To train a deep belief network, one begins by training an RBM to maximize $\mathbb{E}_{\vector{v} \sim p_{data}} \log p(\vector{v})$ using contrastive divergence or stochastic maximum likelihood
  • Next, a second RBM is trained to approximately maximize

    $$ \mathbb{E}_{\vector{v} \sim p_{data}}\mathbb{E}_{\vector{h}^{(1)} \sim p^{(1)}(h^{(1)}\ |\ \vector{v})}\log p^{(2)}(\vector{h}^{(1)})$$

  • where $p^{(1)}$ is the probability distribution represented by the first RBM and $p^{(2)}$ is the probability distribution represented by the second RBM
  • In other words, the second RBM is trained to model the distribution defined by sampling the hidden units of the first RBM, when the first RBM is driven by the data

  • Unlike the deep belief network (DBN), the deep Boltzmann machine (DBM) is an entirely undirected model
  • Unlike the RBM, the DBM has several layers of latent variables
  • Like the RBM, within each layer, each of the variables are mutually independent, conditioned on the variables in the neighboring layers

  • Sigmoid belief networks are a simple form of directed graphical model with a specific kind of conditional probability distribution
  • The most common structure of sigmoid belief network is one that is divided into many layers, with ancestral sampling proceeding through a series of many hidden layers then ultimately generating the visible layer. This structure is very similar to the deep belief network, except that the units at the beginning of the sampling process are independent from each other, rather than sampled from a restricted Boltzmann machine
  • A differentiable generator network transforms samples of latent variables $\vector{z}$ into samples $\vector{x}$, or into distributions over samples $\vector{x}$, using a differentiable function $g(\vector{z};\vector{\theta}^{(g)})$, which is typically represented by a neural network
  • This model class includes
    • variational autoencoders, which pair the generator net with an inference net
    • generative adversarial networks, which pair the generator network with a discriminator network
    • techniques that train generator networks in isolation
  • Generator networks are essentially just parameterized computational procedures for generating samples, where the architecture provides the family of possible distributions to sample from and the parameters select a distribution from within that family
  • The two approaches to formulating generator nets - emitting the parameters of a conditional distribution versus directly emitting samples - have complementary strengths and weaknesses. When the generator net defines a conditional distribution over $\vector{x}$, it is capable of generating discrete data as well as continuous data. When the generator net provides samples directly, it is capable of generating only continuous data. The advantage to direct sampling is that we are no longer forced to use conditional distributions whose form can be easily written down and algebraically manipulated by a human designer
  • One very nice property of the variational autoencoder is that simultaneously training a parametric encoder in combination with the generator network forces the model to learn a predictable coordinate system that the encoder can capture
  • The generator network directly produces samples $\vector{x} = g(\vector{z}; \vector{\theta}^{(g)})$
  • Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator
  • The discriminator emits a probability value given by $d(\vector{x}; \vector{\theta}^{(d)})$, indicating the probability that $\vector{x}$ is a real training example rather than a fake sample drawn from the model (the corresponding minimax objective is written out after this list)
  • Convolutional networks for recognition tasks have information flow from the image to some summarization layer at the top of the network, often a class label
  • As this image flows upward through the network, information is discarded as the representation of the image becomes more invariant to nuisance transformations
  • In a generator network, the opposite is true. Rich details must be added as the representation of the image to be generated propagates through the network
  • The primary mechanism for discarding information in a convolutional recognition network is the pooling layer. The generator network seems to need to add information. We cannot put the inverse of a pooling layer into the generator network because most pooling functions are not invertible
  • A simpler operation is to merely increase the spatial size of the representation
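  • For reference, the adversarial game between the generator $g$ and the discriminator $d$ described above is usually written as the zero-sum (minimax) objective

    $$ \min_g \max_d\ \mathbb{E}_{\vector{x} \sim p_{data}} \log d(\vector{x}) + \mathbb{E}_{\vector{z}} \log\big(1 - d(g(\vector{z}))\big)$$

  • At the equilibrium of this game, the generator's samples are indistinguishable from the training data and the discriminator outputs 1/2 everywhere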