Learning a joint model of images and titles
This lesson introduces a recent technique for training a joint model from two inputs: feature vectors of image captions and image pixels. The goal is to learn the connection between the two, with image search as an auxiliary task. At the end there is a video demo: feed in text to generate an image, and feed in an image to generate text.
This model is harder to train than the joint model of tags and images introduced in the previous lesson. The training procedure is:
- Train a multi-layer model on the training images
- Train a separate multi-layer model on the word-count vectors of the captions
- Connect the two models with a new top layer, then jointly train the entire system so that each side's features improve the other's
- Training uses a deep Boltzmann machine (DBM) rather than a deep belief net (DBN), because in a DBM the higher-layer features can improve the lower-layer features
The biggest difficulty is pre-training the hidden layers of the DBM: the commonly used pre-training procedure produces a DBN, not a DBM.
Combining three RBMs into a DBM
This takes some care: naively stacking the RBMs on top of each other gives you a DBN, not a DBM.
The right way is to control the weight scaling during training so that, when the RBMs are merged, halving the shared weights yields a proper DBM:
Note the factor on the arc: inference for h1 receives input from both above and below, so to avoid double-counting the evidence, the weights in each direction are halved. See the paper for the rigorous proof.
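A minimal numpy sketch of why the halving matters (the layer sizes and random weights below are made up for illustration): the middle layer h1 gets evidence from both the layer below and the layer above, so each direction contributes only half.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Toy layer sizes: visible, first hidden, second hidden.
n_v, n_h1, n_h2 = 6, 4, 3

# Pretend these came from two separately pre-trained RBMs.
W1 = rng.normal(scale=0.1, size=(n_v, n_h1))   # RBM 1: v <-> h1
W2 = rng.normal(scale=0.1, size=(n_h1, n_h2))  # RBM 2: h1 <-> h2

v  = rng.integers(0, 2, size=n_v).astype(float)
h2 = rng.integers(0, 2, size=n_h2).astype(float)

# Naive stacking would count the evidence for h1 at full strength twice:
naive = sigmoid(v @ W1 + W2 @ h2)

# In the merged DBM, each direction's contribution is halved
# to avoid double-counting:
p_h1 = sigmoid(0.5 * (v @ W1) + 0.5 * (W2 @ h2))
```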
The effect is as follows:
Hierarchical coordinate frames
This section is about recognizing objects by recognizing their parts. There are roughly three approaches to this task: deep CNNs (currently the best), part-based approaches (the most likely to surpass CNNs), and the hand-coded features of traditional computer vision (already defeated by CNNs).
Why CNNs are doomed
Even at the height of their success, CNNs have an innate flaw that limits how far they can go. Pooling uses a pile of replicated feature detectors but throws away the precise positions of the features they detect. Overlapping the pools recovers some of this, but does not solve the fundamental problem: CNNs still cannot handle viewpoint transformations, unless they use a huge number of replicated feature detectors, which blows up the model's complexity. Humans, by contrast, are very good at this (hence attempts at "bionics"). The current workaround is to transform the images by hand, i.e., enlarge the training set to paper over the problem.
The hierarchical coordinate frame approach
A smarter approach is to use a group of neurons to represent both the shape of a feature and its pose relative to the retina. Pose here means the relationship between the retina's coordinate frame and the feature's coordinate frame.
Given this relationship, larger features can be recognized from the consistent poses of several smaller ones. For example:
If you see a mouth below a nose, you may well conclude it is a face; but if the nose is below the mouth, it does not suggest a face at all. On the left, the face pose predicted from the nose and the one predicted from the mouth agree; on the right they do not.
Two layers in a hierarchy of parts
When several low-level visual units simultaneously support the same larger pose, the high-level visual unit corresponding to that pose is activated:
Ti is the pose of a particular mouth, p is a logistic unit, and Tij represents the spatial relationship between the mouth and the face. Computer vision here runs as the inverse of computer graphics.
The key property of pose vectors
With these vectors, spatial transformations reduce to linear (matrix) operations. That makes learning a hierarchy of visual entities simple, and makes generalizing across viewpoints simple too. A shape's invariance is no longer carried by activation values but by the spatial relationships Tij: when the viewpoint changes, the activations change and the object's pose changes, but the relationships Tij stay the same.
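A small numpy sketch of the nose/mouth example above, under illustrative assumptions: poses are 2-D homogeneous transforms, and the part-to-whole relations Tij are fixed matrices (in a real system they would be learned). Each part predicts the pose of the whole face by one matrix multiply, and the face unit fires only when the predictions agree.

```python
import numpy as np

def pose(tx, ty, theta):
    """A 2-D pose as a 3x3 homogeneous transform (rotation + translation)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0, 1.0]])

# Ground-truth face pose relative to the retina (made up for the demo).
T_face_true = pose(0.5, 0.2, 0.3)

# Part-to-whole relations (the Tij): where the face sits relative to each part.
T_nose_to_face  = pose(0.0, 1.0, 0.0)
T_mouth_to_face = pose(0.0, 3.0, 0.0)

# Part poses consistent with that one face.
T_nose  = T_face_true @ np.linalg.inv(T_nose_to_face)
T_mouth = T_face_true @ np.linalg.inv(T_mouth_to_face)

# Each part predicts the whole's pose by a linear operation.
face_from_nose  = T_nose  @ T_nose_to_face
face_from_mouth = T_mouth @ T_mouth_to_face

# Consistent parts -> predictions agree -> the face unit activates.
agree = np.allclose(face_from_nose, face_from_mouth)

# Jumbled parts (a mouth where the nose should be) -> predictions
# disagree -> no face is perceived.
face_from_swapped = T_mouth @ T_nose_to_face
disagree = not np.allclose(face_from_nose, face_from_swapped)
```

A viewpoint change multiplies every part's pose by the same transform, so the predictions still agree: invariance lives in the Tij, not in the activations.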
The human visual system uses a coordinate system to represent shapes
Which country is the map on the left? Most people say Australia, some say Africa, but it is neither. What is the shape on the right? Some say a square, but its four corners are not right angles; some say a diamond, but its two diagonals are not equal.
To recognize a shape, the human visual system imposes a coordinate frame (that is, the spatial relationships of the parts); different people may impose different frames and therefore see different shapes.
Bayesian optimization of neural network hyperparameters
This section introduces a recent tuning method: use machine learning to tune hyperparameters, rather than graduate students (or "experts") tuning by hand. Gaussian processes excel in domains where similar inputs produce similar outputs, and hyperparameter tuning happens to be such a domain, so they can be used to find good settings automatically.
Using machine learning to find hyperparameters
One of the most common reasons people are afraid to use neural networks is that tuning requires too much expertise, for example:
- Number of network layers
- Number of units per layer
- Unit type
- Weight penalty
- Learning rate
- Momentum, etc.
A naive approach is grid search: try every combination of hyperparameter values. But this is computationally prohibitive, and it has a second problem: when you hold everything else fixed and sweep one parameter, the whole sweep is wasted if that parameter happens to have no effect. A slight improvement is to sample random combinations, but that is still blind search.
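The cost gap is easy to see in a sketch (the search space below is made up for illustration):

```python
import itertools
import random

# Hypothetical search space; real ranges depend on the problem.
space = {
    "n_layers":   [2, 3, 4, 5],
    "n_units":    [64, 128, 256, 512],
    "learn_rate": [1e-4, 1e-3, 1e-2],
    "momentum":   [0.5, 0.9, 0.99],
}

# Grid search: every combination, 4 * 4 * 3 * 3 = 144 full training runs.
grid = list(itertools.product(*space.values()))

# Random search: sample a handful of combinations instead, so a
# useless hyperparameter doesn't waste a whole axis of the grid.
def random_config(rng):
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)
trials = [random_config(rng) for _ in range(8)]
```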
Machine learning to the rescue
Can machine learning mimic a graduate student's tuning? Yes: instead of blindly trying random hyperparameter combinations, analyze the results so far, predict which regions of hyperparameter space are likely to do better, and then explore those regions deliberately.
Since evaluating one hyperparameter combination is hugely expensive, it is relatively cheap to fit a model that predicts how combinations will perform.
Gaussian Process models
This model makes only one simple prior assumption: similar inputs give similar outputs. For each dimension of the hyperparameter vector, it learns how far apart two settings must be in that dimension to count as dissimilar. The GP predicts not just the mean of the result for a hyperparameter setting but an entire Gaussian distribution (hence the name): for inputs similar to previous ones the predicted variance is small, while for dissimilar inputs it is large.
Keep track of the best hyperparameter combination so far, and for each new trial pick the combination most likely to beat it. It is like a hedge-fund bet: if it pays off, I win big; if not, the loss is small.
The red line is the current best score, and the green Gaussians show the predicted score distributions for three candidate combinations A, B, and C. Since C has the largest area above the red line, C is the best bet, even though its mean is clearly below the red line.
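The "area above the red line" rule can be sketched with a tiny hand-rolled GP. Everything below is illustrative: a single hyperparameter scaled to [0, 1], a fixed RBF kernel length scale, and made-up past scores; the acquisition rule is the probability that a candidate beats the current best.

```python
import numpy as np
from math import erf, sqrt

def rbf(a, b, length=0.3):
    """Squared-exponential kernel: similar settings get similar scores."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_posterior(x_seen, y_seen, x_new, noise=1e-6):
    """GP posterior mean and std at new points, given past trials."""
    K = rbf(x_seen, x_seen) + noise * np.eye(len(x_seen))
    K_s = rbf(x_seen, x_new)
    alpha = np.linalg.solve(K, K_s)
    mu = alpha.T @ y_seen
    var = 1.0 - np.sum(K_s * alpha, axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

# Made-up past trials: hyperparameter setting -> validation accuracy.
x_seen = np.array([0.1, 0.4, 0.7])
y_seen = np.array([0.62, 0.71, 0.55])
best = y_seen.max()                      # the "red line"

# Candidate settings; the GP predicts a full Gaussian for each.
candidates = np.array([0.2, 0.5, 0.95])
y_mean = y_seen.mean()                   # center scores for the zero-mean GP
mu, sigma = gp_posterior(x_seen, y_seen - y_mean, candidates)
mu += y_mean

# Probability of improvement: how much of each predicted Gaussian
# lies above the current best. Bet on the candidate with the most.
p_improve = np.array([0.5 * (1 + erf((m - best) / (s * sqrt(2))))
                      for m, s in zip(mu, sigma)])
pick = candidates[int(np.argmax(p_improve))]
```

Note how the far-away candidate gets a large predicted variance, so it can still be a good bet despite a mediocre predicted mean, exactly like C in the figure.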
How well does it work?
Given enough compute, the Gaussian process beats manual tuning by a wide margin: with so many hyperparameters interacting, humans get dizzy and fail to notice the effect of any single one.
It also prevents a kind of cheating: when people write a paper to show their new method has a significant advantage, they often tune the new method's hyperparameters desperately while barely bothering to tune the baseline's. A computer has no such bias.
In this final section, Hinton explains why long-term predictions about machine learning are foolish.
Driving in fog
Consider driving at night in clear weather: the number of photons from the tail lights of the car ahead reaching the driver's retina falls off as the inverse square of the distance between the two cars, 1/(d^2). In fog, 1/(d^2) still holds at short range, but at long range the fog absorbs photons and the intensity decays exponentially, like e^(−d).
So a model fit at short range, 1/(d^2), extrapolated into the fog can predict "no car ahead" when there actually is one, and the car crashes.
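The two regimes are easy to check numerically. As a toy illustration, assume the fog simply multiplies the inverse-square law by an absorption factor e^(−d):

```python
import math

def photons_clear(d):
    """Clear air: intensity falls off as the inverse square of distance."""
    return 1.0 / d**2

def photons_fog(d, absorb=1.0):
    """Fog: inverse square times exponential absorption of photons."""
    return math.exp(-absorb * d) / d**2

# At short range the two models nearly agree ...
near = photons_fog(0.1) / photons_clear(0.1)   # exp(-0.1), about 0.9
# ... but at long range fog extinguishes the light far faster, so the
# 1/d^2 model extrapolated into fog badly over-predicts visibility.
far = photons_fog(10.0) / photons_clear(10.0)  # exp(-10), about 5e-5
```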
Exponential growth effect
Knowledge also grows exponentially. In the short term things seem to develop slowly and prediction is possible, for example guessing what the next generation iPhone will do. But long-term prediction is impossible, just like seeing through fog.
So the long-term future of machine learning cannot be foreseen, but over the next five years (to 2017), large deep neural networks will certainly achieve many amazing things.
And with that, the Neural Networks for Machine Learning note series ends. As a first course in neural networks it is on the difficult side, but you get that much more out of it. Congratulations!