There are plenty of open courses and tutorials that explain convolutional neural networks over and over, but they say nothing about the “convolution” itself; they seem to assume that every reader already has the background. This foreign-language article is both approachable and in-depth, so I translated it. In my opinion, the article's high-level explanations of convolution through fluid mechanics, quantum mechanics and so on are a bit radical. Those fields are probably more esoteric than convolution itself, so a quick skim of those parts is enough. The body of the article follows:

Convolution is now probably the most important concept in deep learning. It is convolution and convolutional neural networks that let deep learning surpass almost all other machine learning methods. But why is convolution so powerful? How does it work? In this blog post I will explain convolution and related concepts to help you understand it thoroughly.

There are already many blog posts on the web that explain convolution in deep learning, but I find that they all pile on unnecessary mathematical detail from the start, which makes them hard to follow. This post also contains a lot of mathematical detail, but I will present it step by step and visually so that everyone can follow. The first part of the article is intended to help the reader understand the concept of convolution and convolutional networks in deep learning. The second part introduces some advanced concepts intended to help researchers and advanced practitioners deepen their understanding of convolution.

## What is convolution

The whole post explores this question, but it helps to get the big picture first. So, roughly speaking, what is convolution?

You can think of convolution as a way of mixing information. Imagine two buckets filled with information: we pour them into a single bucket and stir according to some rule. In other words, convolution is a procedure for mixing two sources of information.

Convolution can also be described formally. It is in fact a mathematical operation, not essentially different from addition, subtraction or multiplication. Although the operation itself is complicated, it is very useful for simplifying even more complicated expressions. In physics and engineering, convolution is widely used to simplify equations. Later, after a brief formal description of convolution, we will connect ideas from these fields to deep learning to deepen our understanding of convolution. But first, let's look at convolution from a practical perspective.

### How do we apply convolution to images?

When we apply convolution to an image, we convolve in two dimensions: horizontal and vertical. We mix two buckets of information. The first bucket is the input image, which consists of three matrices of pixels, one for each of the RGB channels, where each pixel is an integer between 0 and 255. The second bucket is the convolution kernel, a single matrix of floating-point numbers. Think of the size and pattern of the kernel as the rule for stirring the image. The output of the convolution is a modified image, often called a feature map in deep learning. There is one feature map for each color channel.

Figure: the effect of an edge-detection convolution kernel

How is this done? We will now demonstrate how convolution mixes these two kinds of information. One way is to take a patch the same size as the kernel from the input image. Here the image is assumed to be 100×100 and the kernel 3×3, so the patch we take is 3×3. We then multiply each pair of elements at the same position and sum the products (this is not matrix multiplication; it is closer to a vector inner product, an element-wise multiplication of two matrices of the same size followed by a sum). The sum of the products gives one pixel of the feature map. Once a pixel has been computed, we shift the patch by one pixel and repeat the operation. The feature map is complete when the patch can no longer be moved to a new position. This process can be demonstrated with the following animation:

In the animation, RAM is the input image and Buffer is the feature map.

You may notice a normalization factor m. Here m is the size of the kernel, 9; dividing by it ensures that the input image and the feature map have the same brightness.
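The patch-by-patch procedure can be sketched in a few lines of NumPy. This is only a minimal illustration, not the code behind the animation: the function name `filter2d_valid`, the uniform test image and the all-ones kernel are assumptions chosen so that the effect of the normalization factor m is easy to see.

```python
import numpy as np

def filter2d_valid(image, kernel, m=1.0):
    """Slide the kernel over the image; at each position multiply
    element-wise, sum the products and divide by the factor m."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]    # block the size of the kernel
            out[r, c] = np.sum(patch * kernel) / m
    return out

image = np.full((100, 100), 200.0)   # hypothetical uniform 100x100 channel
kernel = np.ones((3, 3))             # hypothetical 3x3 kernel of ones
feature_map = filter2d_valid(image, kernel, m=kernel.size)   # m = 9

print(feature_map.shape)                      # size of the feature map
print(feature_map.min(), feature_map.max())   # brightness range after filtering
```

With m = 9 the feature map keeps the input brightness of 200; with m = 1 every value would be nine times larger. Note that the kernel is applied without flipping, as in the description above; for a symmetric kernel this makes no difference.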

### Why is image convolution useful in machine learning?

An image may contain a lot of noise that we don't care about. A good example is the project I did with Jannek Thomas at Burda Bootcamp. Burda Bootcamp is a lab where students build technology products in very short, hackathon-style sprints. Together with 9 colleagues, we made 11 products in 2 months. One of them is a search engine for fashion images built on a deep autoencoder: you upload a picture of a fashion item, and the autoencoder automatically finds items of a similar style.

If you want to distinguish styles of clothing, the color of the clothes is not that important, and neither are details such as logos. The most important thing is the shape of the clothes. In general, the shape of a blouse is very different from the shape of a shirt, a jacket or a pair of pants. If we filter out the extra noise, our algorithm will not be distracted by details such as color and logos. We can do this easily with convolution.

My colleague Jannek Thomas removed all the information in the image except the edges by using the Sobel edge detection filter (similar to the previous image). This is why applying a convolution is often called filtering and why convolution kernels are often called filters (a more precise definition follows below). The feature map generated by the edge detection filter is very useful for distinguishing types of clothing because only the shape information is preserved.
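As a rough illustration of what a Sobel filter does, the sketch below applies the standard 3×3 Sobel kernels to a tiny synthetic image with one vertical edge. The helper `filter2d_valid` and the test image are assumptions made for this example; this is not the pipeline used in the project.

```python
import numpy as np

def filter2d_valid(image, kernel):
    """Element-wise multiply-and-sum of the kernel with every image patch."""
    kh, kw = kernel.shape
    rows, cols = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    return np.array([[np.sum(image[r:r + kh, c:c + kw] * kernel)
                      for c in range(cols)] for r in range(rows)])

# Standard Sobel kernels for horizontal and vertical brightness changes
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# Hypothetical test image: dark left half, bright right half
image = np.zeros((6, 8))
image[:, 4:] = 1.0

gx = filter2d_valid(image, sobel_x)
gy = filter2d_valid(image, sobel_y)
edges = np.hypot(gx, gy)      # gradient magnitude

print(edges[0])               # non-zero only where the patch straddles the edge
```

Flat regions, whatever their brightness, give a response of zero; only the patches that contain the dark-to-bright transition produce a strong response. That is the sense in which the filter keeps shape and discards color.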

In the color image, the upper-left picture is the search query and the others are the search results. You can see that the autoencoder really does pay attention only to the shape of the clothes, not the color.

Going a step further: there are many different kernels that produce different feature maps, for example by sharpening the image (emphasizing details) or blurring it (reducing details), and each feature map may help the algorithm make its decision (some details, such as a garment having 3 buttons instead of 2, may distinguish certain clothes).
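To make the difference between such kernels concrete, here is a small sketch that applies a box-blur kernel and a commonly used sharpening kernel to a step edge. The kernel values, the helper function and the test image are illustrative assumptions, not taken from the original figures.

```python
import numpy as np

def filter2d_valid(image, kernel):
    """Element-wise multiply-and-sum of the kernel with every image patch."""
    kh, kw = kernel.shape
    rows, cols = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    return np.array([[np.sum(image[r:r + kh, c:c + kw] * kernel)
                      for c in range(cols)] for r in range(rows)])

blur = np.ones((3, 3)) / 9.0                      # box blur: average of neighbours
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)   # a common sharpening kernel

# Hypothetical test image: a step from dark (0) to bright (1)
image = np.zeros((5, 8))
image[:, 4:] = 1.0

blurred = filter2d_valid(image, blur)
sharpened = filter2d_valid(image, sharpen)

print(blurred[0])     # the step is smeared into intermediate values
print(sharpened[0])   # the step is exaggerated: values overshoot 0 and 1
```

Both kernels sum to 1, so flat regions keep their brightness; they differ only in how they treat the transition, which is exactly the detail each feature map emphasizes or suppresses.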

This approach of reading in the input, transforming it, and then feeding the feature map to an algorithm is called feature engineering. Feature engineering is very difficult, and there is very little material to help you get started. As a result, few people can apply feature engineering skillfully across multiple domains. Feature engineering is done entirely by hand, and it is the most important skill in Kaggle competitions. It is so difficult because the useful features differ for each type of data: features for an image task may not work for a time series task, and even when two tasks both involve images it is hard to find features that work for both, because the useful features depend on the objects to be recognized. It relies heavily on experience.

So feature engineering is especially difficult for novices. But for images, is it possible to use convolution kernels to find the most suitable features for a task automatically?

### Convolutional neural network

This is exactly what convolutional neural networks do. Instead of using kernels with fixed numbers as we did above, we assign parameters to the kernels, and these parameters are trained on data. As the convolutional neural network trains, the kernels get better and better at filtering images or feature maps to extract useful information. This process is automatic and is called feature learning. Feature learning adapts automatically to new tasks: we only need to train on new data to find new filters. This is why convolutional neural networks are so powerful: no laborious feature engineering is needed!

Convolutional neural networks usually learn not a single kernel but many kernels at multiple levels. For example, applying 32 kernels of size 16×16 to a 256×256 image produces 32 feature maps of size 241×241 (image size − kernel size + 1). So 32 useful new features are obtained automatically. These features can serve as input to the next kernel. Once we have learned features at multiple levels, we simply pass them to a simple fully connected neural network, which performs the classification. That is all you need to know to understand convolutional neural networks conceptually (pooling is also an important topic, but it is left to another blog post).
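The size arithmetic in the example can be checked directly. The sketch below assumes no padding and a stride of one, which is what the formula "image size − kernel size + 1" implies; the function name is made up for this illustration.

```python
def valid_output_size(input_size, kernel_size):
    """Number of positions a kernel can occupy along one axis
    (no padding, stride 1)."""
    return input_size - kernel_size + 1

num_kernels, kernel_h, kernel_w = 32, 16, 16
image_h, image_w = 256, 256

feature_maps_shape = (num_kernels,
                      valid_output_size(image_h, kernel_h),
                      valid_output_size(image_w, kernel_w))
print(feature_maps_shape)   # (32, 241, 241)
```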

## Part II: Advanced concepts

We now have a good initial understanding of convolution, and we know what convolutional neural networks do and why they are so powerful. Now let's dig into what actually happens in the convolution operation. We will see that the explanation given so far is superficial and that more elegant explanations exist. With this deeper understanding we can grasp the nature of convolution and apply it to many different kinds of data. The first step is to understand the principle behind convolution.

### Convolution theorem

To understand convolution, we have to talk about the convolution theorem, which relates a complicated convolution in the time or space domain to a simple element-wise product in the frequency domain. This theorem is very powerful and is widely used in many scientific fields. The convolution theorem is also one of the reasons why the fast Fourier transform is regarded as one of the most important algorithms of the 20th century.

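The theorem can be written as follows:

$$h(x) = (f \otimes g)(x) = \int_{-\infty}^{\infty} f(x - u)\, g(u)\, du = \mathcal{F}^{-1}\!\left(c\, \mathcal{F}[f]\, \mathcal{F}[g]\right)$$

$$\text{feature map}(a, b) = (\text{input} \otimes \text{kernel})(a, b) = \sum_{y} \sum_{x} \text{input}(a - x,\, b - y)\, \text{kernel}(x, y) = \mathcal{F}^{-1}\!\left(c\, \mathcal{F}[\text{input}]\, \mathcal{F}[\text{kernel}]\right)$$

The first equation is the convolution of two continuous functions on a one-dimensional continuous domain; the second equation is a convolution on a two-dimensional discrete domain (an image). Here $\otimes$ denotes convolution, $\mathcal{F}$ the Fourier transform, $\mathcal{F}^{-1}$ the inverse Fourier transform, and $c$ a normalization constant whose value depends on the Fourier transform convention (for example $\sqrt{2\pi}$ for the unitary one-dimensional continuous transform). Here, “discrete” means that the data consists of a finite number of variables (pixels); one-dimensional means that the data has a single dimension (time), whereas an image is two-dimensional and a video is three-dimensional.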
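The convolution theorem can be checked numerically. The sketch below computes a full two-dimensional discrete convolution directly and again through NumPy's FFT. The random test data is an assumption for illustration, and with NumPy's FFT convention (no scaling on the forward transform, 1/N on the inverse) the normalization constant is simply 1.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))     # hypothetical input
kernel = rng.random((3, 3))    # hypothetical kernel

# Direct "full" convolution:
#   out[a, b] = sum over x, y of image[a - x, b - y] * kernel[x, y]
H, W = image.shape
kh, kw = kernel.shape
direct = np.zeros((H + kh - 1, W + kw - 1))
for x in range(kh):
    for y in range(kw):
        direct[x:x + H, y:y + W] += kernel[x, y] * image

# Same result via the frequency domain: transform, multiply element-wise,
# transform back. `s=shape` zero-pads both inputs to the full output size.
shape = direct.shape
via_fft = np.fft.ifft2(np.fft.fft2(image, s=shape) *
                       np.fft.fft2(kernel, s=shape)).real

print(np.allclose(direct, via_fft))
```

The zero-padding matters: without it the FFT product corresponds to a circular convolution, in which the kernel wraps around the image borders.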

In order to better understand the convolution theorem, we also need to understand the Fourier transform in digital image processing.

### Fast Fourier transform

The fast Fourier transform is an algorithm that converts data from the time or space domain into the frequency domain. The Fourier transform represents the original function as a sum of sine and cosine waves. Note that the Fourier transform generally involves complex numbers: a real value is transformed into a complex value with a real and an imaginary part. The imaginary part is usually needed only for certain operations, such as transforming from the frequency domain back into the time or space domain, and it will be ignored in this post. Below you can see how a signal (a function of time is usually called a signal) is Fourier transformed:

Red is the time domain and blue is the frequency domain.

You may think you have never seen anything like this, but I am sure you have: if the red curve is a piece of music, then the blue values are the spectrum you see on the screen of your MP3 player.
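A minimal sketch of the same idea with NumPy: a signal made of two sine waves is transformed, and the two strongest frequency components are read off the spectrum. The sampling rate and the two frequencies are assumptions for this example, and the sketch looks at the magnitude of the complex spectrum, a common choice, rather than only its real part.

```python
import numpy as np

fs = 100                                  # assumed sampling rate, samples per second
t = np.arange(fs) / fs                    # one second of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.fft.rfft(signal)            # complex spectrum of a real signal
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)   # frequency of each bin in Hz
magnitude = np.abs(spectrum)

# The two strongest components are the two sine waves we put in
strongest = sorted(float(f) for f in freqs[np.argsort(magnitude)[-2:]])
print(strongest)   # [5.0, 12.0]
```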

### Images in the Fourier domain

How can we picture the frequencies of an image? Imagine a sheet of paper printed with a pattern of alternating black and white stripes. Now picture a wave crossing the paper perpendicular to the stripes, rising over one color and dipping at the other. A wave that passes through the black and white parts at regular intervals represents a frequency. In the frequency domain, low frequencies are closer to the center and high frequencies are closer to the edge. The positions of high intensity (bright, white) in the frequency domain indicate the direction in which the brightness of the original image changes. This is particularly evident in the next image and its logarithmic Fourier transform (taking the logarithm of the real part of the Fourier transform reduces the differences in pixel brightness and makes a wider range of intensities visible):

We can immediately see that the Fourier transform contains information about the orientation of the object. If the object is rotated by an angle, it may be difficult to judge from the image pixels, but it can be clearly seen from the frequency domain.
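The link between orientation and position in the frequency domain can be illustrated with a synthetic stripe pattern. In this sketch the image size, the stripe frequency and the helper `dominant_frequency` are all assumptions; rotating the stripes by 90 degrees moves the peak from one frequency axis to the other.

```python
import numpy as np

n = 64
x = np.arange(n)
# Brightness changes from left to right (vertical stripes), 8 cycles per image
stripes = np.tile(np.sin(2 * np.pi * 8 * x / n), (n, 1))
rotated = stripes.T                        # same pattern rotated by 90 degrees

def dominant_frequency(img):
    """Return (|vertical frequency|, |horizontal frequency|) of the
    strongest component, in cycles per image."""
    mag = np.abs(np.fft.fft2(img))
    mag[0, 0] = 0.0                        # ignore the mean brightness
    r, c = np.unravel_index(np.argmax(mag), mag.shape)
    freqs = np.fft.fftfreq(img.shape[0]) * img.shape[0]
    return int(abs(round(freqs[r]))), int(abs(round(freqs[c])))

print(dominant_frequency(stripes))   # (0, 8)
print(dominant_frequency(rotated))   # (8, 0)
```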

This is an important insight. By the convolution theorem, we can think of convolutional neural networks as operating on images in the frequency domain, where the orientation of objects is captured. This suggests that convolutional neural networks should handle rotated images better than traditional algorithms (although they are still far from matching humans).

### Frequency filtering and convolution

Why is convolution often described as filtering, and why are convolution kernels often called filters? The following example explains it:

If we apply a Fourier transform to the image and multiply the result by a circle (the background outside the circle is black, that is, 0), we filter out all the high frequencies (they become 0 because the background is 0). Note that the filtered image still shows the stripe pattern, but its quality has dropped considerably. This is similar to how the JPEG compression algorithm works (JPEG differs in the details but uses a related transform): we transform the image, keep only part of the frequencies, and finally apply the inverse transform to get back a two-dimensional picture. In this example the compression ratio is the ratio of the black background to the circle.

We now think of the circle as a convolution kernel, and we have a complete convolution, just like those in a convolutional neural network. There are many tricks for making the Fourier transform fast and stable, but this is the basic idea.
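Frequency filtering with a circular mask can be sketched as follows. The test image (coarse stripes plus fine stripes), the radius of the circle and all variable names are assumptions for illustration; the point is only that multiplying the centered spectrum by a disc removes the fine stripes and keeps the coarse ones.

```python
import numpy as np

n = 64
yy, xx = np.mgrid[0:n, 0:n]
low = np.sin(2 * np.pi * 2 * xx / n)      # coarse stripes (low frequency)
high = np.sin(2 * np.pi * 20 * xx / n)    # fine stripes (high frequency)
image = low + high

# Move the zero frequency to the centre so low frequencies sit in the middle
spectrum = np.fft.fftshift(np.fft.fft2(image))

# Circular mask: 1 inside the disc, 0 (black) outside
radius = 8
mask = (yy - n // 2) ** 2 + (xx - n // 2) ** 2 <= radius ** 2

# Multiply in the frequency domain, then transform back
filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real

print(np.allclose(filtered, low))   # only the coarse stripes survive
```

By the convolution theorem, this multiplication in the frequency domain is equivalent to convolving the image with the inverse transform of the disc.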

Now that we have understood the convolution theorem and the Fourier transform, we can apply these concepts to other scientific fields to enhance our understanding of convolution in deep learning.

### Inspiration from fluid mechanics

Fluid mechanics builds a large number of differential equation models for air and water. The Fourier transform simplifies not only convolution but also differentiation, which helps any field that relies on differential equations. Sometimes the only way to obtain an analytical solution is to apply a Fourier transform to the differential equation. In the process, the solution is often written as a convolution of two functions, which gives a simpler expression. This is an application in one dimension; there are also applications in two dimensions, for example in astronomy.

### Diffusion

You can mix two liquids (milk and coffee) by applying an external force (stirring) – this is called convection and is a quick process. You can also wait patiently for the natural mixing of the two liquids – this is called diffusion, which is usually a very slow process.

Imagine a fish tank separated by a plate with different concentrations of brine on each side. After the plate is removed, the brine on both sides will gradually mix to the same concentration. The greater the difference in concentration, the more intense this process is.

Now imagine a fish tank divided into 256 × 256 compartments by 256 × 256 plates (translator's note: this number seems wrong), each compartment holding salt water of a different concentration. If you remove all the plates, there will be little diffusion between compartments of similar concentration but strong diffusion between compartments whose concentrations differ greatly. These compartments are the pixels, and the concentration is the brightness of the pixels. The diffusion of concentration is the diffusion of pixel brightness.

This shows that diffusion is similar to convolution: the initial state consists of liquids of different concentrations, or of pixels of different intensities. To complete the explanation, we also need to understand the propagator.
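The fish tank analogy can be simulated as a repeated convolution. The sketch below is a one-dimensional toy model under stated assumptions: 16 compartments arranged in a ring (so that the borders need no special handling) and a made-up kernel that keeps half of each compartment's salt and sends a quarter to each neighbor per step.

```python
import numpy as np

tank = np.array([1.0] * 8 + [0.0] * 8)    # salty left half, fresh right half
kernel = np.array([0.25, 0.5, 0.25])      # share sent left, kept, sent right

def diffuse(concentration, steps):
    """Apply `steps` rounds of circular convolution with the kernel."""
    c = concentration.copy()
    for _ in range(steps):
        c = (kernel[0] * np.roll(c, -1)   # arrives from the right neighbour
             + kernel[1] * c              # stays in place
             + kernel[2] * np.roll(c, 1)) # arrives from the left neighbour
    return c

after_one = diffuse(tank, 1)
after_many = diffuse(tank, 500)

print(after_one)    # only compartments next to a concentration jump change
print(after_many)   # everything approaches the average concentration
```

After one step the compartments surrounded by equal concentrations are unchanged, while those at a jump change the most; after many steps the tank approaches a uniform concentration and the total amount of salt is conserved.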

### Understanding the propagator

The propagator is a probability density function that indicates in which direction the fluid particles will move. The problem is that a neural network has no such probability function, only a convolution kernel. How do we unify these two concepts?

We can turn the convolution kernel into a probability density function by normalizing it. This is much like the softmax used to compute output values. Here is the softmax result for the convolution kernel from the first example:
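As a sketch of this normalization, the code below applies a softmax to a common 3×3 edge-detection kernel (center 8, neighbors −1). That kernel is an assumption made for illustration and may differ from the one shown in the figure.

```python
import numpy as np

# Assumed edge-detection kernel; not necessarily the one from the figure
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=float)

def softmax(values):
    """Exponentiate and normalise so that all entries sum to 1."""
    e = np.exp(values - values.max())   # subtract the max for numerical stability
    return e / e.sum()

propagator = softmax(kernel)
print(np.round(propagator, 4))
print(propagator.sum())
```

For this kernel almost all of the probability mass ends up in the center entry, and each of the eight neighbors receives only about 0.0001.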

Now we can understand convolution on an image from the perspective of diffusion. We can think of convolution as two diffusion processes. First, diffusion occurs wherever the pixel brightness changes (from black to white, and so on); second, the diffusion within a region follows the probability distribution given by the convolution kernel. This means that the pixels in the region covered by the kernel must spread according to these probabilities.

In the edge detector above, almost all of the information near an edge is concentrated on the edge (this could not happen in real fluid diffusion, but the explanation holds mathematically). For example, all pixels under the entries of about 0.0001 are very likely to flow into the center and accumulate there. The area that differs most from its surrounding pixels becomes a region of concentrated intensity because diffusion is strongest there. Conversely, the place where intensity is most concentrated is where the contrast with the surroundings is strongest, which is the edge of an object. This explains why this kernel is an edge detector.

So we have a physical explanation: convolution is the diffusion of information. We can apply this interpretation directly to other kernels. Sometimes we need to apply a softmax normalization before we can interpret a kernel, but in general the numbers in the kernel are enough to tell us what it does. For example, can you infer the purpose of the following kernel?

### Wait, I'm a little confused

How can a convolution kernel made of probabilities produce a deterministic result? Shouldn't we compute the diffusion of individual particles according to the probability distribution of the kernel, that is, the propagator?

Yes, we should. However, even a small amount of liquid, such as a drop of water, contains millions of water molecules. Although the random movement of a single molecule follows the propagator, the macroscopic behavior of a large number of molecules is essentially deterministic. This is the statistical explanation, and it is also the explanation from fluid mechanics. We can interpret the probability distribution of the propagator as the average distribution of information or pixel brightness; in other words, our interpretation holds up from the point of view of fluid mechanics. That said, there is also a stochastic interpretation of convolution.
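This statistical argument can be illustrated with a small simulation. In the sketch below, every unit of brightness is represented by many particles, each of which moves randomly according to a made-up one-dimensional propagator; the particle counts are then compared with the deterministic convolution. The signal, the propagator values, the ring-shaped domain and the number of particles are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.array([0.0, 0.0, 1.0, 3.0, 0.0, 0.0])   # brightness per pixel
offsets = np.array([-1, 0, 1])                      # move left, stay, move right
propagator = np.array([0.2, 0.5, 0.3])              # assumed probabilities

# Deterministic view: circular convolution of the signal with the propagator
expected = sum(p * np.roll(signal, o) for p, o in zip(propagator, offsets))

# Stochastic view: many particles per unit of brightness, each moving at random
per_unit = 100_000
positions = np.repeat(np.arange(signal.size), (signal * per_unit).astype(int))
moves = rng.choice(offsets, size=positions.size, p=propagator)
counts = np.bincount((positions + moves) % signal.size, minlength=signal.size)
simulated = counts / per_unit

print(expected)
print(np.round(simulated, 3))
```

Each particle moves at random, yet with hundreds of thousands of particles the simulated distribution agrees with the deterministic convolution to within a small sampling error.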

### Inspiration from quantum mechanics

The propagator is an important concept in quantum mechanics. In quantum mechanics, a particle may be in a superposition of states, in which case it has two or more properties at once, so that its specific location in the observable world cannot be determined. For example, a particle may exist in two different locations at the same time.

But if you measure the state of the particle, for example where it is right now, it can only be in one specific location. In other words, by observing the particle you destroy its superposition. The propagator describes the probability distribution of the particle's location. For example, after a measurement a particle may be, according to the probability function of the propagator, at A with a probability of 30% and at B with a probability of 70%.

Through quantum entanglement, a few particles can store hundreds or millions of states simultaneously. This is the power of quantum computers.

If we carry this interpretation over to deep learning, we can imagine the picture as being in a superposition, so that within each 3×3 patch every pixel is in 9 positions at the same time. Applying the convolution amounts to making an observation, after which each pixel collapses to a single position according to the probability distribution, and the resulting single pixel is the average of all the pixels. For this interpretation to hold, the convolution must be a stochastic process, which means that applying the same kernel to the same image would produce different results. This interpretation does not correspond one-to-one to convolution, but it may give you ideas for using convolution as a stochastic process or for devising a convolutional network algorithm for quantum computers. A quantum algorithm would be able to compute all possible combinations of the states described by the kernel in linear time.

### Inspiration from probability theory

Convolution and cross-correlation are closely linked. Cross-correlation is a way of measuring the similarity between a small piece of information (a few seconds of music) and a large piece of information (the whole piece of music); YouTube uses a similar technique to detect infringing videos.

Although the formula for cross-correlation looks difficult, we can see its connection to deep learning right away with the following trick. In an image search, we simply flip the query image, use it as a kernel, and then compute the cross-correlation by convolution. The result is a picture with one or more bright spots. The bright spots mark where the query (a face, in this example) appears.
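The trick of flipping the query and convolving can be sketched as follows. The random image, the position of the query patch and the helper `convolve2d_valid` are assumptions for illustration; subtracting the mean first is an extra step, added here so that uniformly bright regions do not dominate the response.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """True convolution: the kernel is flipped before the sliding
    multiply-and-sum. Only fully overlapping positions are computed."""
    flipped = kernel[::-1, ::-1]
    kh, kw = kernel.shape
    rows, cols = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    return np.array([[np.sum(image[r:r + kh, c:c + kw] * flipped)
                      for c in range(cols)] for r in range(rows)])

rng = np.random.default_rng(0)
image = rng.random((32, 32))            # hypothetical picture
query = image[5:13, 11:19].copy()       # 8x8 patch we want to find again

image_c = image - image.mean()
query_c = query - query.mean()

# Flip the query and convolve: the two flips cancel, giving cross-correlation
response = convolve2d_valid(image_c, query_c[::-1, ::-1])
peak = tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))

print(peak)   # position of the brightest spot in the response
```

The brightest spot of the response lies at the top-left corner of the place the query was cut from.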

This example also shows a technique for stabilizing the Fourier transform by zero-padding, which is used in many variants of the Fourier transform. Other padding techniques exist as well, such as tiling the kernel, divide and conquer, and so on. I won't go into them here. There is an extensive literature on the Fourier transform, with many tricks in it, especially for images.

At the lowest level, the first layer of a convolutional network does not perform cross-correlation, because the first layer performs edge detection. Later layers work with more abstract features, and there cross-correlation becomes possible. It is conceivable that these bright pixels are passed on to units that detect human faces (the network of the Google Brain project has units dedicated to recognizing faces, cats and so on; perhaps cross-correlation plays a role there?).

### Statistical inspiration

What is the difference between a statistical model and a machine learning model? Statistical models care about only a few interpretable variables. Their purpose is often to answer a question such as: is drug A better than drug B?

Machine learning models focus on predictive performance: for age X, the cure rate of drug A is 17% higher than that of drug B, and for age Y it is 23% higher.

Machine learning models usually predict better than statistical models, but they are less reliable. Statistical models are better at producing accurate and trustworthy results: even if drug A is 17% better than drug B, we don't know whether this is due to chance. We need statistical models to judge that.

For time series data there are two important models: the weighted moving average and the autoregressive model, both of which belong to the family of ARIMA models (autoregressive integrated moving average models). ARIMA models are weaker than LSTMs, but on low-dimensional data (1 to 5 dimensions) they are very robust. Although they are somewhat hard to interpret, ARIMA models are by no means black boxes like deep learning algorithms. If you need a model you can trust, this is a huge advantage.

We can write these statistical models in the form of a convolution, and then the convolution in deep learning can be interpreted as a function that produces local ARIMA features. The two formulations do not coincide exactly, so the analogy should be used with caution.

Here C is a function that takes the kernel as a parameter, and white noise is normalized, uncorrelated data with a mean of 0 and a variance of 1.

When we preprocess data, we often bring it into a form similar to white noise: we shift the data to a mean of 0 and scale the variance to 1. We rarely remove the correlations in the data because doing so is computationally expensive. Conceptually, however, it is very simple: we rotate the axes so that they line up with the eigenvectors of the data:
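The full preprocessing, including the decorrelation step, can be sketched with NumPy. The correlated two-dimensional data set below is made up for illustration; the rotation uses the eigenvectors of the covariance matrix, which is one standard way of whitening data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2-D data with non-zero mean
mixing = np.array([[2.0, 0.0],
                   [1.5, 0.5]])
raw = rng.normal(size=(5000, 2)) @ mixing + np.array([3.0, -1.0])

centered = raw - raw.mean(axis=0)              # shift to mean 0
cov = np.cov(centered, rowvar=False)           # covariance between the variables
eigvals, eigvecs = np.linalg.eigh(cov)         # axes of the data cloud

rotated = centered @ eigvecs                   # rotate onto the eigenvectors
whitened = rotated / np.sqrt(eigvals)          # scale every axis to variance 1

print(np.round(np.cov(whitened, rowvar=False), 6))   # close to the identity matrix
```

After the rotation the variables are uncorrelated, and after the scaling each has unit variance, so the covariance matrix of the result is the identity matrix.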

Now if we take C to be the bias, the expression looks very much like a convolution in a convolutional neural network. So the output of a convolutional layer can be interpreted as the output of an autoregressive model applied to data that has been preprocessed into white noise.

The weighted moving average is even simpler to explain: it is the convolution of the input data with a fixed kernel. Look at the Gaussian smoothing kernel at the end of this post to understand this explanation. A Gaussian smoothing kernel can be thought of as taking the weighted average of each pixel and its neighbors; in other words, each pixel is averaged with its neighbors (which blurs the edges).
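As a small check of this statement, the sketch below computes a weighted moving average of a short series twice: once as an explicit weighted sum over each window and once with `np.convolve` using the same fixed kernel. The series, the kernel width and the Gaussian standard deviation of 1 are assumptions for illustration.

```python
import numpy as np

series = np.array([1.0, 2.0, 6.0, 2.0, 1.0, 5.0, 3.0, 4.0, 2.0])

# Fixed Gaussian-shaped weights over five neighbouring points
offsets = np.arange(-2, 3)
weights = np.exp(-offsets ** 2 / 2.0)
weights /= weights.sum()                 # weights sum to 1

# Weighted moving average written out window by window
manual = np.array([np.dot(series[i:i + 5], weights)
                   for i in range(len(series) - 4)])

# The same computation as a convolution with the fixed kernel
smoothed = np.convolve(series, weights, mode="valid")

print(np.allclose(manual, smoothed))
```

`np.convolve` flips the kernel before sliding it, but because the Gaussian weights are symmetric the flip has no effect here.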

Although a single kernel cannot produce autoregressive and weighted moving average features at the same time, we can use multiple kernels to produce different features.

## To sum up

In this post we learned what convolution is and why it is so useful in deep learning. The interpretation in terms of image patches is easy to understand and compute, but it has theoretical limitations. By studying the Fourier transform, we learned that an image in the frequency domain contains a lot of information about the orientation of objects. With the help of the powerful convolution theorem, we came to understand convolution as a flow of information between pixels. We then extended this with the concept of the propagator from quantum mechanics and obtained a stochastic interpretation of a deterministic process. We showed the similarity between cross-correlation and convolution: the performance of convolutional networks may depend on the degree of cross-correlation between feature maps, which is measured by convolution. Finally, we related convolution to two statistical models.

Personally, I found writing this post very interesting. For a long time I felt that undergraduate math and statistics classes were a waste of time because they were too impractical (even the applied mathematics ones). But then, like suddenly winning the jackpot, all this knowledge came together and gave me a new understanding. I think this is a wonderful example of why we should patiently study all of our university courses, even if they seem useless at first.