Convolutional neural networks: 1. convolutional layers (2-dimensional)

From super flexible to somewhat flexible

In the first posting of this blog, we discussed how to build the most flexible form of a neural network, called a feedforward neural network (FNN). There are various versions of universal approximation theorems stating that such neural networks can approximate any map $\mathbb{R}^m \rightarrow \mathbb{R}^d$ in a large class (e.g., continuous maps on a compact set with the supremum norm, or $L^p$ maps with the $L^p$-norm); we shall refer to any such statement as a universality result. Unfortunately, no universality result provides an algorithm to find such an FNN.

Disclaimer. I have personally gone through a decent number of references, and the last sentence above reflects what seems to be true based on my search rather than an absolute truth. However, it is evident that stochastic gradient descent (SGD) and its variants cannot guarantee that each step (whose gradients we compute via backpropagation) is actually a descent step, so it is safe to say that the learning algorithms we use do not guarantee approximating an arbitrary given map $\mathbb{R}^m \rightarrow \mathbb{R}^d$.

Remark. Despite the lack of a practical algorithm, universality is itself an interesting phenomenon, which gives us hope that if we keep working hard, we may eventually find an algorithm that produces a (feedforward) neural network approximating the map that best describes the training data as well as unseen data. There are many papers on universality, but among them, I find Park et al. (2021) very interesting. Their result tells us that there are many maps $\mathbb{R}^m \rightarrow \mathbb{R}^d$ that we cannot approximate well by an FNN unless all of the hidden layers have width at least $\max(m+1, d)$.

Hidden simplicity. Then what do we need to do in order to find a practical algorithm to approximate a given map? Note that in practice we don't have the map; otherwise, we would have just used it. We believe that a map that governs the training data, as well as any unseen data, exists up to some error, and we want to get as close to it as possible by choosing finitely many real numbers (i.e., parameters) that define a neural network approximating this imagined map.

Maybe, such a map is more specific than the ones appearing in the universality results due to "hidden simplicity" specific to the given task that we may not have considered when we thought about all kinds of possible maps. What do we mean by this? Let's consider the example of image classification.

Let $c$ be the number of colors we use (e.g., $c = 3$ when we use RGB) for an image, which is often called the number of channels. For each color, we have a grid of pixels, where we let

  • $h$ be the height, the number of pixels in the vertical direction, and
  • $w$ be the width, the number of pixels in the horizontal direction.
For a given color, each pixel records a number that represents the intensity of that color at that location. An image thus consists of $c$ grids of size $h \times w$. Mathematically, we can write the domain as the tensor product

$$\mathbb{R}^c \otimes \mathbb{R}^h \otimes \mathbb{R}^w \simeq \mathbb{R}^{c h w},$$

but it is more intuitively helpful to think of each element of this domain as $c$ matrices of size $h \times w$ over $\mathbb{R}$. Consider any such input element $(X[0], X[1], \dots, X[c-1])$, which can also be considered as a vector in $\mathbb{R}^{chw}$. We want to model a map such that if we input an image it tells us what the image is likely to represent among $d$ labels. Such a map should be of the form $\mathbb{R}^{chw} \rightarrow \mathbb{R}^d$.
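In code, this identification between $c$ matrices of size $h \times w$ and a vector in $\mathbb{R}^{chw}$ is just a reshape. Here is a minimal NumPy sketch (the variable names and the toy sizes are my own, not from any reference):

```python
import numpy as np

# A toy image: c = 3 channels, each an h x w grid of intensities.
c, h, w = 3, 4, 5
rng = np.random.default_rng(0)
X = rng.random((c, h, w))  # X[k] is the h x w matrix for color k

# The same data viewed as a single vector in R^{chw}.
x_vec = X.reshape(c * h * w)
assert x_vec.shape == (60,)

# Recover the k-th channel matrix from the flat vector.
k = 1
assert np.array_equal(x_vec.reshape(c, h, w)[k], X[k])
```

The two views carry exactly the same information; which one we use is a matter of convenience.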

Believing in universality results, we may try to find such a map by repeatedly applying SGD to an FNN. However, notice that we quite likely do not need the full information of the input image for this task. When we are identifying whether a given image contains a dog or a cat, it is possible to do so even if we are given just the silhouette of the image. That is, there are many maps $\mathbb{R}^{chw} \rightarrow \mathbb{R}^2$ that we can discard by requiring that the maps apply filters that forget the full information of a given image, leaving only the necessary silhouette. (Here, we have $d = 2$ for 'dog' and 'cat'.)


Reference

For the rest of this posting, we mostly follow the lecture notes by Smets, an excellent reference for mathematicians pursuing deep learning. However, we do not follow them line by line, and our notation is slightly different. Moreover, we add an exposition of explicitly computing the weight matrix of a convolutional layer to compare it with that of a fully connected layer (terminologies to be explained below).


Convolutional layers

Each layer used in an FNN consists of a weight matrix, a bias vector, and an activation map. Such a layer is called a fully connected layer. What we now discuss is a way to add "filtering" to a fully connected layer. Again, our input is of the form 
$$X = (X[0], X[1], \dots, X[c-1]) \in \mathbb{R}^{chw},$$
where each $X[k]$ is viewed as an $h \times w$ real matrix. We want to filter each $X[k]$, and it is done by a linear operation using an $m \times m$ matrix, which we call a kernel (also known as a filter or a weight). Often, we want to make sure that there are enough filters for each $X[k]$, so we give $c'$ filters $K[0,k], K[1,k], \dots, K[c'-1,k]$ to each $X[k]$.

Remark. We choose $m$ such that $1 \leq m < \min(h, w)$. Usually, we use $m = 3$ for image classification, but larger $m$ is also used depending on the given task.

That is, we have $c'c$ filters, which account for $c'cm^2$ (trainable) parameters. We now explain how these filters construct a weight matrix, which gives a linear map $\mathbb{R}^{chw} \rightarrow \mathbb{R}^{c'(h-m+1)(w-m+1)}$ for what we shall call a (2-dimensional) convolutional layer. That is, the size of the weight matrix will be $c'(h-m+1)(w-m+1) \times chw$.

Remark. Note that such a weight matrix would have had $c'chw(h-m+1)(w-m+1)$ trainable parameters if our layer were a fully connected layer. Since $m$ is usually small, this is much more than the $c'cm^2$ parameters above. For a fully connected layer, the bias vector would contribute $c'(h-m+1)(w-m+1)$ trainable parameters, while for a convolutional layer, we only allow $c'$ trainable parameters for the bias vector: one per output channel, shared across all spatial positions. (There does not seem to be any deep mathematical reason for this other than simplifying the resulting features.)
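To see the scale of this saving, here is a quick sketch with illustrative numbers of my own (a $32 \times 32$ RGB input with $c' = 16$ and $m = 3$); none of these specific values come from the text above:

```python
# Compare trainable parameter counts for one layer mapping
# R^{chw} -> R^{c'(h-m+1)(w-m+1)}.
c, h, w = 3, 32, 32      # input channels and spatial size (illustrative)
c_out, m = 16, 3         # number of output channels c' and kernel size m

out_dim = c_out * (h - m + 1) * (w - m + 1)

conv_weights = c_out * c * m * m         # c'cm^2
fc_weights = out_dim * (c * h * w)       # c'chw(h-m+1)(w-m+1)
conv_bias, fc_bias = c_out, out_dim      # c' vs c'(h-m+1)(w-m+1)

print(conv_weights, fc_weights)  # prints: 432 44236800
```

Even at this modest input size, the convolutional layer needs five orders of magnitude fewer weight parameters than the corresponding fully connected layer.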

Recipe for a convolutional layer: cross-correlation. Now, how do we get the weight matrix $W$ from the $c'c$ filters $(K[l,k] : 0 \leq l \leq c'-1, 0 \leq k \leq c-1)$, which are $m \times m$ real matrices? We use an operation called (discrete) cross-correlation. Given $X  = (X[0], X[1], \dots, X[c-1]) \in \mathbb{R}^{chw}$, we define 
$$W_X =  (W_X[0], W_X[1], \dots, W_X[c'-1]) \in \mathbb{R}^{c'(h-m+1)(w-m+1)} \simeq \mathbb{R}^{c'} \otimes \mathrm{M}_{(h-m+1) \times (w-m+1)} (\mathbb{R})$$
by defining $W_X[l]$ as follows:

$$W_X[l] := \sum_{k=0}^{c-1} K[l,k] \star X[k],$$

where the $(i,j)$-entry of $K[l,k] \star X[k]$ is defined as follows:

$$(K[l,k] \star X[k])_{i, j} := \sum_{i',j'=0}^{m-1} K[l,k]_{i',j'} X[k]_{i + i', j+j'}$$

for $0 \leq i \leq h-m$ and $0 \leq j \leq w-m$. 

It is important to note that the above process indeed defines a linear map $W : \mathbb{R}^{chw} \rightarrow \mathbb{R}^{c'(h-m+1)(w-m+1)}$: with the kernels fixed, each entry of $W_X$ is a linear function of the entries of $X$.
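The defining formulas can be sketched directly in NumPy; the function name `cross_correlate` and the loop-based implementation are my own, chosen for transparency rather than speed:

```python
import numpy as np

def cross_correlate(K, X):
    """Compute W_X[l] = sum_k K[l,k] * X[k] (cross-correlation).

    K has shape (c', c, m, m) and X has shape (c, h, w);
    the result has shape (c', h-m+1, w-m+1), as in the formulas above.
    """
    c_out, c, m, _ = K.shape
    _, h, w = X.shape
    out = np.zeros((c_out, h - m + 1, w - m + 1))
    for l in range(c_out):
        for i in range(h - m + 1):
            for j in range(w - m + 1):
                # (i, j)-entry: sum over channels k and kernel offsets i', j'
                out[l, i, j] = np.sum(K[l] * X[:, i:i + m, j:j + m])
    return out

rng = np.random.default_rng(0)
K = rng.random((1, 2, 3, 3))   # c' = 1, c = 2, m = 3
X = rng.random((2, 4, 5))      # h = 4, w = 5
assert cross_correlate(K, X).shape == (1, 2, 3)
```

Note that linearity in $X$ is visible in the code: the output is a sum of fixed multiples of entries of `X`.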

Remark. Cross-correlation is convolution with the kernel flipped (i.e., reflected in both indices), which seems to be where the names convolutional layer and convolutional neural network come from. The convolutional layer we describe is exactly the operation implemented by PyTorch's convolutional layers.

Example. Let $(c', c, m, h, w) = (1, 2, 3, 4, 5)$. Since $c' = 1$, we drop the first index and write $K[k]$ for $K[0,k]$. Then we have the following setup:
  • $X = (X[0], X[1]) \in \mathbb{R}^{2} \otimes \mathbb{R}^{4 \cdot 5}$;
  • $K[0], K[1]$ are $3 \times 3$ matrices;
  • $W_X$ is a $2 \times 3$ matrix given by
$$W_X = K[0] \star X[0] +  K[1] \star X[1].$$

For $0 \leq i \leq 1$ and $0 \leq j \leq 2$, the $(i,j)$-entry of $K[k] \star X[k]$ for fixed $k \in \{0, 1\}$ is

$$\begin{align*} (K[k] \star X[k])_{i, j} &:= \sum_{i',j'=0}^{2} K[k]_{i',j'} X[k]_{i + i', j+j'} \\ &= K[k]_{0,0} X[k]_{i, j} + K[k]_{0,1} X[k]_{i, j+1} + K[k]_{0,2} X[k]_{i, j+2} \\ &+ K[k]_{1,0} X[k]_{i+1, j} + K[k]_{1,1} X[k]_{i+1, j+1} + K[k]_{1,2} X[k]_{i+1, j+2} \\ &+ K[k]_{2,0} X[k]_{i+2, j} + K[k]_{2,1} X[k]_{i+2, j+1} + K[k]_{2,2} X[k]_{i+2, j+2}. \end{align*}$$

Let's describe $W : \mathbb{R}^{40} \rightarrow \mathbb{R}^{6}$ as a $6 \times 40$ matrix to see which entries that would be trainable parameters in a fully connected layer are forced to be zero or shared in a convolutional layer. The standard basis of $\mathbb{R}^{40} = \mathbb{R}^{2 \cdot 4 \cdot 5}$ in our setting is given as 
  • $(e_{00}, 0), (e_{01}, 0), (e_{02}, 0), (e_{03}, 0), (e_{04}, 0),$
  • $(e_{10}, 0), (e_{11}, 0), (e_{12}, 0), (e_{13}, 0), (e_{14}, 0),$
  • $(e_{20}, 0), (e_{21}, 0), (e_{22}, 0), (e_{23}, 0), (e_{24}, 0),$
  • $(e_{30}, 0), (e_{31}, 0), (e_{32}, 0), (e_{33}, 0), (e_{34}, 0),$
  • $(0, e_{00}), (0, e_{01}), (0, e_{02}), (0, e_{03}), (0, e_{04}),$
  • $(0, e_{10}), (0, e_{11}), (0, e_{12}), (0, e_{13}), (0, e_{14}),$
  • $(0, e_{20}), (0, e_{21}), (0, e_{22}), (0, e_{23}), (0, e_{24}),$
  • $(0, e_{30}), (0, e_{31}), (0, e_{32}), (0, e_{33}), (0, e_{34}),$
where $e_{ij}$ is the $4 \times 5$ matrix whose $(i,j)$-entry is $1$ while all the other entries are $0$ (so $0 \leq i \leq 3$ and $0 \leq j \leq 4$). The value of the first standard vector $(e_{00}, 0)$ under $W$ is
$$W_{(e_{00}, 0)} = \begin{bmatrix} K[0]_{0,0} & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$ The value of the standard vector $(e_{01}, 0)$ under $W$ is
$$W_{(e_{01}, 0)} = \begin{bmatrix} K[0]_{0,1} & K[0]_{0,0} & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$
The value of the standard vector $(e_{02}, 0)$ under $W$ is
$$W_{(e_{02}, 0)} = \begin{bmatrix} K[0]_{0,2} & K[0]_{0,1} & K[0]_{0,0} \\ 0 & 0 & 0 \end{bmatrix}.$$
The value of the standard vector $(e_{03}, 0)$ under $W$ is
$$W_{(e_{03}, 0)} = \begin{bmatrix} 0 & K[0]_{0,2} & K[0]_{0,1} \\ 0 & 0 & 0 \end{bmatrix}.$$
The value of the standard vector $(e_{04}, 0)$ under $W$ is
$$W_{(e_{04}, 0)} = \begin{bmatrix} 0 & 0 & K[0]_{0,2} \\ 0 & 0 & 0 \end{bmatrix}.$$

Let's try two more before we generalize: the value of the standard vector $(e_{12}, 0)$ under $W$ is
$$W_{(e_{12}, 0)} = \begin{bmatrix} K[0]_{1,2} & K[0]_{1,1} & K[0]_{1,0} \\ K[0]_{0,2} & K[0]_{0,1} & K[0]_{0,0} \end{bmatrix}.$$

The value of the standard vector $(e_{13}, 0)$ under $W$ is
$$W_{(e_{13}, 0)} = \begin{bmatrix} 0 & K[0]_{1,2} & K[0]_{1,1} \\ 0 & K[0]_{0,2} & K[0]_{0,1} \end{bmatrix}.$$

In general, we note that
$$W_{(e_{i'',j''}, 0)} = \begin{bmatrix} K[0]_{i'',j''} & K[0]_{i'',j''-1} & K[0]_{i'',j''-2} \\ K[0]_{i''-1,j''} & K[0]_{i''-1,j''-1} & K[0]_{i''-1,j''-2} \end{bmatrix},$$
where $K[0]_{i',j'}$ with an index out of bounds is considered to be $0$. Likewise, we have
$$W_{(0, e_{i'',j''})} = \begin{bmatrix} K[1]_{i'',j''} & K[1]_{i'',j''-1} & K[1]_{i'',j''-2} \\ K[1]_{i''-1,j''} & K[1]_{i''-1,j''-1} & K[1]_{i''-1,j''-2} \end{bmatrix}.$$

Thus, the $6 \times 40$ matrix representation of $W$ is given by concatenating the following two $6 \times 20$ matrices, where each $2 \times 3$ matrix $W_{(e_{i''j''}, 0)}$ or $W_{(0, e_{i''j''})}$ is flattened row by row into a column vector in $\mathbb{R}^6$:
  1. $W_0 := [W_{(e_{00},0)}, W_{(e_{01},0)}, W_{(e_{02},0)}, W_{(e_{03},0)}, \dots, W_{(e_{31},0)}, W_{(e_{32},0)}, W_{(e_{33},0)}, W_{(e_{34},0)}]$ and
  2. $W_1 := [W_{(0, e_{00})}, W_{(0, e_{01})}, W_{(0,e_{02})}, W_{(0, e_{03})}, \dots, W_{(0, e_{31})}, W_{(0, e_{32})}, W_{(0, e_{33})}, W_{(0, e_{34})}]$.
More explicitly, the first matrix $W_0$ can be given by concatenating the following four $6 \times 5$ matrices:
  1. $[W_{(e_{00},0)}, W_{(e_{01},0)}, W_{(e_{02},0)}, W_{(e_{03},0)}, W_{(e_{04},0)}] = \begin{bmatrix} K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} & 0 & 0 \\ 0 & K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} & 0 \\ 0 & 0 & K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix},$
  2. $[W_{(e_{10},0)}, \dots, W_{(e_{14},0)}] = \begin{bmatrix} K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} & 0 & 0 \\ 0 & K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} & 0 \\ 0 & 0 & K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} \\ K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} & 0 & 0 \\ 0 & K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} & 0 \\ 0 & 0 & K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} \end{bmatrix},$
  3. $[W_{(e_{20},0)}, \dots, W_{(e_{24},0)}] = \begin{bmatrix} K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} & 0 & 0 \\ 0 & K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} & 0 \\ 0 & 0 & K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} \\ K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} & 0 & 0 \\ 0 & K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} & 0 \\ 0 & 0 & K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} \end{bmatrix},$ and
  4. $[W_{(e_{30},0)}, \dots, W_{(e_{34},0)}] = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} & 0 & 0 \\ 0 & K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} & 0 \\ 0 & 0 & K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} \end{bmatrix}.$
The four $6 \times 5$ matrices that concatenate to $W_1$ are given similarly by replacing $K[0]$ with $K[1]$ in the above expressions.

Remark. Note that the arguments given in the above example generalize to compute the weight matrix of any convolutional layer from its filters. In comparison with a fully connected layer, the weight matrix of a convolutional layer is sparser (i.e., many of its entries are constantly zero), and its trainable parameters are repeated (shared) across entries.
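As a numerical sanity check of this remark, one can assemble the weight matrix column by column from the images of the standard basis vectors, exactly as in the example, and verify both the matrix form and the weight sharing. This is a sketch of my own; the helper `cross_correlate` is a loop-based implementation of the defining formulas, and its name is not from any reference:

```python
import numpy as np

def cross_correlate(K, X):
    # K has shape (c', c, m, m), X has shape (c, h, w);
    # returns W_X with shape (c', h-m+1, w-m+1).
    c_out, c, m, _ = K.shape
    _, h, w = X.shape
    out = np.zeros((c_out, h - m + 1, w - m + 1))
    for l in range(c_out):
        for i in range(h - m + 1):
            for j in range(w - m + 1):
                out[l, i, j] = np.sum(K[l] * X[:, i:i + m, j:j + m])
    return out

# Example parameters (c', c, m, h, w) = (1, 2, 3, 4, 5) from the post.
c_out, c, m, h, w = 1, 2, 3, 4, 5
rng = np.random.default_rng(1)
K = rng.random((c_out, c, m, m))

# Build the matrix of W: column idx is the (row-major flattened)
# image of the idx-th standard basis vector of R^{chw}.
n_in, n_out = c * h * w, c_out * (h - m + 1) * (w - m + 1)
W = np.zeros((n_out, n_in))
for idx in range(n_in):
    e = np.zeros(n_in)
    e[idx] = 1.0
    W[:, idx] = cross_correlate(K, e.reshape(c, h, w)).reshape(n_out)

assert W.shape == (6, 40)
# The matrix reproduces the cross-correlation on an arbitrary input.
X = rng.random((c, h, w))
assert np.allclose(W @ X.reshape(n_in), cross_correlate(K, X).reshape(n_out))
# Weight sharing: the nonzero entries repeat only c * m^2 = 18 kernel values.
assert len(np.unique(np.round(W[W != 0], 12))) <= c * m * m
```

The last assertion makes the sparsity and parameter sharing concrete: the $6 \times 40$ matrix has $240$ entries but draws its nonzero values from only $18$ kernel entries.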

Intuition / Visualization

See this excellent visualization by Francesco Visin.

What's next

We have only described a specific way to construct a layer of a neural network whose domain is $\mathbb{R}^{chw}$. In the next few postings, we will continue by explaining how to construct a full network $\mathbb{R}^{chw} \rightarrow \mathbb{R}^{d}$ from here.
