Convolutional neural networks: 1. convolutional layers (2-dimensional)
From super flexible to somewhat flexible
In the first posting of this entire blog, we discussed how to build the most flexible form of a neural network, called the feedforward neural network (FNN). There are various versions of universal approximation theorems stating that such neural networks can approximate any map $\mathbb{R}^m \rightarrow \mathbb{R}^d$ in a large class (e.g., continuous maps on a compact set with the supremum norm, or $L^p$ maps with the $L^p$-norm); we shall refer to any such statement as a universality result. Unfortunately, no universality result provides an algorithm to find such an FNN.
Disclaimer. I have personally gone through a decent amount of references, and the claim that no universality result provides an algorithm is what seems to be true based on my search, rather than an absolute truth. However, it is evident that stochastic gradient descent (SGD) and its variants cannot guarantee that each update step (whose gradient we compute by backpropagation) actually decreases the loss, so it is safe to say that the learning algorithms we use do not by themselves guarantee approximating an arbitrary given map $\mathbb{R}^m \rightarrow \mathbb{R}^d$.
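As a toy illustration of this point: nothing in the gradient-descent update rule itself guarantees descent. A minimal sketch (plain gradient descent on $f(x) = x^2$, with a learning rate deliberately chosen too large; the setup is mine, not from any reference):

```python
# Gradient descent on f(x) = x^2, whose gradient is f'(x) = 2x.
def f(x):
    return x * x

def grad_f(x):
    return 2.0 * x

x = 1.0
lr = 1.5  # deliberately too large a learning rate
x_new = x - lr * grad_f(x)  # 1.0 - 1.5 * 2.0 = -2.0

# The update overshoots the minimum: the loss goes from 1.0 to 4.0,
# so this single "descent" step actually increases the loss.
print(f(x), f(x_new))
```

With a small enough learning rate this example would of course converge; the point is only that a fixed update rule gives no a priori descent guarantee.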
Remark. Despite the lack of a practical algorithm, universality is itself an interesting phenomenon, which gives us hope that if we keep working hard, maybe we will eventually find an algorithm that produces a (feedforward) neural network approximating the map that best describes the training data as well as unseen data. There are many papers about universality, but among them, I find Park et al. (2021) very interesting. Their result tells us that there are many maps $\mathbb{R}^m \rightarrow \mathbb{R}^d$ that we cannot approximate well by an FNN unless all of the hidden layers have width at least $\max(m+1, d)$.
Hidden simplicity. Then what do we need to do in order to find a practical algorithm for approximating a given map? Note that in practice we don't have the map; otherwise, we would simply use it. We believe that a map that governs the training data, as well as any unseen data, exists up to some error, and we want to get as close to it as possible by choosing finitely many real numbers (i.e., parameters) that define a neural network approximating this imagined map.
Maybe, such a map is more specific than the ones appearing in the universality results due to "hidden simplicity" specific to the given task that we may not have considered when we thought about all kinds of possible maps. What do we mean by this? Let's consider the example of image classification.
Let $c$ be the number of colors we use (e.g., $c = 3$ when we use RGB) for an image, which is often called the number of channels. For each color, we have a grid of pixels; let
- $h$ be the height, the number of pixels in the vertical direction, and
- $w$ be the width, the number of pixels in the horizontal direction.
The space of input images is then
$$\mathbb{R}^c \otimes \mathbb{R}^h \otimes \mathbb{R}^w \simeq \mathbb{R}^{c h w},$$
but it is more intuitive to think of each element of this domain as $c$ matrices of size $h \times w$ over $\mathbb{R}$. Any such input element $(X[0], X[1], \dots, X[c-1])$ can also be regarded as a vector in $\mathbb{R}^{chw}$. We want to model a map that, given an image, tells us which of $d$ labels the image is likely to represent. Such a map should be of the form $\mathbb{R}^{chw} \rightarrow \mathbb{R}^d$.
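Concretely, the identification $\mathbb{R}^c \otimes \mathbb{R}^h \otimes \mathbb{R}^w \simeq \mathbb{R}^{chw}$ is just a reshape. A small NumPy sketch (the sizes are arbitrary, chosen only for illustration):

```python
import numpy as np

c, h, w = 3, 4, 5  # channels, height, width (illustrative sizes)
X = np.arange(c * h * w, dtype=float).reshape(c, h, w)

# X[k] is the h x w pixel matrix for color channel k.
assert X[0].shape == (h, w)

# Flattening gives the corresponding vector in R^{chw}, and reshaping
# recovers the original stack of matrices: the two views carry
# exactly the same information.
v = X.reshape(-1)  # vector in R^{chw}
assert v.shape == (c * h * w,)
assert np.array_equal(v.reshape(c, h, w), X)
```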
Believing in universality results, we may try to find such a map by repeatedly applying SGD to an FNN. However, notice that we quite likely do not need the full information about the input image for this task. When we are identifying whether a given image contains a dog or a cat, it is possible to do so even if we are given only the silhouette of the image. That is, there are many maps $\mathbb{R}^{chw} \rightarrow \mathbb{R}^2$ that we can discard by requiring that the map apply filters that forget part of the information in a given image, keeping only the necessary silhouette. (Here, we have $d = 2$ for the labels 'dog' and 'cat'.)
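To make the "filter" idea concrete, here is a sketch that applies a fixed $3 \times 3$ edge-detecting (Laplacian-like) kernel to a toy grayscale image by valid, stride-1 cross-correlation; the output keeps outlines and discards flat regions, i.e., it deliberately forgets part of the input. The image and kernel are my own illustrative choices, not from the post:

```python
import numpy as np

def cross_correlate(image, kernel):
    """Valid, stride-1 cross-correlation of a 2-D image with a kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for p in range(oh):
        for q in range(ow):
            out[p, q] = np.sum(kernel * image[p:p + kh, q:q + kw])
    return out

# A Laplacian-like kernel: responds at edges, vanishes on constant regions.
laplacian = np.array([[0.,  1., 0.],
                      [1., -4., 1.],
                      [0.,  1., 0.]])

# Toy grayscale image: a bright square on a dark background.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0

edges = cross_correlate(img, laplacian)
# The interior of the square and the background map to 0;
# only the outline of the square survives.
```

Convolutional layers learn such kernels from data instead of fixing them by hand, which is the subject of the rest of this posting.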
Reference
For the rest of the posting, we mostly follow the lecture notes by Smets, which are an excellent reference for mathematicians pursuing deep learning. However, we do not follow them line by line, and our notation differs slightly from the reference. Moreover, we add an exposition that explicitly computes the weight matrix of a convolutional layer, in order to compare it with that of a fully connected layer (terminology to be explained below).
Convolutional layers
As a running example, suppose that
- $X = (X[0], X[1]) \in \mathbb{R}^{2} \otimes \mathbb{R}^{4 \cdot 5}$ is an input with $2$ channels, each a $4 \times 5$ matrix, and
- $K[0], K[1]$ are $3 \times 3$ kernel matrices, one for each channel.
The convolutional layer determined by these kernels (ignoring bias and activation) is the linear map
$$\mathbb{R}^{2} \otimes \mathbb{R}^{4 \cdot 5} \rightarrow \mathbb{R}^{2 \cdot 3}, \quad X \mapsto K[0] \star X[0] + K[1] \star X[1],$$
where $\star$ denotes the valid, stride-$1$ cross-correlation
$$(K[k] \star X[k])_{p,q} := \sum_{a=0}^{2} \sum_{b=0}^{2} K[k]_{a,b} \, X[k]_{p+a, q+b}, \quad 0 \leq p \leq 1, \ 0 \leq q \leq 2.$$
The output is a $2 \times 3$ matrix because a $3 \times 3$ kernel fits inside a $4 \times 5$ grid at exactly $2 \times 3$ positions; we flatten it row-major into a vector in $\mathbb{R}^{6}$.
Being linear, this map is represented by a $6 \times 40$ weight matrix $W$ once we order the standard basis of the input space $\mathbb{R}^{2} \otimes \mathbb{R}^{4 \cdot 5}$ as
- $(e_{00}, 0), (e_{01}, 0), (e_{02}, 0), (e_{03}, 0), (e_{04}, 0),$
- $(e_{10}, 0), (e_{11}, 0), (e_{12}, 0), (e_{13}, 0), (e_{14}, 0),$
- $(e_{20}, 0), (e_{21}, 0), (e_{22}, 0), (e_{23}, 0), (e_{24}, 0),$
- $(e_{30}, 0), (e_{31}, 0), (e_{32}, 0), (e_{33}, 0), (e_{34}, 0),$
- $(0, e_{00}), (0, e_{01}), (0, e_{02}), (0, e_{03}), (0, e_{04}),$
- $(0, e_{10}), (0, e_{11}), (0, e_{12}), (0, e_{13}), (0, e_{14}),$
- $(0, e_{20}), (0, e_{21}), (0, e_{22}), (0, e_{23}), (0, e_{24}),$
- $(0, e_{30}), (0, e_{31}), (0, e_{32}), (0, e_{33}), (0, e_{34}),$
where $e_{ij}$ denotes the $4 \times 5$ matrix whose only nonzero entry is a $1$ in row $i$, column $j$. Writing $W_v \in \mathbb{R}^{6}$ for the column of $W$ indexed by a basis vector $v$ (i.e., the image of $v$ under the layer), we have $W = [W_0, W_1]$, where
- $W_0 := [W_{(e_{00},0)}, W_{(e_{01},0)}, W_{(e_{02},0)}, W_{(e_{03},0)}, \dots, W_{(e_{31},0)}, W_{(e_{32},0)}, W_{(e_{33},0)}, W_{(e_{34},0)}]$ and
- $W_1 := [W_{(0, e_{00})}, W_{(0, e_{01})}, W_{(0, e_{02})}, W_{(0, e_{03})}, \dots, W_{(0, e_{31})}, W_{(0, e_{32})}, W_{(0, e_{33})}, W_{(0, e_{34})}]$.
Since the output entry at position $(p, q)$ involves $X[0]_{i,j}$ with coefficient $K[0]_{i-p, j-q}$ (interpreted as $0$ whenever $i - p$ or $j - q$ falls outside $\{0, 1, 2\}$), the blocks of $W_0$ are
- $[W_{(e_{00},0)}, W_{(e_{01},0)}, W_{(e_{02},0)}, W_{(e_{03},0)}, W_{(e_{04},0)}] = \begin{bmatrix} K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} & 0 & 0 \\ 0 & K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} & 0 \\ 0 & 0 & K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix},$
- $[W_{(e_{10},0)}, W_{(e_{11},0)}, W_{(e_{12},0)}, W_{(e_{13},0)}, W_{(e_{14},0)}] = \begin{bmatrix} K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} & 0 & 0 \\ 0 & K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} & 0 \\ 0 & 0 & K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} \\ K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} & 0 & 0 \\ 0 & K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} & 0 \\ 0 & 0 & K[0]_{0,0} & K[0]_{0,1} & K[0]_{0,2} \end{bmatrix},$
- $[W_{(e_{20},0)}, W_{(e_{21},0)}, W_{(e_{22},0)}, W_{(e_{23},0)}, W_{(e_{24},0)}] = \begin{bmatrix} K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} & 0 & 0 \\ 0 & K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} & 0 \\ 0 & 0 & K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} \\ K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} & 0 & 0 \\ 0 & K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} & 0 \\ 0 & 0 & K[0]_{1,0} & K[0]_{1,1} & K[0]_{1,2} \end{bmatrix},$ and
- $[W_{(e_{30},0)}, W_{(e_{31},0)}, W_{(e_{32},0)}, W_{(e_{33},0)}, W_{(e_{34},0)}] = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} & 0 & 0 \\ 0 & K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} & 0 \\ 0 & 0 & K[0]_{2,0} & K[0]_{2,1} & K[0]_{2,2} \end{bmatrix}.$
The blocks of $W_1$ are obtained by replacing $K[0]$ with $K[1]$ throughout. Note how sparse and repetitive $W$ is compared to a generic fully connected $6 \times 40$ weight matrix: it has only $18$ free parameters (the entries of $K[0]$ and $K[1]$), each reused at several positions.
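To sanity-check this block structure, here is a small NumPy sketch (variable names are mine, not from the reference) that builds the weight matrix column by column, by feeding each standard basis vector through the convolution, and then verifies that multiplying by the flattened input reproduces the direct cross-correlation output:

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, kh, kw = 4, 5, 3, 3                # input 4 x 5, kernels 3 x 3
oh, ow = h - kh + 1, w - kw + 1          # output size: 2 x 3
K = rng.standard_normal((2, kh, kw))     # kernels K[0], K[1]

def conv(X):
    """Valid, stride-1 cross-correlation summed over the 2 channels."""
    out = np.zeros((oh, ow))
    for p in range(oh):
        for q in range(ow):
            out[p, q] = np.sum(K * X[:, p:p + kh, q:q + kw])
    return out

# The column of W indexed by the basis element (channel k, pixel (i, j))
# is the image of that basis vector: the convolution of an input that is
# 1 at (k, i, j) and 0 elsewhere.
W = np.zeros((oh * ow, 2 * h * w))
for k in range(2):
    for i in range(h):
        for j in range(w):
            E = np.zeros((2, h, w))
            E[k, i, j] = 1.0
            W[:, k * h * w + i * w + j] = conv(E).reshape(-1)

# The entry of W at output row (p, q) and input column (k, i, j) is
# K[k][i - p, j - q] (zero outside the kernel's index range), which is
# exactly the banded block pattern written out above.
X = rng.standard_normal((2, h, w))
assert np.allclose(W @ X.reshape(-1), conv(X).reshape(-1))
```

Building $W$ by applying the layer to basis vectors is, of course, only instructive; in practice one stores the $18$ kernel entries and never materializes the $6 \times 40$ matrix.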