Purpose of this blog
The goal of this blog is to unravel ideas in artificial intelligence (AI) and machine learning (ML) for people who have some training in mathematics but not necessarily in computer science. This does not mean that I will always "prove" things. Rather, I will often try to find heuristic explanations that are mathematically sound or summarize the ideas of some important proofs.
As the first posting of this blog, we start off by understanding how a neural network works, which is central to AI. Of course, this is a vast topic, so for now we will only study the essential principles of how it works and discuss more specific topics in separate postings. The content of this posting is technically about feedforward neural networks, which are neural networks of the simplest architecture. However, at the end I will try to explain some intuition about how to generalize to other architectures, so that they feel more natural when we discuss them in later postings.
Remark about references. I first encountered this content in the excellent expository paper by Higham and Higham, but the discussion in this posting will be more general than the Highams' article. A more advanced but equally well-written reference is the lecture notes by Smets, which I plan to use for convolutional neural networks in a later posting. Other references are given within the content as needed.
Remark. If I add any implementation (which will most likely use PyTorch for neural networks), I will add a link to another page to show the code, because there does not seem to be a canonical way to display code on Blogger.com. I plan to use Medium in the future, which does seem to support code better, but I will use Medium only for a specific subset of this blog.
Overview
Just like any other predictive modeling method in machine learning (ML), a neural network is a way to model a map
$$\mathbb{R}^m \rightarrow \mathbb{R}^d$$
that predicts a future outcome. Here, the integer $m \geq 1$ can be thought of as the number of features, and $d$ depends on the purpose. For example, if we deal with a regression model whose target variable is valued in $\mathbb{R}$, we have $d = 1$. If we deal with a classification model with $5$ classes, then $d = 5$.
Remark. Of course, depending on the context, the integer $m$ may not necessarily mean the number of features, but I personally find it easier to think like this so that I can compare neural networks to other more traditional ML models.
Typically, a neural network consists of layers. The above description of a neural network already shows two different layers:
- input layer: $\mathbb{R}^m$ and
- output layer: $\mathbb{R}^d$.
In general, a neural network consists of more layers than these two. That is, it is a composition of the following form:
$$\mathbb{R}^m \rightarrow \mathbb{R}^{m_1} \rightarrow \mathbb{R}^{m_2} \rightarrow \cdots \rightarrow \mathbb{R}^{m_k} \rightarrow \mathbb{R}^d$$
with integer $k \geq 0$. The layers that are neither the input layer nor the output layer are called the hidden layers; the number of hidden layers above is $k$. Writing $m_0 := m$ and $m_{k+1} := d$, each map $\mathbb{R}^{m_{i-1}} \rightarrow \mathbb{R}^{m_{i}}$ for $i = 1, \dots, k+1$ is of the form
$$\boldsymbol{x} \mapsto \sigma_i(W_i\boldsymbol{x} + \boldsymbol{b}_i),$$
which can be viewed as a composition
$$\mathbb{R}^{m_{i-1}} \overset{W_i}{\longrightarrow} \mathbb{R}^{m_{i}} \overset{+ \boldsymbol{b}_i}{\longrightarrow} \mathbb{R}^{m_{i}} \overset{\sigma_i}{\longrightarrow} \mathbb{R}^{m_{i}}$$
where
- $W_i$ is an $m_{i} \times m_{i-1}$ real matrix, called the $i$-th weight matrix;
- $\boldsymbol{b}_i \in \mathbb{R}^{m_{i}}$, called the $i$-th bias vector;
- $\sigma_i : \mathbb{R}^{m_{i}} \rightarrow \mathbb{R}^{m_{i}}$ is a map called the $i$-th activation.
Usually, the activations $\sigma_i$ are fixed and defined component-wise. Three commonly used activation functions are given by the following component functions:
- ReLU: $\mathbb{R} \rightarrow \mathbb{R}$ given by $x \mapsto \max(x, 0)$;
- Sigmoid: $\mathbb{R} \rightarrow \mathbb{R}$ given by $x \mapsto 1/(1 + e^{-x})$;
- Hyperbolic tangent: $\mathbb{R} \rightarrow \mathbb{R}$ given by $x \mapsto \tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$.
We shall assume that each component activation function is
- infinitely differentiable, possibly except at finitely many points (i.e., it may fail to be differentiable at finitely many points);
- continuous everywhere.
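To make this concrete, here is a minimal pure-Python sketch of one layer $\boldsymbol{x} \mapsto \sigma_i(W_i\boldsymbol{x} + \boldsymbol{b}_i)$ using the activations above; the tiny network and all of its numbers are made up for illustration.

```python
import math

# Component-wise activation functions from the list above.
def relu(x):
    return max(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)  # (e^x - e^-x) / (e^x + e^-x)

def layer(W, b, x, act):
    # One layer: x |-> act(W x + b), with act applied component-wise.
    Wx = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]
    return [act(Wx[i] + b[i]) for i in range(len(b))]

# A toy network R^2 -> R^3 -> R^1 (one hidden layer, so k = 1).
W1 = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]  # 3 x 2 weight matrix
b1 = [0.0, 1.0, -0.5]
W2 = [[1.0, 1.0, 1.0]]                       # 1 x 3 weight matrix
b2 = [0.0]

x = [1.0, 2.0]
h = layer(W1, b1, x, relu)     # hidden layer: [0.0, 2.5, 0.0]
y = layer(W2, b2, h, sigmoid)  # output layer: sigmoid(2.5)
```

In an actual implementation one would use something like torch.nn.Linear together with torch.relu and torch.sigmoid, but the arithmetic is exactly this.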
What to model?
Our modeling (or fitting) process fixes $m, d, k, m_1, \dots, m_k,$ and the activations $\sigma_1, \dots, \sigma_{k}, \sigma_{k+1}$, and asks which weight matrices and bias vectors give us the best-fitting model. This means that the number of parameters to be determined is
$$m_1m + m_2 m_1 + \cdots + m_km_{k-1} + dm_k + m_1 + m_2 + \cdots + m_k + d,$$
unless we restrict the choices of weight matrices and bias vectors further. (This is how more advanced neural networks choose their models.) Thus, if we do not introduce extra constraints but choose large $m_1, \dots, m_k$, we need many equations to constrain these parameters. Intuitively, this means that our training data needs to be large in order for our model to avoid guessing randomly among many free choices.
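The parameter count above can be sketched in a few lines of Python (the helper name is my own):

```python
def num_params(m, hidden, d):
    # Layer widths m_0 = m, m_1, ..., m_k, m_{k+1} = d.
    dims = [m] + list(hidden) + [d]
    # Each layer i contributes an m_i x m_{i-1} weight matrix
    # and an m_i bias vector.
    weights = sum(dims[i] * dims[i - 1] for i in range(1, len(dims)))
    biases = sum(dims[1:])
    return weights + biases

# Example: m = 4 features, hidden widths (8, 8), d = 3 outputs:
# weights 4*8 + 8*8 + 8*3 = 120, biases 8 + 8 + 3 = 19, total 139.
n = num_params(4, (8, 8), 3)
```

Even this small example shows how quickly the count grows with the hidden widths.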
How to fit?
Fitting depends on the purpose of the task, which is incorporated into a specific definition of a loss function
$$L : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}.$$
Given a training set $T \subset \mathbb{R}^m \times \mathbb{R}^d$, which is a finite subset, we first partition $T$ into batches, say $T_1, \dots, T_r$. Fitting goes through every batch in one epoch. You may run multiple epochs if that seems needed, although there does not seem to be much benefit in going through multiple epochs. (I could be wrong on this, and if I am, I will change my opinion later.)
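A minimal sketch of this partition into batches (the shuffling step is customary but not required by the definition; all names are my own):

```python
import random

def make_batches(T, batch_size, seed=0):
    # Shuffle the training set and partition it into batches T_1, ..., T_r.
    T = list(T)
    random.Random(seed).shuffle(T)
    return [T[i:i + batch_size] for i in range(0, len(T), batch_size)]

T = [(i, 2 * i) for i in range(10)]   # toy training pairs (x, y)
batches = make_batches(T, batch_size=4)
# r = 3 batches of sizes 4, 4, 2; together they partition T.
```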
The fitting process on $T_j$ goes as follows. Note that deciding the model $\hat{f} : \mathbb{R}^m \rightarrow \mathbb{R}^d$ is equivalent to deciding the corresponding parameters $W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}$. That is, we may write
$$\hat{f} = \hat{f}_{W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}}.$$
We can then consider the map
$$\mathbb{R}^{m_1m + m_1} \times \mathbb{R}^{m_{2} m_{1}+m_2} \times \cdots \times \mathbb{R}^{m_{k} m_{k-1} + m_k} \times \mathbb{R}^{dm_k + d} \times \mathbb{R}^m \rightarrow \mathbb{R}^d$$
given by
$$(W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}, \boldsymbol{x}) \mapsto \hat{f}_{W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}}(\boldsymbol{x}).$$
This map is infinitely differentiable, possibly except on finitely many hyperplanes (e.g., if we use any ReLU activation). We then compose this map with the loss function as follows:
$$\tilde{L} : \mathbb{R}^{m_1m + m_1} \times \mathbb{R}^{m_{2} m_{1}+m_2} \times \cdots \times \mathbb{R}^{m_{k} m_{k-1} + m_k} \times \mathbb{R}^{dm_k + d} \times \mathbb{R}^m \times \mathbb{R}^d \rightarrow \mathbb{R}^d \times \mathbb{R}^d \overset{L}{\longrightarrow} \mathbb{R},$$
given by
$$(W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}, \boldsymbol{x}, \boldsymbol{y}) \mapsto L(\hat{f}_{W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}}(\boldsymbol{x}), \boldsymbol{y}).$$
Goal of fitting. Minimize $\tilde{L}$ over the training set $T_j.$ That is, our goal is to choose $W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}$ so that the losses
$$(L(\hat{f}_{W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}}(\boldsymbol{x}), \boldsymbol{y}) : (\boldsymbol{x}, \boldsymbol{y}) \in T_j)$$
are minimized. (What exactly it means to minimize this tuple of losses is discussed below.)
The Highams used the mean squared error (MSE) for the loss function $L$. For the MSE loss function, if all of our activations are identity maps, then the model should be identical to what we get from the simple method of linear regression. (Linear regression would give us $\hat{f}$ directly, without the layered parametrization by weight matrices and bias vectors.)
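To illustrate this equivalence concretely, here is a toy sketch (my own example, not from the Highams' paper): a single layer $x \mapsto wx + b$ with the identity activation, trained by gradient descent on the MSE loss, converges to the closed-form least-squares coefficients of simple linear regression.

```python
# Toy data scattered around a line.
data = [(0.0, 1.1), (1.0, 3.9), (2.0, 7.2), (3.0, 9.8)]
n = len(data)

# Closed-form simple linear regression (least squares).
sx = sum(x for x, _ in data); sy = sum(y for _, y in data)
sxx = sum(x * x for x, _ in data); sxy = sum(x * y for x, y in data)
w_cf = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b_cf = (sy - w_cf * sx) / n

# The same fit by gradient descent on the MSE loss
# (1/n) * sum of (w*x + b - y)^2 over the data.
w, b, h = 0.0, 0.0, 0.05
for _ in range(20000):
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2 * (w * x + b - y) for x, y in data) / n
    w, b = w - h * gw, b - h * gb
# (w, b) agrees with (w_cf, b_cf) up to numerical tolerance.
```

The gradient-descent route is wasteful here, of course; the point is only that both procedures minimize the same function.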
Fix $(\boldsymbol{x}, \boldsymbol{y}) \in T_j$ and consider
$$L_{(\boldsymbol{x}, \boldsymbol{y})} : \mathbb{R}^{m_1m + m_1} \times \mathbb{R}^{m_{2} m_{1}+m_2} \times \cdots \times \mathbb{R}^{m_{k} m_{k-1} + m_k} \times \mathbb{R}^{dm_k + d} \rightarrow \mathbb{R}$$
given by
$$(W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}) \mapsto L(\hat{f}_{W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}}(\boldsymbol{x}), \boldsymbol{y}).$$
Remark. For convenience, we write $\mathbb{R}^N$ to mean the domain of $L_{(\boldsymbol{x}, \boldsymbol{y})}$. That is, we let
$$N := m_1m + m_2 m_1 + \cdots + m_km_{k-1} + dm_k + m_1 + m_2 + \cdots + m_k + d.$$
Now, repeating our goal in simplified notation, we want to find $\boldsymbol{p} = (W_1, \boldsymbol{b}_1, \dots, W_{k+1}, \boldsymbol{b}_{k+1}) \in \mathbb{R}^N$ that minimizes
$$(L_{(\boldsymbol{x}, \boldsymbol{y})}(\boldsymbol{p}) : (\boldsymbol{x}, \boldsymbol{y}) \in T_j).$$
Stochastic descent. Before we go into how to minimize, we should clarify what to minimize. Perhaps we want to minimize
$$\frac{1}{|T_j|}\sum_{(\boldsymbol{x}, \boldsymbol{y}) \in T_j}L_{(\boldsymbol{x}, \boldsymbol{y})}(\boldsymbol{p})^2$$
or
$$\frac{1}{|T_j|}\sum_{(\boldsymbol{x}, \boldsymbol{y}) \in T_j}|L_{(\boldsymbol{x}, \boldsymbol{y})}(\boldsymbol{p})|$$
Either would be fair, but it seems that we use
$$L_j(\boldsymbol{p}) := \frac{1}{|T_j|}\sum_{(\boldsymbol{x}, \boldsymbol{y}) \in T_j}L_{(\boldsymbol{x}, \boldsymbol{y})}(\boldsymbol{p})$$
instead. Often the loss functions are designed to take only non-negative values, in which case this matches one of our intuitive choices; otherwise, we simply regard $L_j$ as a good representative of the loss and call it a day.
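As a quick illustration of this averaged batch loss (a toy sketch with a made-up model and data, not taken from any reference):

```python
def batch_loss(L, f_hat, batch):
    # L_j(p): the average of per-example losses L(f_hat(x), y) over batch T_j.
    return sum(L(f_hat(x), y) for x, y in batch) / len(batch)

def mse(y_hat, y):
    # Squared-error loss for scalar outputs (d = 1).
    return (y_hat - y) ** 2

def f_hat(x):
    # A toy "fitted model" standing in for the neural network.
    return 2 * x

batch = [(1.0, 2.0), (2.0, 5.0)]
loss = batch_loss(mse, f_hat, batch)  # (0.0 + 1.0) / 2 = 0.5
```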
The overall ideas of any optimization algorithm for neural networks seem to be the following:
We first check whether our parameters $\boldsymbol{p}$ minimize $L_1(\boldsymbol{p})$. If so (which almost certainly will not happen), then we are done. Otherwise, we update $\boldsymbol{p}$ so that $L_1(\boldsymbol{p})$ is very likely to be smaller than before. Then we check whether $L_2(\boldsymbol{p})$ is minimized, which almost certainly won't be the case, but then we can update $\boldsymbol{p}$ so that $L_2(\boldsymbol{p})$ is very likely to be smaller than before. Then we move on to $L_3(\boldsymbol{p})$, and repeat this process until we do the same for $L_r(\boldsymbol{p}).$ This descent process of going through one of $L_1, \dots, L_r$ at a time is what the adjective "stochastic" indicates.
Now, each optimization algorithm for stochastic descent may use a different strategy for updating $\boldsymbol{p}$ to make each $L_j(\boldsymbol{p})$ smaller than before. The simplest such algorithm is called stochastic gradient descent (SGD), which is the default setting for the torch.optim.SGD class in the PyTorch library.
For SGD, we consider the gradients $\nabla L_1, \dots, \nabla L_r : \mathbb{R}^N \rightarrow \mathbb{R}^N$ of $L_1, \dots, L_r$, defined possibly except on finitely many hyperplanes. Recall that the gradient of $L_j$ at any point is a vector that points in the direction of steepest increase of $L_j$ at that point. Here is the basic algorithm for SGD:
- Pick $h > 0$, which is called the learning rate; the smaller $h$ is, the more likely each step reduces the loss, but a step with a very small $h$ is also less likely to reduce the loss by a large margin.
- Find any $\boldsymbol{p}_0 \in \mathbb{R}^N$. If $\nabla L_1(\boldsymbol{p}_0) \neq 0$, then it is necessarily true that $L_1$ is NOT minimized at $\boldsymbol{p}_0.$
- Consider $\boldsymbol{p}_1 := \boldsymbol{p}_0 - h \nabla L_1(\boldsymbol{p}_0).$ The smaller $h$ is, the more likely that $L_1(\boldsymbol{p}_1) \leq L_1(\boldsymbol{p}_0).$
- Now, recursively repeat the process for $j = 2, \dots, r$: set $\boldsymbol{p}_j := \boldsymbol{p}_{j-1} - h \nabla L_j(\boldsymbol{p}_{j-1}).$ The smaller $h$ is, the more likely that $L_j(\boldsymbol{p}_j) \leq L_j(\boldsymbol{p}_{j-1}).$
- Use $\boldsymbol{p}_r$ as the parameters of our model.
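Putting the pieces together, here is a toy end-to-end sketch of the SGD recursion $\boldsymbol{p}_j := \boldsymbol{p}_{j-1} - h \nabla L_j(\boldsymbol{p}_{j-1})$, using a two-parameter "network" $\hat{f}(x) = wx + b$ with the MSE loss. All names and numbers are my own illustration, and I run several epochs (contrary to my single-epoch remark above) just so this tiny example converges.

```python
def grad_Lj(p, batch):
    # Gradient of L_j(p) = (1/|T_j|) * sum of (w*x + b - y)^2 over the batch.
    w, b = p
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return (gw, gb)

def sgd_epoch(p, batches, h):
    # One epoch: p_j := p_{j-1} - h * grad L_j(p_{j-1}), for j = 1, ..., r.
    for batch in batches:
        gw, gb = grad_Lj(p, batch)
        p = (p[0] - h * gw, p[1] - h * gb)
    return p

# Points exactly on the line y = 3x + 1, split into r = 4 batches.
data = [(i / 10, 3 * (i / 10) + 1) for i in range(20)]
batches = [data[i:i + 5] for i in range(0, len(data), 5)]

p = (0.0, 0.0)
for _ in range(500):
    p = sgd_epoch(p, batches, h=0.1)
# p is now close to the exact fit (w, b) = (3.0, 1.0).
```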
What to consider in later postings
More complicated architectures for neural networks essentially restrict the choices of parameters during training using various operations. There are also ways other than SGD (but still forms of stochastic descent) to reduce the loss in training. We may also discuss specific loss functions designed for specific tasks.