In this post, I’ll share notes on the ideas I found insightful and interesting while reading Understanding Deep Learning, a book I discovered through MIT’s public course on the subject.

Currently, I am at Chapter 9, and I’ll keep updating this post as I review my notes.

Ch 1 - 2

The first two chapters just explain the big picture and show how to train a linear regression model (in 2 dimensions). Nothing too interesting.

Time spent: 1 hour

Ch 3: Shallow neural networks

Time spent: 2.5 hours

The basics

The prelude to the fun begins here. Shallow neural networks have just 1 hidden layer with many hidden units. In the book, they consider an example with 3 hidden units (with 1D input and output), and the general formula for such a neural net would be

$$y = \phi_0 + \phi_1\, a[\theta_{10} + \theta_{11} x] + \phi_2\, a[\theta_{20} + \theta_{21} x] + \phi_3\, a[\theta_{30} + \theta_{31} x],$$

where $a[\cdot]$ is the activation function, and most of the time it’s just ReLU. The interpretation/intuition behind building them up is the key. If you define the hidden units as outputs of the activation function, as

$$h_d = a[\theta_{d0} + \theta_{d1} x],$$

then each of these $h_d$’s is a clipped linear function. The final output is just a linear combination of these hidden units (i.e., each $h_d$ is scaled by $\phi_d$ and a bias term $\phi_0$ is added). The whole pipeline is depicted below.

As you can see, the more hidden units you add, the more joints you have in the final function, which is a piecewise linear function. This can be formalized as the Universal Approximation Theorem, which states that any continuous function (in any dimension!) can be approximated to arbitrary precision by a shallow neural net with enough hidden units. In practice, approximating a complicated function would need a LOT of hidden units, and that’s where deep neural nets win out.
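To make the piecewise-linear picture concrete, here is a minimal NumPy sketch of the 1D, three-hidden-unit network above; the weight values are made up purely for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def shallow_net_1d(x, theta, phi):
    """y = phi_0 + sum_d phi_d * relu(theta_d0 + theta_d1 * x)."""
    h = relu(theta[:, 0] + theta[:, 1] * x)  # the 3 hidden units
    return phi[0] + phi[1:] @ h

# Made-up parameters: 3 hidden units, 1D input and output.
theta = np.array([[ 0.0,  1.0],   # joint at x = 0, active for x > 0
                  [-1.0,  1.0],   # joint at x = 1, active for x > 1
                  [ 1.0, -1.0]])  # joint at x = 1, active for x < 1
phi = np.array([0.5, 1.0, -2.0, 0.5])

xs = np.linspace(-2.0, 2.0, 9)
ys = [shallow_net_1d(x, theta, phi) for x in xs]
print(np.round(ys, 2))  # piecewise linear: slope changes only at the joints
```

With these particular weights the output changes slope only at $x = 0$ and $x = 1$, one joint per ReLU.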

General case

In the case of multivariate input and output, the formula can be written as

$$h_d = a\Big[\theta_{d0} + \sum_{i=1}^{D_i} \theta_{di}\, x_i\Big], \qquad y_j = \phi_{j0} + \sum_{d=1}^{D} \phi_{jd}\, h_d,$$

where $D_i$ is the input dimension and $D$ is the number of hidden units. The clean way to write this is to recognize the matrix multiplication in the sums:

$$\mathbf{h} = a[\boldsymbol{\theta}_0 + \boldsymbol{\Theta}\mathbf{x}], \qquad \mathbf{y} = \boldsymbol{\phi}_0 + \boldsymbol{\Phi}\mathbf{h},$$

where $\boldsymbol{\Theta}$ is a matrix of size $D \times D_i$, $\boldsymbol{\Phi}$ is a matrix of size $D_o \times D$ (with $D_o$ the output dimension), and the non-linear activation acts point-wise. One of the problems with shallow neural nets and multivariate outputs is that all the outputs $y_j$ would have ‘joints’ at the same points! You see how this is a problem if the number of hidden units is not huge?
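Here is the same matrix form as a small sketch, with made-up dimensions ($D_i = 4$, $D = 5$, $D_o = 2$) and random parameters:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
D_i, D, D_o = 4, 5, 2                # input dim, hidden units, output dim (made up)

Theta  = rng.normal(size=(D, D_i))   # hidden-layer weights, D x D_i
theta0 = rng.normal(size=D)          # hidden-layer biases
Phi    = rng.normal(size=(D_o, D))   # output weights, D_o x D
phi0   = rng.normal(size=D_o)        # output biases

def shallow_net(x):
    h = relu(theta0 + Theta @ x)     # activation applied point-wise
    return phi0 + Phi @ h

x = rng.normal(size=D_i)
print(shallow_net(x))                # D_o outputs built from the same hidden units
```

Both outputs are linear combinations of the same hidden vector $\mathbf{h}$, which is exactly why they share their joints.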

Number of regions vs hidden units

Note that the higher the input dimension, the more linear regions you get. Intuitively, that kind of makes sense because in higher dimensions the ‘joints’ become hyperplanes, and when you add them up, there are just more ways they can intersect. Quick estimation that I really like: let’s say we have $D = D_i$ (as many hidden units as input dimensions), and each hidden unit gets activated along one of the axes. In 1D, you get activation at $x = 0$, and 2 linear regions. In 2D, you get activation at $x_1 = 0$ (a line!) and $x_2 = 0$ (a line!), and you generate 4 linear regions. In 3D, you basically intersect 3 planes and get 8 octants. I guess you see the pattern. You can create $2^{D_i}$ linear regions!!!
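A tiny sketch to check the $2^{D_i}$ count numerically. It assumes $D = D_i$ hidden units with identity weights and zero biases, so each unit clips along one axis and the activation pattern is just the sign pattern of the input; counting distinct patterns over random points counts the linear regions.

```python
import numpy as np

rng = np.random.default_rng(0)
for D_i in (1, 2, 3, 4):
    # D = D_i hidden units with identity weights and zero biases (assumption),
    # so h = relu(x) and the activation pattern is the sign pattern of x.
    X = rng.normal(size=(10_000, D_i))           # random input points
    patterns = {tuple(row) for row in (X > 0)}   # which hidden units are 'on'
    print(D_i, len(patterns))                    # prints 2, 4, 8, 16
```

Each distinct pattern corresponds to one orthant, i.e., one linear region.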

Ch 4: Deep neural networks

Time spent: 2.0 hours