Three ingredients of a machine learning algorithm:
- The hypothesis class: describes how we map inputs to outputs.
- The loss function: specifies how well a given hypothesis performs.
- An optimization method: a procedure to minimize the sum of losses over the training set.
Softmax / cross-entropy loss
Use the softmax function to normalize each dimension of the output, turning the raw scores into a probability for every class.
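The softmax normalization and the resulting cross-entropy loss can be sketched in a few lines of NumPy (the function name is illustrative; subtracting the max before exponentiating is the standard numerical-stability trick):

```python
import numpy as np

def softmax_cross_entropy(h, y):
    """Cross-entropy loss for a logit vector h (shape: n_classes,) and true label y.

    Subtracting max(h) before exponentiating does not change the softmax
    but prevents overflow for large logits.
    """
    z = np.exp(h - np.max(h))
    z = z / z.sum()          # softmax: each entry is now a class probability
    return -np.log(z[y])     # negative log-probability of the true class
```

For example, with uniform logits `h = [0, 0, 0]` every class gets probability 1/3, so the loss is `log 3` regardless of the label.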
The gradient of the softmax objective.
Approach #1 (right way): Use matrix differential calculus, Jacobians, Kronecker products, and vectorization
Approach #2 (hacky quick way): Pretend everything is a scalar, use the typical chain rule, and then rearrange/transpose matrices/vectors to make the sizes work (and check your answer numerically).
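As a sketch of approach #2 for the cross-entropy loss: writing $\ell_{ce}(h, y) = -h_y + \log \sum_j \exp(h_j)$ and differentiating each scalar entry,

$$
\frac{\partial \ell_{ce}}{\partial h_i} = -\mathbf{1}\{i = y\} + \frac{\exp(h_i)}{\sum_j \exp(h_j)}
\quad\Longrightarrow\quad
\nabla_h \ell_{ce}(h, y) = z - e_y,
$$

where $z = \mathrm{softmax}(h)$ and $e_y$ is the one-hot vector for the true class. The sizes already match here, and the answer can be checked numerically.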
Early frameworks used the backprop algorithm to compute gradients on the original computational graph; mainstream frameworks today use reverse mode AD instead, i.e., they construct a separate backward computational graph to compute the gradients.
Benefits of using reverse mode AD:
There are many methods to compute derivatives.
Numerical differentiation suffers from numerical error and is less efficient to compute.
However, numerical differentiation is a powerful tool for checking an implementation of an automatic differentiation algorithm in unit test cases.
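A minimal sketch of such a numerical check using central differences (the function name and `eps` value are illustrative):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of scalar f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d.flat[i] = eps
        # (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps) approximates df/dx_i
        g.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g
```

In a unit test, the gradient produced by the AD implementation is compared against `numerical_grad` with a tolerance such as `np.allclose(..., atol=1e-4)`; central differences have $O(\epsilon^2)$ error, so a loose tolerance suffices.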
Write down the formulas, then derive the gradient by the sum, product, and chain rules.
Doing so naively can result in wasted computation.
For example, for $f(x) = x_1 x_2 \cdots x_n$, each partial derivative $\partial f / \partial x_i$ is a product of $n-1$ terms, so it costs $n(n-1)$ multiplies to compute all partial gradients.
Each node represents an (intermediate) value in the computation; edges represent input-output relations.
Forward mode automatic differentiation (AD)
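One common way to sketch forward mode AD is with dual numbers: each value carries its derivative with respect to the chosen input alongside it, and every operation updates both (this `Dual` class is an illustrative toy, not any particular framework's API):

```python
class Dual:
    """Dual number carrying (value, derivative) for forward-mode AD."""
    def __init__(self, val, dot=0.0):
        self.val = val  # the value itself
        self.dot = dot  # derivative of this value w.r.t. the chosen input

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (u + v)' = u' + v'
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (u * v)' = u * v' + u' * v
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__
```

Seeding the input with `dot=1.0` and evaluating, say, $f(x) = x^2 + 3x$ at $x = 2$ yields `val = 10.0` and `dot = 7.0` in a single forward pass, matching $f'(x) = 2x + 3$.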
The limitation of forward mode AD: for $f: \mathbb{R}^n \rightarrow \mathbb{R}^k$, we need $n$ forward AD passes to get the gradient with respect to each input.
We mostly care about the case where $k = 1$ and $n$ is large, so to solve the problem efficiently we need another kind of AD.
Reverse mode automatic differentiation (Reverse mode AD)
Derivation for the multiple-pathway case: when a node's value is used along several downstream pathways, its adjoint is the sum of the partial adjoints contributed by each pathway.
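Writing $\overline{v_i}$ for the adjoint of node $v_i$ (the derivative of the output $y$ with respect to $v_i$), the multiple-pathway rule can be stated as:

$$
\overline{v_i} \;=\; \frac{\partial y}{\partial v_i}
\;=\; \sum_{j:\, i \in \mathrm{inputs}(j)} \overline{v_j}\, \frac{\partial v_j}{\partial v_i},
$$

i.e., each consumer $v_j$ contributes one partial adjoint, and the full adjoint is their sum.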
Reverse AD algorithm
Difference between Backprop and Reverse mode AD
Why should we use reverse mode AD?
Reverse mode AD makes it easy to compute the gradient of a gradient: since the backward pass is itself a computational graph, we can simply construct another backward graph on top of it.
It also brings many more opportunities for the underlying machine learning framework to perform optimizations on the gradient computation.