Currently, I’m writing my master’s thesis on the subject of malleability in deep neural networks — that is, the benefits and detriments of giving a deep neural network more trainable parameters. Obviously weights and biases are trainable parameters, but the recent development of PReLUs introduced some trainable parameters into the activation functions to produce world-class results in image recognition.
I’ll be discussing some new malleable constructs in my thesis, including Momentum ReLUs (MReLUs), Activation Pools (APs), and Parametric Activation Pools (PAPs). However, this blog post will mostly focus on PAPs and their performance.
Note: Consider this research in progress. I acknowledge that there probably isn’t enough experimentation to prove that these concepts generalize well across different datasets and hyperparameters. All thoughts and comments are welcome.
The idea behind parametric activation pools is simple:
Each neuron has multiple activation functions instead of just one, and each activation function has its own branch parameter, $\alpha_1, \cdots, \alpha_n$; in the case of parametric activation pools, these branch parameters are trainable. Introducing more parameters into the fold increases computational complexity, but hopefully allows us to find better fits for our network.
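As a minimal sketch of the idea (the function names and numpy framing are mine, not the thesis code), a pool's output is just the $\alpha$-weighted sum of its member activations:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def step(x):
    # derivative of the ReLU: 0 for x <= 0, 1 for x > 0
    return (np.asarray(x) > 0).astype(float)

def activation_pool(x, alphas, funcs):
    # alpha-weighted sum of the pool's activations; in a PAP the
    # alphas are trainable parameters, here they are plain arguments
    return sum(a * f(x) for a, f in zip(alphas, funcs))

x = np.array([-1.0, 0.5, 2.0])
y = activation_pool(x, alphas=[0.8, 0.2], funcs=[relu, step])
# y == 0.8 * ReLU(x) + 0.2 * step(x)
```

With the branches fixed at 0.8 and 0.2 this is exactly the fixed-rate mix discussed below; making the alphas trainable turns it into a PAP.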
My current intuition on why this works can be most easily explained when viewing the loss function as a function of two weights (on different layers).
In this particular example, I created a small convnet for CIFAR10. These two pictures are representative of the general shapes I get when plotting out the weights. (Note: this is not exhaustive or typical of every weight pair, only the majority.) My activation pool has two activation functions with trainable branches: (1) a ReLU and (2) a step function (for all intents and purposes, the derivative of the ReLU). Why does the PAP have a much clearer, almost trench-like minimum where convergence is obvious? Let’s back up for a second and explain a few foundational concepts.
Imagine for a second that the branches are not trainable: what would you expect the output to look like if, say, 0.5 was routed to the ReLU function ($P$) and 0.5 to the step function ($D$)? Essentially, we have created a PD controller (a PID controller without the integral term), which is a common design in electrical control systems. Sparing you all the boring details: one can mix a signal that is the derivative of your original signal into the loop, and this adds more stability/resistance to change in your feedback loop. I was curious whether the same applies to neural networks, and this appears to be borne out by the empirical results.
For instance, compare this MReLU (Momentum ReLU) with fixed branching rates of 0.8 for $P$ and 0.2 for $D$ (which we expect to have a little more smoothness to it than a full-on ReLU)
to this configuration of the exact same net with 0.5 for $P$ and 0.5 for $D$ (which we expect to be a much more gradual slope)
If you want another way to think about this, the step function is, in some respects, a bias: it adds a constant amount of signal (recall that the derivative of a constant is 0). Since the amounts of $P$ and $D$ must sum to 1.0 (which they should), the more $D$ you have, the less $P$ you have. And because the derivative of the $D$ branch is 0, the less $P$ you have, the smaller your gradients become. Thus, we can scale the gradients flowing through the unit by the constant amount of signal flowing into $D$.
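A quick numeric sanity check of that claim (my own sketch, not the thesis code): on the positive side, the step branch of a fixed-branch MReLU contributes only the constant $D$, so the slope there is exactly $P$.

```python
import numpy as np

def mrelu(x, p, d):
    # fixed-branch Momentum ReLU: p * ReLU(x) + d * step(x)
    x = np.asarray(x, dtype=float)
    return p * np.maximum(x, 0.0) + d * (x > 0)

# for x > 0 the step branch is the constant d, so the slope is just p:
# shrinking p (growing d) shrinks every gradient through the pool
x, eps = 1.0, 1e-6
slope = (mrelu(x + eps, 0.5, 0.5) - mrelu(x - eps, 0.5, 0.5)) / (2 * eps)
```

Here a 0.5/0.5 split halves the gradient relative to a plain ReLU, which matches the "more gradual slope" behavior above.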
EDIT: another small benefit is that, in some cases, this network can learn when normal ReLUs can’t. For instance, look at this simulation with a learning rate of 0.35 (very, very high):
Now that we have covered the MReLU, assume that the neural network does have control over the branch parameters. The network can then adjust the slope ($D \rightarrow 0$ gives a steep slope, $D \rightarrow 1$ a gradual one) and the overall effect of the activation ($D + P \rightarrow 0$ turns it off, and vice versa). This allows our network not only to decide which signals are important and which are not, but also to adjust how quickly each pathway in the network converges based on how it affects the loss function.
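To make the "trainable branches" part concrete, here is a toy single-neuron sketch (my own illustration, not the thesis code). Since the pool output is linear in the alphas, $\partial L / \partial \alpha_i$ is just $\partial L / \partial y \cdot f_i(x)$, and the branch parameters get ordinary gradient-descent updates alongside the weights:

```python
import numpy as np

# toy setup: y = a1 * ReLU(x) + a2 * step(x), loss = (y - target)^2
x, target, lr = 2.0, 1.0, 0.1
a = np.array([0.5, 0.5])                        # trainable branch parameters
feats = np.array([max(x, 0.0), float(x > 0)])   # [ReLU(x), step(x)]
for _ in range(100):
    y = a @ feats
    a -= lr * 2.0 * (y - target) * feats        # dL/da_i = dL/dy * f_i(x)
```

The alphas settle so that the pool output matches the target; in a real network the same update happens per layer (or per node) through backprop.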
Let’s take a look at some results:
| Activation | Loss | Accuracy |
| --- | --- | --- |
| Activation Pool | 0.0672 +/- 0.0022 | 0.9788 +/- 0.0007 |
| ReLU | 0.1102 +/- 0.0078 | 0.9671 +/- 0.0024 |
This is just a taste with some caveats — MNIST isn’t a particularly hard dataset and 20 epochs isn’t particularly many. However, what we can say is that on this dataset, PAPs converge more quickly and more consistently to a better fit. I’m currently running simulations on bigger datasets and will update this post when they finish.
Here’s everything my research has unearthed in a concise summary:
Momentum ReLUs (MReLUs)

- Allow us some control over how smooth the gradient function is.
- Can learn at very high learning rates because of their smoother nature, although this is not necessarily an approach you would want to take (it doesn’t work out well in practice).
Parametric Activation Pools using MReLU
- Allow the network to determine how important any given pathway is, turning them on or off by adjusting the $\alpha$ parameters.
- Training time is approximately 10% longer.
- Mean loss is approximately 60% of the ReLU loss.
- Loss standard deviation is approximately 3-4 times smaller than ReLU’s (i.e., more consistent runs).
This is just the tip of the iceberg for this concept. There are many other considerations, such as how you treat the $\alpha$ parameters (should $\alpha_i$ be the same for all nodes in a layer, or should each node learn its own value for $\alpha_i$?). That may be a topic for another blog post, though.