Benchmarking ReLU and PReLU using MNIST and Theano

The abilities of deep learning are fascinating, just as this Paschke arch (CC by David DeHetre)

One of the successful insights in training neural networks has been the rectified linear unit, or ReLU for short, as a fast alternative to traditional activation functions such as the sigmoid or the tanh. One of the major advantages of the simple ReLU is that it does not saturate at the upper end, so the network is able to distinguish a poor answer from a really poor answer and correct accordingly.

A schematic of the PReLU. The LReLU has the same schematic, with the only difference being that α is a constant. Courtesy of the PReLU article.

A modification of the ReLU, the Leaky ReLU, which does not saturate in the opposite direction, has been tested but did not help. Interestingly, in a recent paper by the Microsoft deep learning team, He et al. revisited the subject and introduced a Parametric ReLU, the PReLU, achieving superhuman performance on ImageNet. The PReLU learns the parameter α (alpha) and adjusts it through basic gradient descent.

In this tutorial I will benchmark a few different implementations of the ReLU and the PReLU in Theano. The benchmark will use the MNIST database, mostly for convenience.

Why Theano

Coming from an R environment, I tried to find a good deep learning alternative in R. Unfortunately, the graphics card integration is often lacking, and the alternatives outside R seem to be much further along. I chose Theano as it is one of the most popular packages and it compiles everything at the back end for speed. There are several packages that build upon Theano, but I figured it was just as well to learn something from the core.

Possible ReLU and PReLU implementations

I’ve come across a few different ReLU implementations:
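
Roughly sketched with Theano's tensor operators, the three variants look something like this (the names ReLU1–ReLU3 are just labels for this post; the exact constants and argument names are my own):

```python
import theano.tensor as T

def ReLU1(X):
    # element-wise maximum against zero
    return T.maximum(X, 0.)

def ReLU2(X):
    # switch on the sign of the input
    return T.switch(X < 0., 0., X)

def ReLU3(X):
    # negative inputs cancel against their absolute value,
    # positive inputs are doubled and then halved
    return (X + abs(X)) / 2.
```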

The only one that is slightly less intuitive is the third one, where adding the absolute value cancels out the negative inputs while the positive inputs are doubled and then halved. For obvious reasons, only ReLU2 and ReLU3 can be adapted to a PReLU version:
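
A sketch of what the parametric versions could look like, assuming alpha is a Theano shared variable sized to match the layer's activations (more on that below):

```python
import theano.tensor as T

def PReLU2(X, alpha):
    # switch-based: alpha scales the negative part
    return T.switch(X < 0., alpha * X, X)

def PReLU3(X, alpha):
    # absolute-based: reduces to ReLU3 when alpha equals zero
    return ((1. + alpha) * X + (1. - alpha) * abs(X)) / 2.
```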

Note that this also requires the alpha parameters for the PReLU to be set up. They need to correspond to the number of activations in the corresponding layer and be included in the update function – here’s an abstract of the PReLU test function that takes care of this. Note the calculations of the input sizes and how they relate, as this is crucial for setting the correct alpha shapes:
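
As a sketch of that bookkeeping, assuming a convnet along the lines of Alec Radford's (32/64/128 feature maps plus a 625-unit dense layer) and one alpha per feature map as in He et al.:

```python
import numpy as np
import theano

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_alpha(shape, value=0.25):
    # He et al. initialise the alphas at 0.25; here one alpha per
    # feature map (conv layers) or per hidden unit (dense layer)
    return theano.shared(floatX(np.ones(shape) * value))

# Hypothetical layer sizes, only to illustrate the shape bookkeeping:
alphas = [
    init_alpha(32),    # 1st conv layer: 32 feature maps
    init_alpha(64),    # 2nd conv layer: 64 feature maps
    init_alpha(128),   # 3rd conv layer: 128 feature maps
    init_alpha(625),   # fully connected layer: 625 hidden units
]

# When applied to a 4D conv output the alpha must be broadcast over the batch
# and spatial dimensions, e.g. alpha.dimshuffle('x', 0, 'x', 'x').
# The alphas also have to be appended to the parameter list that is passed to
# the update function (e.g. params = [w, w2, w3, w4, w_o] + alphas) so that
# gradient descent adjusts them together with the weights.
```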

Setting up the MNIST

I rely on the excellent tutorial by Alec Radford for loading the MNIST database:
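
Something along these lines, assuming load.py from that tutorial's repository is on the Python path and the raw MNIST files sit in the data directory it expects:

```python
from load import mnist  # helper from https://github.com/Newmu/Theano-Tutorials

trX, teX, trY, teY = mnist(onehot=True)

# reshape to (examples, channels, rows, columns) for the convolutional layers
trX = trX.reshape(-1, 1, 28, 28)
teX = teX.reshape(-1, 1, 28, 28)
```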

As the MNIST is almost too easy we’ll limit the dataset to 1/6 of the original size:
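
For example, keeping the first 10,000 of the 60,000 training images (my reading of "1/6 of the original size"; the exact subsetting may differ):

```python
# keep 1/6 of the 60,000 training images
trX = trX[:10000]
trY = trY[:10000]
```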

The basic ReLU benchmark functions

The network is identical to Alec’s original net, which attains about 99.5% accuracy on the full dataset after 30 epochs.
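
The full convolutional net follows Alec's tutorial, so here is only a simplified, fully connected stand-in that shows how a given rectifier implementation is plugged in and each epoch is timed (plain SGD and placeholder sizes, not the actual benchmark code):

```python
import time
import numpy as np
import theano
import theano.tensor as T

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.01))

def benchmark(rectify, trX, trY, epochs=30, batch_size=128, lr=0.05):
    """Time a small fully connected net using the supplied rectifier."""
    X, Y = T.fmatrix(), T.fmatrix()
    w_h, w_o = init_weights((784, 625)), init_weights((625, 10))

    h = rectify(T.dot(X, w_h))
    py_x = T.nnet.softmax(T.dot(h, w_o))
    cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))

    params = [w_h, w_o]
    updates = [(p, p - lr * g) for p, g in zip(params, T.grad(cost, params))]
    train = theano.function([X, Y], cost, updates=updates,
                            allow_input_downcast=True)

    times = []
    for epoch in range(epochs):
        start = time.time()
        for i in range(0, len(trX), batch_size):
            train(trX[i:i + batch_size], trY[i:i + batch_size])
        times.append(time.time() - start)
    return times

# usage with the flattened images, e.g.:
# relu_times = benchmark(ReLU1, trX.reshape(len(trX), -1), trY)
```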

The PReLU training is identical with a few small exceptions:
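
In terms of the simplified stand-in above, the exceptions boil down to giving the hidden layer its own alpha and letting the update rule touch it (again a sketch, not the exact benchmark code):

```python
import numpy as np
import theano
import theano.tensor as T

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def PReLU2(X, alpha):
    return T.switch(X < 0., alpha * X, X)

X, Y = T.fmatrix(), T.fmatrix()
w_h = theano.shared(floatX(np.random.randn(784, 625) * 0.01))
w_o = theano.shared(floatX(np.random.randn(625, 10) * 0.01))

# difference 1: a learnable alpha per hidden unit, initialised at 0.25
alpha_h = theano.shared(floatX(np.ones(625) * 0.25))
h = PReLU2(T.dot(X, w_h), alpha_h)

py_x = T.nnet.softmax(T.dot(h, w_o))
cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))

# difference 2: the alpha joins the parameter list so the updates adjust it too
params = [w_h, w_o, alpha_h]
updates = [(p, p - 0.05 * g) for p, g in zip(params, T.grad(cost, params))]
train = theano.function([X, Y], cost, updates=updates, allow_input_downcast=True)
```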

Results and conclusions

My three main conclusions are:

  • The maximum and the absolute calculations seem to have performed equally fast.
  • The added time using PReLU is minimal.
  • Similarly, the added accuracy is minimal, although the PReLU seems to find the sweet spot slightly faster.

The latter point is hard to really rely on due to the limited complexity of the MNIST database; I would expect the PReLU to come in handy when dealing with more complex tasks.

Using some R code I created a few plots illustrating the above conclusions (after some googling and an error when installing the Python ggplot, I gave up on plotting directly in Python):

Bar chart comparing the ReLU and PReLU run times at the end of 30 epochs

A line chart illustrating the lack of difference in accuracy between the methods

The α values

Interestingly, the α (alpha) values behaved in a similar fashion to those in the original article: alphas in the lower layers were higher than those in the upper layers. Here’s a shortened sample from the PReLU3 print:

Deriving the ReLU/PReLU

Part of what impacts the speed of an implementation is its derivative. From what I understand, this is something that Theano handles in the background using the grad function:
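
For example, asking Theano for the derivative of x² and pretty-printing the unoptimised expression graph (this mirrors the derivative example in the Theano documentation):

```python
import theano
import theano.tensor as T

x = T.dscalar('x')
y = x ** 2
gy = T.grad(y, x)       # symbolically equivalent to 2 * x

# theano.pp pretty-prints the (unoptimised) symbolic graph
print(theano.pp(gy))
```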

This gives a somewhat harder-to-read equivalent of 2 * x:

Using the same approach for the maximum function gives:

It is readable but hardly intuitive that the meaning is x > 0 ? 1 : 0:

The absolute calculation ((x + abs(x)) / 2.0) gives a rather mind-numbing expression that I think reduces to (1 / 2 + 1 / 2 * x / |x|):

And if you want to get a real headache, here’s the PReLU winner and its two derivatives:
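
A sketch of how such a printout can be produced, here using the absolute-based PReLU as the example and taking the gradient with respect to both the input and alpha (whether this matches the exact "winner" expression is my assumption):

```python
import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')
alpha = theano.shared(np.ones(1) * 0.25, name='alpha')

prelu = ((1. + alpha) * x + (1. - alpha) * abs(x)) / 2.
cost = prelu.sum()  # grad needs a scalar

dx, dalpha = T.grad(cost, [x, alpha])
print(theano.pp(dx))      # derivative with respect to the input
print(theano.pp(dalpha))  # derivative with respect to alpha
```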

I haven’t even tried to deduce the elements… not sure I can even find x > 0 ? 1 : α in this mess…

Using Jan Schlüter’s approach

As Jan Schlüter points out, the correct way of analyzing the output is through the debugprint method. Unfortunately, I find it not that much easier to read:
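
A sketch of that approach: compile a function for each expression and run theano.printing.debugprint on the compiled function, so the printout reflects the optimised graph that is actually executed (the variant labels are mine, and the second absolute form is written in the spirit of Jan's simplification):

```python
import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')
alpha = theano.shared(np.ones(1) * 0.25, name='alpha')

variants = [
    ('switch variant', T.switch(x < 0., alpha * x, x)),
    ('absolute variant 1', ((1. + alpha) * x + (1. - alpha) * abs(x)) / 2.),
    ('absolute variant 2', 0.5 * (1. + alpha) * x + 0.5 * (1. - alpha) * abs(x)),
]

for name, expr in variants:
    print(name)
    f = theano.function([x], expr)
    theano.printing.debugprint(f)  # the optimised graph Theano will run
```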

The switch-variant:

The absolute variant 1:

The absolute variant 2:

Environment

The benchmark was performed on a cuDNN-enabled K40c GPU together with Theano 0.7 and Ubuntu 14.04.



3 Responses to Benchmarking ReLU and PReLU using MNIST and Theano

  1. Hey, nice post, but there are two traps you’ve fallen into as a new Theano user.
    1. If you want to benchmark the performance of the rectifier nonlinearity itself (whether leaky/parameterized or not), timing a full network may introduce too much uncertainty. Theano provides a way to profile its operations, though, and you can use that to profile just the forward pass or just the backward pass (gradient) of the activation function, as I did here: https://github.com/Lasagne/Lasagne/pull/163#issuecomment-85635041
    2. Your printout of the symbolic expressions is more complicated than what’s computed by Theano. On compiling a function, several graph optimizers modify the expression to be simpler or more numerically stable. So instead, you should compile a function computing the expression and use theano.printing.debugprint on it. In the github issue linked above, I show how to do that as well.
    Besides, your PReLU3 expression can be further simplified, as I’ve done here: https://github.com/Lasagne/Lasagne/commit/ab0fef321f98823e37d9250c044ed907d8996f91
    Feel free to amend your post or link to the discussion on github!
    Cheers, Jan

    • Max Gordon says:

      Thanks for the feedback! Sorry for not getting back earlier, but I’ve had a lot to take care of before I could rerun the code with your input. I’ve added your variant and also a little debugprint output. I should probably spend more time looking at exactly how to interpret the debugprint, but it is nice to at least know the correct Theano way to look at the end result.

      I’m also glad that the discussion about adding a default ReLU implementation to Theano has taken off. I’m aware of the problem with benchmarking a full network; this has been my playground for learning the basics, and the benchmark addition was just something I did when trying to figure out how to implement the PReLU.

      Best
      Max

  2. Pingback: Debug a deep Neural Networks | Keunwoo Choi
