# Deep Learning and Artificial General Intelligence

I’ve been studying and writing about deep learning (DL) for a few years now, and it still amazes me how much misinformation surrounds this relatively complex learning algorithm. This post is not about whether deep learning is or is not over-hyped; that claim is well documented elsewhere. Rather, it’s a jumping-off point for a (hopefully) fresh, concise understanding of deep learning and its implications for artificial general intelligence (AGI). I’m going to be bold and try to make some claims about the role that this field of study will or will not play in the genesis of AGI. With all of the news on AI breakthroughs and non-industry commentators drawing rash conclusions about how deep learning will change the world, don’t we, as the deep learning community, owe it to the world to at least have our own camp in order?

[Note added 05/05/16: Please keep in mind that this is a blog post, not an academic paper. My goal was to express my thoughts and inspire some discussion about how we should contextualize deep learning, not to lay out a deeply technical argument. Obviously, a discussion of that magnitude could not be achieved in a few hundred words, and this post is aimed at the machine learning layman nonetheless. One body of text cannot be all things to all people.]

## The Problem

Even the most academic among us mistakenly merge two very different schools of thought in our discussions on deep learning:

1. The benefits of neural networks over other learning algorithms.
2. The benefits of a “deep” neural network architecture over a “shallow” architecture.

Much of the debate going on is surprisingly still concerned with the first point instead of the second. Let’s be clear — the inspiration for, benefits of, and detriments of neural networks are all well documented in the literature. Why are we still talking about this like the discussion is new? Nothing is more frustrating when discussing deep learning than someone explaining their views on why deep neural networks are “modeled after how the human brain works” and thus are the key to unlocking artificial general intelligence. This is an obvious straw man, since it is essentially the same discussion that took place when vanilla neural networks were introduced.

The idea I’d like for you to take away here is that we are not asking the right question for the answer we desire. If we want to know how to contextualize deep neural networks in an increasingly artificially intelligent world, we must answer the following question: what do increasing computing power and adding layers to a neural network actually allow us to do better than a normal neural network? Answering this could yield a fruitful discussion on deep learning.

If the first question is worn out, let’s take on the second: I believe that deep neural networks are more useful than traditional neural networks for three reasons:

1. The automatic encoding of features which previously had to be hand engineered.
2. The exploitation of structurally/spatially associated features.
3. Configurability through the use of stackable layers.

At the risk of sounding bold, that’s largely it. These are the only three benefits that I can recall in my time working with deep learning.

Assuming my statements above are true, what would we expect to see in the deep learning landscape? We might expect deep neural networks to be most useful in learning problems where the data has some spatial quality that can be exploited, such as image data, audio data, and natural language. Although many areas could benefit from that spatial exploitation, we would certainly not find that this algorithm is a magical cure for any data you throw at it. We might find that deep learning helps self-driving cars perceive their environment through visual and radar-based sensory input, but not that it produces a network that can decide whether to protect its own driver or the pedestrian in the street. Those who read the AlphaGo paper will note that deep learning was simply one tool used alongside traditional AI algorithms.

Since I am feeling especially bold today, I will make another prediction: deep learning alone will not produce artificial general intelligence. There is simply not enough there to create such a complex system. I do think it is reasonable, however, to expect that it will be used as a sensory processing system leveraged by more traditional artificial intelligence systems.

Stop studying deep learning thinking it will lay all other algorithms to waste. Stop throwing deep learning at every dataset you see. Start experimenting with these technologies outside of the “hello world” examples in the packages you use — you will quickly learn what they are actually useful for. Most of all, let’s stop viewing deep learning as a proof that we have almost achieved AGI and start viewing it for what it truly is: a tool that is useful in assisting a computer’s ability to perceive.

# Creating Your Own IPython-Like Server

Lately I’ve been using Jupyter (formerly IPython) notebooks frequently for reproducible research, and I’ve been wondering how it all works under the hood. Furthermore, I’ve needed some custom functionality that IPython doesn’t include by default. Instead of extending IPython, I decided to take a stab at building my own simple IPython-like kernel that runs on a remote server where my GPU farm lives. I won’t be worrying about security or concurrency, since I will be the only person with access to the server. The exercise should give you an idea of how server-based coding environments work in Python.

Since this is not a production server, Flask is perfect for our needs. Let’s start with a simple Flask server that does nothing. I’ll include some imports we will need later.
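A minimal skeleton along those lines might look like this sketch (my reconstruction; the imports are the ones the later sections will use):

```python
import sys
import traceback
from io import StringIO

from flask import Flask, request, jsonify

# a Flask app that does nothing yet
app = Flask(__name__)

# To serve locally: app.run(port=5000)
```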

## Executing Code

There is really only one magical piece to cover here: how does Python take a string of code, execute it, and return the output? Let’s start with the naive approach.

You can execute any Python statement using the exec() function. I’m going to create a Flask endpoint that takes a POST parameter named ‘code’, splits the submission by newlines, and runs each statement in sequence. Here is what the code looks like.
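A sketch of what that endpoint could look like (the root route, the module-level namespace dict, and the ‘Success’ reply are my assumptions):

```python
from flask import Flask, request

app = Flask(__name__)
shared_env = {}  # for now, every request executes in this one shared namespace


@app.route('/', methods=['POST'])
def kernel():
    code = request.form['code']          # the POST parameter named 'code'
    for statement in code.split('\n'):   # run each statement in sequence
        exec(statement, shared_env)
    return 'Success'

# To serve locally: app.run(port=5000)
```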

Easy enough! You already have a minimal, Python-executing server in 15 lines of code (including unused imports and correct spacing). To test this, I use the POSTMAN client to hit my local server with POST requests.

Send a POST request to http://localhost:5000/ with the POST parameter ‘code’ set to print('hello world') like the picture below and hit ‘Send’. As expected, the server reads the code, prints out ‘hello world’ (on the server side), then returns.

## Redirecting Output

This isn’t very useful to us yet — although the server successfully receives and executes the code, the client only receives a “Success” message. Ideally, we want to redirect the output from the executing program back to the client. To achieve this, we must capture what is being written to standard out into a string buffer and return that string to the client. After some research, I determined this could be done by temporarily redirecting standard out to a StringIO buffer, like so:
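One way to sketch that redirection, assuming the same single endpoint as before:

```python
import sys
from io import StringIO

from flask import Flask, request

app = Flask(__name__)
shared_env = {}


@app.route('/', methods=['POST'])
def kernel():
    code = request.form['code']
    buffer = StringIO()
    old_stdout = sys.stdout
    sys.stdout = buffer                  # capture everything printed
    try:
        for statement in code.split('\n'):
            exec(statement, shared_env)
    finally:
        sys.stdout = old_stdout          # always restore the real stdout
    return buffer.getvalue()             # relay captured output to the client
```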

Looking at the output from the Postman Client, we can see that the server is now relaying back the stdout to the client as expected.

Note: Redirecting standard out in this way will redirect the output for all connected clients. Thus, if multiple people run code at the exact same time, the outputs will overlap. Don’t do this. That’s why I noted this is not a production-ready server.

## Different Environments

There is another major problem in our implementation — everything is executed in the same environment. One of the nice things about IPython is that you can work in several different notebooks at the same time, and none of the variables or functionality overlap. This concept does not exist in our design: if I’m working on two different ideas at the same time, all of the variables between the two scripts would be shared.

The problem lies in how we used the exec() function earlier. Remember that in Python, everything in an environment (technically a namespace) is just stored as a dict in the __dict__ field (see this post for more information). We can execute code in different environments by passing exec() a dictionary to use as its namespace, like this:
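Concretely, a minimal sketch:

```python
env = {}             # a fresh, empty environment: just a dict
exec("j = 1", env)   # run the statement using env as its globals
# env['j'] is now 1, while our own namespace is untouched
```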

After this code snippet has executed, env['j'] holds the value 1. Furthermore, any variable in env can be used by later code executed in that environment. We can take advantage of this technique to run code in multiple different environments.

First, let’s introduce some boilerplate functionality for creating, deleting, and getting information about entries in a new environments variable (a dict of dicts, mapping each environment id to the dictionary backing that environment).
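A possible sketch of that boilerplate (the /env/* route names and the JSON response shapes are my guesses):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
environments = {}  # environment id -> namespace dict


@app.route('/env/create', methods=['POST'])
def create_env():
    environments[request.form['id']] = {}   # a blank namespace for this id
    return jsonify(list(environments.keys()))


@app.route('/env/delete', methods=['POST'])
def delete_env():
    environments.pop(request.form['id'], None)
    return jsonify(list(environments.keys()))


@app.route('/env/info', methods=['POST'])
def env_info():
    # repr() the values, since arbitrary Python objects are not JSON serializable
    env = environments.get(request.form['id'], {})
    return jsonify({name: repr(value) for name, value in env.items()})
```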

Now, if I send a POST request to http://localhost:5000/env/create with the POST parameters set to {id: 1}, the server creates a blank dictionary for the environment id and sends me back all environments that have been created. Similarly, I could delete environments or get all available information in the environment.

Hooking this up with our code execution is pretty simple as well.
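A sketch of the combined endpoint (the /kernel route and parameter names are my assumptions):

```python
import sys
from io import StringIO

from flask import Flask, request

app = Flask(__name__)
environments = {}  # environment id -> namespace dict


@app.route('/kernel', methods=['POST'])
def kernel():
    # each id gets its own isolated namespace
    env = environments.setdefault(request.form['id'], {})
    buffer, old_stdout = StringIO(), sys.stdout
    sys.stdout = buffer
    try:
        for statement in request.form['code'].split('\n'):
            exec(statement, env)      # execute inside this id's environment
    finally:
        sys.stdout = old_stdout
    return buffer.getvalue()
```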

Note that now, I have taken care to execute each code statement in the environment id provided.

## Error Handling

There is one last, glaringly obvious bug in our code: our design fails miserably when an error occurs. If you had mistyped anything so far in the tutorial, such as sending prnt('hi') to the server, you would have received a solemn 500 error with no extra information from our server. Ideally, we would much rather receive the stack trace on the client side than a response that is so opaque!

Adding error handling to our server is as simple as catching exceptions and reporting the stack trace. We can get the stack trace by calling traceback.format_exc(). Since I like to make it blatantly obvious that an error has occurred, I watch for an error to occur, then send back the stack trace under the ‘error’ key.

We can modify our kernel method slightly to get the functionality we require.
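One way that modification might look (a sketch, reusing the same assumed /kernel route):

```python
import sys
import traceback
from io import StringIO

from flask import Flask, request, jsonify

app = Flask(__name__)
environments = {}


@app.route('/kernel', methods=['POST'])
def kernel():
    env = environments.setdefault(request.form['id'], {})
    buffer, old_stdout = StringIO(), sys.stdout
    sys.stdout = buffer
    try:
        for statement in request.form['code'].split('\n'):
            exec(statement, env)
    except Exception:
        # send the stack trace back under the 'error' key instead of a bare 500
        return jsonify({'error': traceback.format_exc()})
    finally:
        sys.stdout = old_stdout
    return buffer.getvalue()
```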

## Final Thoughts

All in all, this code gets us a long way towards creating our own IPython-like server. Writing up a simple frontend to interact back and forth with the JSON-based server is outside the scope of what I was trying to do here, but it certainly isn’t hard.

As for the issues with concurrency and security, many of these could be resolved by the use of Docker containers, which allow sandboxing and could be spun up or broken down as clients connect. This sandboxing would also fix the standard out redirection issue.

Below is the final code. 52 lines of code for a fully functioning, elegant, session-based Python kernel is not too shabby if I do say so myself. Please let me know if you have any other ideas on how to simplify/improve the code.
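Putting the pieces together, a reconstruction of the full server might look like this (route names and response shapes are my choices, not necessarily the original listing’s):

```python
import sys
import traceback
from io import StringIO

from flask import Flask, request, jsonify

app = Flask(__name__)
environments = {}  # environment id -> namespace dict


@app.route('/env/create', methods=['POST'])
def create_env():
    environments[request.form['id']] = {}
    return jsonify(list(environments.keys()))


@app.route('/env/delete', methods=['POST'])
def delete_env():
    environments.pop(request.form['id'], None)
    return jsonify(list(environments.keys()))


@app.route('/kernel', methods=['POST'])
def kernel():
    env = environments.setdefault(request.form['id'], {})
    buffer, old_stdout = StringIO(), sys.stdout
    sys.stdout = buffer                     # capture everything printed
    try:
        for statement in request.form['code'].split('\n'):
            exec(statement, env)            # run inside this id's namespace
    except Exception:
        return jsonify({'error': traceback.format_exc()})
    finally:
        sys.stdout = old_stdout             # always restore the real stdout
    return buffer.getvalue()


# To serve locally: app.run(port=5000)
```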

# Parametric Activation Pools Greatly Increase Performance and Consistency in ConvNets

Currently, I’m writing my master’s thesis on the subject of malleability in deep neural networks — that is, the benefits and detriments of giving a deep neural network more trainable parameters. Obviously weights and biases are trainable parameters, but the recent development of PReLUs introduced some trainable parameters into the activation functions to produce world-class results in image recognition.

I’ll be discussing some new malleable constructs in my thesis, including Momentum ReLUs (MReLUs), Activation Pools (APs), and Parametric Activation Pools (PAPs). However, this blog post will mostly focus on PAPs and their performance.

Note: Consider this research in progress. I acknowledge that there probably isn’t enough experimentation to prove that these concepts generalize well across different datasets and hyperparameters. All thoughts and comments are welcome.

## Theory

The idea behind parametric activation pools is simple:

Each neuron has multiple activation functions instead of just one, and each activation function has a branch parameter, giving branch parameters $\alpha_1, \cdots, \alpha_n$ per pool. In the case of parametric activation pools, these are trainable parameters. Introducing more parameters into the fold increases computational complexity, but hopefully allows us to find better fits for our network.

My current intuition on why this works can be most easily explained when viewing the loss function as a function of two weights (on different layers).

In this particular example, I created a small convnet for CIFAR10. These two pictures are representative of the general shapes I get when plotting out the weights. (Note: This is not exhaustive or typical of every weight pair, only the majority.) My activation pool has two activation functions with trainable branches: (1) a ReLU and (2) a step function (for all intents and purposes, the derivative of the ReLU). Why does the PAP have a much clearer, almost trench-like minimum where convergence is very clear? Let’s back up for a second and explain a few foundational concepts.

## Momentum ReLU

Imagine for a second that the branches are not trainable: what would you expect the output to look like if, say, 0.5 of the signal was going to the ReLU function ($P$) and 0.5 was going to the step function ($D$)? Essentially, we have created a PD controller, a close cousin of the PID controllers common in electrical control systems. Sparing you all the boring details, one can introduce a signal that is the derivative of the original signal into the mix — this adds more stability/resistance to change in the feedback loop. I was curious whether this applies to neural networks, and it appears to hold in the empirical results.

For instance, compare this MReLU (Momentum ReLU) with fixed branching rates of 0.8 for $P$ and 0.2 for $D$ (which we expect to have a little more smoothness to it than a full-on ReLU)

to this configuration of the exact same net with 0.5 for $P$ and 0.5 for $D$ (which we expect to be a much more gradual slope)

If you want another way to think about this, the step function is, in some respects, a bias, which adds a constant amount of signal (recall that the derivative of a constant is 0). Thus, if the amounts of $P$ and $D$ must sum to 1.0 (which they should), the more $D$ you have, the less $P$ you have. Furthermore, because the derivative of $D$ is 0, the less $P$ you have, the smaller your gradients are. Thus, we can scale the rate of change of the derivative function by the constant amount of signal flowing into $D$.
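To make the MReLU concrete, here is a minimal scalar sketch of my reading of it: a fixed blend of a ReLU branch ($P$) and its derivative, a step-function branch ($D$). The function names and default rates are mine.

```python
def relu(x):
    return x if x > 0 else 0.0

def step(x):
    # derivative of the ReLU (ignoring the kink at x = 0)
    return 1.0 if x > 0 else 0.0

def mrelu(x, p=0.8, d=0.2):
    # fixed branching rates: p of the signal through the ReLU, d through the step
    return p * relu(x) + d * step(x)

print(mrelu(2.0))   # 0.8 * relu(2.0) + 0.2 * step(2.0)
print(mrelu(-1.0))  # both branches are zero for negative inputs
```

Because the step branch contributes a constant for positive inputs, raising d (and lowering p) shrinks the gradient without shrinking the activation, which matches the smoothing described above.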

EDIT: another small benefit is that, in some cases, this network can learn when normal ReLUs can’t. For instance, look at this simulation with a learning rate of 0.35 (very, very high):

## Parametric Activation Pool with MReLUs

Now that we have covered the MReLU, assume that the neural network does have control over the branch parameters. In this way, the neural network can adjust the slope ($D \rightarrow 0$ = steep slope, $D \rightarrow 1$ = gradual slope) and the effect of the activation ($D + P \rightarrow 0$ = turn off, vice versa). This allows our network to not only decide which signals are important and which signals are not, but also adjust how quickly each pathway in the neural network converges based on how it affects the loss function.

Let’s take a look at some results:

## MNIST (20 epochs, 94 iters)

|                 | Loss              | Accuracy          |
| --------------- | ----------------- | ----------------- |
| Activation Pool | 0.0672 +/- 0.0022 | 0.9788 +/- 0.0007 |
| ReLU            | 0.1102 +/- 0.0078 | 0.9671 +/- 0.0024 |

This is just a taste with some caveats — MNIST isn’t a particularly hard dataset and 20 epochs isn’t particularly high. However, what we can say for certain is that on this dataset, PAPs converge more consistently to a better fit more quickly. I’m currently running some simulations on bigger datasets. I’ll update this post when I have finished those simulations.

## Summary

Here’s everything my research has unearthed in a concise summary:

Momentum ReLUs

• Allow us to have some control over how smooth the gradient function is.
• Can learn with very high learning rates because of their smoother nature, even though this is not necessarily the approach you would want to take (doesn’t work out well in practice).

Parametric Activation Pools using MReLU

• Allow the network to determine how important any given pathway is, turning them on or off by adjusting the $\alpha$ parameters.
• Training time is approximately 10% longer.
• Mean loss is approximately 60% of the ReLU loss.
• Loss standard deviation is approximately 3-4 times smaller than with ReLUs (i.e., runs are more consistent).

## Final words

This is just the tip of the iceberg for this concept. There are many other considerations, such as how you treat the $\alpha$ parameters (should $\alpha_i$ be the same for all nodes in a layer, or should each node choose its own value for $\alpha_i$?). That may be a topic for another blog post though.

# Bayes Theorem for Computer Scientists

Few topics have given me as much trouble as Bayes’ theorem over the past couple of years.

I graduated with an undergraduate degree in EE (where calculus reigns supreme) and was thrown into probability theory late in my MS coursework. Usually, if I stare at a formula long enough, I can understand what’s going on — but despite probability theory being much lower-level math than what I did in EE, I just couldn’t seem to get my head around it. This was especially true of Bayes’ theorem: I tried many times and could never really get the idea.

I’m a visual learner, and most concepts in probability theory are expressed in a multitude of different notations and forms. Not only that, but you have to keep track of many different variables that are sometimes so similar that they are hard to tell apart. For instance, is the numerator the probability of A given B or the probability of A and B? What’s the difference between those? Sure, if I sat down and thought about it for a while, it would become clear — then I would sleep for a night and the concept would become opaque again.

This article aims to clear up some foundational concepts in probability (and, briefly, how they apply to computer science) as quickly as possible.

## Probability Theory

• What? Probability theory is a branch of mathematics concerned with random processes (also known as stochastic processes).
• Why? Most phenomena that have yet to happen in the real world can be expressed as probability distributions. Therefore, probability theory can be useful in almost any scenario where we would like to predict something.
• How? The union of probability theory and computer science is a field called probabilistic programming. Through probabilistic programming techniques, we can estimate, within a reasonable doubt, the probability that something happens.

## Relevant Theorems

Disclaimer: Statistics junkies would declare this article amiss if I didn’t mention this — the theorems listed here assume that the conditioning events are mutually exclusive and collectively exhaustive. All this means is that each observation falls into exactly one of the possible outcomes, and the outcomes together cover every case (the vast majority of interesting problems fall underneath that definition).

I’ll introduce a problem to help me illustrate my points better.

Assume you have a room full of men and women. 70% of the people are women and 30% are men. Additionally, we know from polling every person that green is the favorite color of 40% of the women and 75% of the men.

## Law of total probability

With the law of total probability, we can answer the question “What % of people in the room said that their favorite color is green?” Let’s draw this problem in the form of a picture.

Let’s forget about probability theorems for a second. From this picture, how would you figure out how many people said that green was their favorite color? Simple — we can assume an arbitrary number of people in the room, find out how many men and women there are (based on the percentages given), find out how many of each sex chose green as their favorite color (based on the percentages given), and add those amounts together.

Assume there are 100 people in the room (so 70 women and 30 men). The number of women who chose green as their favorite color is $70 \cdot 0.4 = 28$ people. Similarly, we can calculate the number of men who liked green as $30 \cdot 0.75 = 22.5$ people. Adding these together, we get $28 + 22.5 = 50.5$, so 50.5% of the people in the room chose green as their favorite color.

This, in essence, is the law of total probability. The usefulness of the law of total probability is now obvious: Originally, we didn’t know what the overall probability of a person having green as a favorite color (event $B$) was. BUT, we did know the probability that a person was male or female (event $A$) and we also knew the probability of $B$ for each value of $A$ (favorite color percentage of males and females). Thus, we can learn something about the probability of $B$ for any human by adding together the probabilities of each outcome of $A$ if we know that answer for every possible outcome of $A$.

Formally, the equation for the law of total probability is:

$P(B) = \sum \limits_{n} P(B | A_n) \cdot P(A_n)$

in this case

$P(\text{person likes green}) = \sum P(\text{a sex is chosen}) \cdot P(\text{that sex likes green})$

or

$P(\text{person likes green}) = 0.7 \cdot 0.4 + 0.3 \cdot 0.75 = 0.505$
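As a quick sanity check, the same computation in a few lines of Python (variable names are mine):

```python
# Law of total probability: P(green) = sum over sexes of P(sex) * P(green | sex)
p_woman, p_man = 0.7, 0.3
p_green_given_woman, p_green_given_man = 0.4, 0.75

p_green = p_green_given_woman * p_woman + p_green_given_man * p_man
print(round(p_green, 3))  # 0.505
```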

## Bayes’ theorem

Bayes’ theorem is much more powerful — it allows us to reason about something we did not observe based on something we did observe, when we have enough information relating the two. Let’s pick up where we left off in the last problem. We still know everything that was given, and now we know some more information thanks to the law of total probability.

Let’s consider now that we choose a random person from this room, and that the chosen person’s favorite color is green (event $B$). In this example, however, we don’t yet know the sex of the person. With Bayes’ theorem, we can answer the question “Given that a randomly selected person likes green, what is the probability that the person is a female?”

Below is an updated figure based on our new knowledge of the situation.

In other words, we know the person was picked within the green circle. What is the probability that the person was picked from the red shaded area? How would you compute this without any notion of probability theory?

My approach would be to take the ratio of the red area with respect to the entire green area — and that is the approach that Bayes’ theorem takes as well. From the previous work, we know that the number of women in the green circle (given 100 people) is $70 \cdot 0.4 = 28$ people. Furthermore, we know that the number of people in the circle is $50.5$. So the probability that a female was picked from the people who like green is $\frac{28}{50.5} \approx 0.55$, or $55\%$.

Formally, Bayes’ theorem is expressed as

$P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)}$

However, the key to really understanding Bayes’ theorem is recognizing that the denominator is actually just the law of total probability! So the equation can also be expressed this way

$P(A | B) = \frac{P(B | A) \cdot P(A)}{\sum \limits_{n} P(B | A_n) \cdot P(A_n)}$

Think of Bayes’ formula this way: the numerator is the section in the green circle that we are focused on, and the denominator is all of the pieces of the green circle (including the piece we are looking at) summed together. So Bayes’ theorem is just a ratio.

Bonus: the way I think of Bayes’ theorem is

$P(A | B) = \frac{\text{focused piece of circle}}{\text{focused piece of circle} + \text{rest of the circle pieces}}$
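The whole worked example fits in a few lines of Python (variable names are mine):

```python
# Bayes' theorem as a ratio, using the numbers from the example above.
p_woman, p_green_given_woman = 0.7, 0.4
p_man, p_green_given_man = 0.3, 0.75

focused_piece = p_green_given_woman * p_woman              # women who like green
whole_circle = focused_piece + p_green_given_man * p_man   # everyone who likes green

p_woman_given_green = focused_piece / whole_circle
print(round(p_woman_given_green, 2))  # 0.55
```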

## Application to Computer Science

Briefly, Bayes’ theorem is the foundational theory in the field of Bayesian inference. After establishing a firm method for relating the outcome of a known event to an unknown event, we can observe the relationship between the two events (and vice-versa). Using Bayes’ rule, we can update our knowledge about how these two events are related. These ideas belong to a broader school of thought called Bayesian statistics, which helps us build advanced statistical models using techniques like Markov chain Monte Carlo methods and the No-U-Turn Sampler. If you would like to try these techniques out, I recommend you use an open-source library like PyMC3 instead of coding one up yourself.

## Conclusion

There are many other foundational concepts not covered here like the Union-bound theorem (Boole’s inequality) and the Inclusion-exclusion principle. However, these concepts are mostly useful for building the theorems (including the ones laid out in this article) and not in many practical applications. With a strong understanding of Bayes’ theorem, you are in a good position to dive into the deeper field of probabilistic programming.


#### Clay McLeod

Genomics and ML Software Engineering

Manager Bioinformatics Software Development

Memphis, TN