The concept of the reference genome is an important one in bioinformatics: before we can understand variation in an individual’s genome, we must first have a reference to compare against (akin to the picture on the front of a puzzle box). At the time of writing, the most up-to-date version of the reference genome is GRCh38.p13. The first version of GRCh38 debuted in 2013, but it has been receiving semi-regular patches since that time (thus, the .p13 for “patch 13”).
Though this reference included many improvements, the community was relatively slow in migrating from GRCh37 (introduced in 2009). This was likely for a variety of reasons, but perhaps most importantly because analysis pipelines are built around the particular version of the reference genome used: any update to the genome requires tens, hundreds, or thousands of other dependent reference files to be updated as well, end-to-end concordance checks to ensure the results aren’t affected negatively by the change (often time-consuming and complicated to interpret), and then potentially reprocessing your entire cohort (time-consuming and computationally expensive).
GRCh38 defined the most complete reference genome to date, including sequence-based representations of centromeres and additional alternate loci representative of a more diverse set of the population. Unfortunately, using the genome as published complicates many bioinformatics analyses, so most projects use the modified GRCh38_no_alt analysis set. In this build of the genome, (1) the alternate haplotypes are not included, (2) the pseudoautosomal regions on chromosome Y and many centromeric regions have been omitted by hard-masking the genome, and (3) the EBV genome has been added. This build limits some of the benefits that GRCh38 promised (particularly the alt haplotypes), but as Heng Li pointed out in 2017, there is no perfect reference genome, and the no-alt analysis set appears to be the best option for now. I’ll also note here that the no-alt analysis set is not updated with the new patches of the reference genome, so the years of information we’ve learned about the genome since 2013 are not being leveraged in projects that use it.
Fast-forward two years: all of the major projects I surveyed use the base GRCh38 no-alt analysis set as their reference genome, but the number of other sequences or genomes added to that base set varies widely. You can view the sequence dictionaries I considered for yourself, but just eyeballing the data shows how much variation there is.
Many tools in bioinformatics expect the sequence dictionary to be exactly the same across samples that are analyzed together. This means that any analysis wishing to combine samples from multiple sources will have to re-run alignment from scratch to get things working. With the world generating more next-generation sequencing data than ever before (both in terms of the number of samples and the data per sample, such as deep whole-genome sequencing), this is increasingly problematic. For instance, we currently have 11,589 whole-genome BAM files in St. Jude Cloud. At a cost of $30–100 per sample to process in the cloud (alignment + variant calling), you’re looking at a $500,000+ bill to reprocess them. Suffice it to say, that’s not something most research labs can afford to do. Our goal on that project is to serve the bioinformatics community wishing to use our data, so we think about this problem a lot.
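As a concrete illustration, here is one way to compare the sequence dictionaries of two BAM files in Python. This is just a sketch: the pysam dependency and the file names are my own choices, not something any of the surveyed projects prescribe.

```python
import pysam  # assumes pysam is installed (pip install pysam)

def sequence_dictionary(path):
    """Return the (name, length) pairs declared in a BAM's @SQ header lines."""
    with pysam.AlignmentFile(path) as bam:
        return list(zip(bam.references, bam.lengths))

# Hypothetical file names: two BAMs can only be analyzed together cleanly
# if their sequence dictionaries match exactly (same names, lengths, order).
print(sequence_dictionary("cohort_a.bam") == sequence_dictionary("cohort_b.bam"))
```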
The alternate loci in GRCh38 were a step forward, but these loci continue to complicate bioinformatics analyses, so they aren’t leveraged in the projects I surveyed. As for takeaways, the only thought I have is whether it’s possible to unify the set of non-human sequences used across some major projects. GRCh39 is indefinitely paused according to the notice currently on the Genome Reference Consortium’s page, but it was recently announced that a new multi-million-dollar grant was awarded to expand the genetic diversity in the human reference genome. Hopefully these thoughts will be considered on our next pass.
Even the most academic among us mistakenly merge two very different schools of thought in our discussions on deep learning: (1) the question of where neural networks draw their inspiration (e.g., whether they are “modeled after how the human brain works”), and (2) the question of what deep neural networks actually allow us to do better than their predecessors.
Much of the debating going on is surprisingly still concerned with the first point instead of the second. Let’s be clear — the inspiration for, benefits of, and arguments against neural networks are all well documented in the literature. Why are we still talking about this like the discussion is new? Nothing is more frustrating when discussing deep learning than someone explaining their views on why deep neural networks are “modeled after how the human brain works” and thus are the key to unlocking artificial general intelligence. This is an obvious straw man, since it is essentially the same discussion that took place when vanilla neural networks were introduced.
The idea I’d like for you to take away here is that we are not asking the right question for the answer we desire. If we want to know how to contextualize deep neural networks in the ever-increasing artificially intelligent world, we must answer the following question: what does increasing computing power and adding layers to a neural network actually allow us to do better than a normal neural network? Answering this could yield a fruitful discussion on deep learning.
If the first question is worn out, let’s take on the second question: I believe that deep neural networks are more useful than traditional neural networks for three reasons:
At the risk of sounding bold, that’s largely it. These are the only three benefits that I can recall in my time working with deep learning.
Assuming my above statements were true, what would we expect to see in the deep learning landscape? We might expect that deep neural networks would be most useful in learning problems where the data has some spatial qualities that can be exploited, such as image data, audio data, natural language processing, etc. Although we might say there are many areas that could benefit from that spatial exploitation, we would certainly not find that this algorithm was a magical cure for any data that you throw at it. We might find that deep learning helps self-driving cars perceive their environment through visual and radar-based sensory input, but not a network that can decide whether to protect its own driver or the pedestrian walking the street. Those who read the paper on AlphaGo will note that deep learning was simply a tool used by traditional AI algorithms.
Since I am feeling especially bold today, I will make another prediction: deep learning alone will not produce artificial general intelligence. There is simply not enough there to create such a complex system. I do think it’s not unreasonable to expect it will be used as a sensory processing system leveraged by more traditional artificial intelligence systems.
Stop studying deep learning thinking it will lay all other algorithms to waste. Stop throwing deep learning at every dataset you see. Start experimenting with these technologies outside of the “hello world” examples in the packages you use — you will quickly learn what they are actually useful for. Most of all, let’s stop viewing deep learning as a proof that we have almost achieved AGI and start viewing it for what it truly is: a tool that is useful in assisting a computer’s ability to perceive.
Since this is not a production server, Flask is perfect for our needs. Let’s start with a simple Flask server that does nothing. I’ll include some imports we will need later.
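A sketch of that starting point might look like the following (exactly which imports you pull in is up to you; sys, StringIO, and traceback all come into play later):

```python
import sys
import traceback
from io import StringIO

from flask import Flask, request, jsonify

app = Flask(__name__)

if __name__ == "__main__":
    app.run(port=5000)
```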
There is really only one magical piece to cover here: how does Python take a string of code, execute it, then return the output? Let’s start with the novel approach.
You can execute any Python statement using the exec() function. I’m going to create a Flask endpoint that takes a POST parameter named ‘code’, splits the command by newlines, and runs each command in sequence. Here is what the code looks like.
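Here’s a sketch of that endpoint (reading the parameter via request.form is an assumption about how the POST body is sent):

```python
@app.route("/", methods=["POST"])
def kernel():
    # Pull the raw source out of the 'code' POST parameter.
    code = request.form["code"]

    # Run each line of the submission in sequence.
    for statement in code.split("\n"):
        exec(statement)

    return "Success"
```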
Easy enough! You already have a minimal, Python-executing server in about 15 lines of code (including unused imports and correct spacing). To test this, I use the Postman client to hit my local server with POST requests.
Send a POST request to http://localhost:5000/ with the POST parameter ‘code’ set to print('hello world') like the picture below and hit ‘Send’. As expected, the server reads the code, prints out ‘hello world’, then exits.
This isn’t very useful to us yet — although the server successfully receives and executes the code, the client only receives a “Success” message. Ideally, we would want to redirect the output from the executing program back to the client. To achieve this, we must capture what is being written to the standard out buffer into a string buffer and return that string to the client. After some research, I determined this could be done by temporarily redirecting standard out to a StringIO buffer, like so:
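Here’s a sketch of the redirection (StringIO comes from Python 3’s io module; the restore step matters so later requests aren’t silently captured):

```python
@app.route("/", methods=["POST"])
def kernel():
    code = request.form["code"]

    # Point standard out at an in-memory buffer while the code runs.
    buffer = StringIO()
    old_stdout = sys.stdout
    sys.stdout = buffer

    try:
        for statement in code.split("\n"):
            exec(statement)
    finally:
        # Always put the real stdout back, even if the code misbehaves.
        sys.stdout = old_stdout

    # Relay whatever the code printed back to the client.
    return buffer.getvalue()
```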
Looking at the output from the Postman client, we can see that the server is now relaying the stdout back to the client as expected.
Note: Redirecting standard out in this way will redirect the output for all clients connecting. Thus, if you have multiple people running code at the exact same time, the outputs will overlap. Don’t do this. That’s why I noted this was not a production-ready server.
There is another major problem in our implementation — everything is executed in the same environment. One of the nice things about IPython is that you can work in several different notebooks at the same time, and none of the variables or functionality overlap. This concept does not exist in our design: if I’m working on two different ideas at the same time, all of the variables between the two scripts would be shared.
The problem lies in the exec() function, which I mentioned was the novel approach earlier. Remember that in Python, everything in the environment (technically a namespace in Python) is just stored as a dict in the __dict__ field (see this post for more information). We can execute code in different environments by doing something like this:
```python
# A plain dict can serve as the namespace (globals) for exec().
env = {}
exec("j = 1", env)
```
After this code snippet has executed, env['j'] holds the value 1. Furthermore, any variable in env can be used by the code we execute. We can take advantage of this technique to run code in multiple different environments.
First, let’s introduce some boilerplate functionality for creating, deleting, and getting information about environments, stored in a new environments variable (a dict of dicts containing an environment for each environment id).
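A sketch of that boilerplate follows. Only /env/create is pinned down by what comes next, so the delete and info routes (and returning repr() of each variable, since exec() results may not be JSON-serializable) are my assumptions:

```python
environments = {}

@app.route("/env/create", methods=["POST"])
def create_env():
    # A blank dict acts as a fresh namespace for this environment id.
    environments[request.form["id"]] = {}
    return jsonify(list(environments.keys()))

@app.route("/env/delete", methods=["POST"])
def delete_env():
    del environments[request.form["id"]]
    return jsonify(list(environments.keys()))

@app.route("/env/info", methods=["POST"])
def env_info():
    env = environments[request.form["id"]]
    return jsonify({name: repr(value) for name, value in env.items()})
```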
Now, if I send a POST request to http://localhost:5000/env/create with the POST parameters set to {id: 1}, the server creates a blank dictionary for that environment id and sends me back all environments that have been created. Similarly, I could delete environments or get all available information about an environment.
Hooking this up with our code execution is pretty simple as well.
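A sketch of the updated endpoint (the ‘id’ parameter name mirrors the environment routes above):

```python
@app.route("/", methods=["POST"])
def kernel():
    # Look up the namespace for the requested environment id.
    env = environments[request.form["id"]]
    code = request.form["code"]

    buffer = StringIO()
    old_stdout = sys.stdout
    sys.stdout = buffer

    try:
        for statement in code.split("\n"):
            # Passing env as globals keeps each session's variables separate.
            exec(statement, env)
    finally:
        sys.stdout = old_stdout

    return buffer.getvalue()
```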
Note that I have now taken care to execute each code statement in the environment matching the id provided.
There is one last, glaringly obvious bug in our code: our design fails miserably when an error occurs. If you had mistyped anything so far in this tutorial, such as sending prnt('hi') to the server, you would have received a solemn 500 error with no extra information from our server. Ideally, we would much rather receive the stack trace on the client side than a response that is so opaque!
Adding error handling to our server is as simple as catching errors and printing the stack trace to standard out. We can get the stack trace by calling traceback.format_exc(). Since I like to make it blatantly obvious that an error has occurred, I watch for an error to occur, then send back the stack trace under the ‘error’ key.
We can modify our kernel method slightly to get the functionality we require.
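Here’s a sketch of the change, wrapping the exec loop and returning the trace under the ‘error’ key:

```python
@app.route("/", methods=["POST"])
def kernel():
    env = environments[request.form["id"]]
    code = request.form["code"]

    buffer = StringIO()
    old_stdout = sys.stdout
    sys.stdout = buffer

    try:
        for statement in code.split("\n"):
            exec(statement, env)
    except Exception:
        # Ship the full stack trace back instead of an opaque 500 error.
        return jsonify({"error": traceback.format_exc()})
    finally:
        # Restore stdout no matter what happened above.
        sys.stdout = old_stdout

    return buffer.getvalue()
```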
All in all, this code gets us a long way towards creating our own IPython-like server. Writing up a simple frontend to interact back and forth with the JSON-based server is outside the scope of what I was trying to do here, but it certainly isn’t hard.
As for the issues with concurrency and security, many of these could be resolved by the use of Docker containers, which allow sandboxing and could be spun up or torn down as clients connect. This sandboxing would also fix the standard out redirection issue.
Below is the final code. 52 lines of code for a fully functioning, elegant, session-based Python kernel is not too shabby if I do say so myself. Please let me know if you have any other ideas on how to simplify/improve the code.
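Putting all of the sketches above together into one listing (same assumptions as before about route and parameter names):

```python
import sys
import traceback
from io import StringIO

from flask import Flask, request, jsonify

app = Flask(__name__)
environments = {}


@app.route("/env/create", methods=["POST"])
def create_env():
    # A blank dict acts as a fresh namespace for this environment id.
    environments[request.form["id"]] = {}
    return jsonify(list(environments.keys()))


@app.route("/env/delete", methods=["POST"])
def delete_env():
    del environments[request.form["id"]]
    return jsonify(list(environments.keys()))


@app.route("/env/info", methods=["POST"])
def env_info():
    env = environments[request.form["id"]]
    return jsonify({name: repr(value) for name, value in env.items()})


@app.route("/", methods=["POST"])
def kernel():
    env = environments[request.form["id"]]
    code = request.form["code"]

    # Capture everything the submitted code prints.
    buffer = StringIO()
    old_stdout = sys.stdout
    sys.stdout = buffer

    try:
        for statement in code.split("\n"):
            exec(statement, env)
    except Exception:
        # Surface the stack trace to the client instead of a bare 500.
        return jsonify({"error": traceback.format_exc()})
    finally:
        sys.stdout = old_stdout

    return buffer.getvalue()


if __name__ == "__main__":
    app.run(port=5000)
```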
I’ll be discussing some new malleable constructs in my thesis, including Momentum ReLUs (MReLUs), Activation Pools (APs), and Parametric Activation Pools (PAPs). However, this blog post will mostly be focusing on PAPs and their performance.
Note: Consider this research in progress. I acknowledge that there probably isn’t enough experimentation to prove that these concepts generalize well across different datasets and hyperparameters. All thoughts and comments are welcome.
The idea behind parametric activation pools is simple: each neuron has multiple activation functions instead of just one, and each activation function has a branch parameter. In the case of parametric activation pools, these branch parameters $\alpha_1, \cdots, \alpha_n$ are trainable. Introducing more parameters into the fold will increase computational complexity, but hopefully it will allow us to find better fits for our network.
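To make the idea concrete, here is a minimal sketch of a two-branch pool as a PyTorch module (the framework, the initialization, and the class name are my choices; fixing alpha at constants instead of making it trainable gives the MReLU discussed below):

```python
import torch
import torch.nn as nn

class ParametricActivationPool(nn.Module):
    """Blend a pool of activation functions with trainable branch weights."""

    def __init__(self):
        super().__init__()
        # One branch weight (alpha) per activation function in the pool.
        self.alpha = nn.Parameter(torch.tensor([0.5, 0.5]))

    def forward(self, x):
        relu_branch = torch.relu(x)    # P: the usual ReLU signal
        step_branch = (x > 0).float()  # D: the "derivative of the ReLU"
        return self.alpha[0] * relu_branch + self.alpha[1] * step_branch
```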
My current intuition on why this works can be most easily explained when viewing the loss function as a function of two weights (on different layers).
In this particular example, I created a small convnet for CIFAR-10. These two pictures are representative of the general shapes I get when plotting out the weights. (Note: this is not exhaustive or typical of every weight pair, only the majority.) My activation pool has two activation functions with trainable branches: (1) a ReLU and (2) a step function (for all intents and purposes, the derivative of the ReLU). Why does the PAP have a much clearer, almost trench-like minimum where convergence is very clear? Let’s back up for a second and explain a few foundational concepts.
Imagine for a second that the branches are not trainable: what would you expect the output to look like if, say, 0.5 was going to the ReLU function ($P$) and 0.5 was going to the step function ($D$)? Essentially, we have created something akin to a PD controller (the proportional and derivative terms of a PID controller), a common design in electrical control systems. Sparing you all the boring details, one can introduce a signal that is the derivative of your original signal into the mix — this adds more stability/resistance to change in your feedback loop. I was curious as to whether this applies to neural networks, and it appears to be borne out in the empirical results.
For instance, compare this MReLU (Momentum ReLU) with fixed branching rates of 0.8 for $P$ and 0.2 for $D$ (which we expect to have a little more smoothness to it than a full-on ReLU) to this configuration of the exact same net with 0.5 for $P$ and 0.5 for $D$ (which we expect to have a much more gradual slope).
If you want another way to think about this, the step function is, in some respects, a bias, which adds a constant amount of signal (recall that the derivative of a constant is 0). Thus, if the amounts of $P$ and $D$ must sum to 1.0 (which they should), the more $D$ you have, the less $P$ you have. Furthermore, because the derivative of the $D$ branch is 0, the less $P$ you have, the smaller your gradients are. Thus, we can scale the rate of change through the activation by the constant amount of signal flowing into $D$.
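In symbols, for the two-branch pool (both branches are zero when $x \le 0$):

$$f(x) = P \cdot \mathrm{ReLU}(x) + D \cdot \mathrm{step}(x), \qquad \frac{\partial f}{\partial x} = P \quad \text{for } x > 0,$$

so with $P + D = 1$, every unit of signal routed to $D$ directly shrinks the gradient flowing through the activation.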
EDIT: another small benefit is that, in some cases, this network can learn when normal ReLUs can’t. For instance, look at this simulation with a learning rate of 0.35 (very, very high):
Now that we have covered the MReLU, assume that the neural network does have control over the branch parameters. In this way, the neural network can adjust the slope ($D \rightarrow 0$ means a steep slope, $D \rightarrow 1$ a gradual slope) and the effect of the activation ($D + P \rightarrow 0$ turns it off, and vice versa). This allows our network not only to decide which signals are important and which are not, but also to adjust how quickly each pathway in the neural network converges based on how it affects the loss function.
Let’s take a look at some results:
|                 | Loss              | Accuracy          |
|-----------------|-------------------|-------------------|
| Activation Pool | 0.0672 ± 0.0022   | 0.9788 ± 0.0007   |
| ReLU            | 0.1102 ± 0.0078   | 0.9671 ± 0.0024   |
This is just a taste, with some caveats — MNIST isn’t a particularly hard dataset and 20 epochs isn’t particularly long. However, what we can say for certain is that on this dataset, PAPs consistently converge to a better fit more quickly. I’m currently running some simulations on bigger datasets, and I’ll update this post when I have finished those simulations.
Here’s everything my research has unearthed in a concise summary:
- Momentum ReLUs
- Parametric Activation Pools using MReLU
This is just the tip of the iceberg for this concept. There are many other considerations, such as how you treat the $\alpha$ parameters (should $\alpha_i$ be the same for all nodes in a layer, or should each node choose its own value for $\alpha_i$?). That may be a topic for another blog post though.
I graduated with an undergraduate degree in EE (where calculus reigns supreme) and was thrown into probability theory late in my MS coursework. Usually, if I stare at a formula long enough, I can understand what’s going on — but despite probability theory being much lower-level math than what I did in EE, I just couldn’t seem to get my head around it. This was especially true of Bayes’ theorem. I tried many times and could never really get the idea.
I’m a visual learner, and most concepts in probability theory are expressed in a multitude of different notations and forms. Not only that, but you have to keep track of many different variables that are sometimes so close that they are hard to differentiate. For instance, is the numerator the probability of A given B or the probability of A and B? What’s the difference between those? Sure, if I sat down and thought about it for a while it would become clear — then I would sleep for a night and the concept would become opaque again.
This article aims to clear up some foundational concepts in probability (and, briefly, how they apply to computer science) as quickly as possible.
Disclaimer: Statistical junkies would declare this article amiss if I didn’t mention this — the theorems listed here assume that the events we condition on are mutually exclusive and collectively exhaustive. All this means is that every person falls into exactly one of the categories we’re considering (the vast majority of interesting problems can be framed this way).
I’ll introduce a problem to help me illustrate my points better.
Assume you have a room full of men and women: 70% of the people are women and 30% are men. Additionally, we know from polling every person that 40% of the women said their favorite color is green and 75% of the men said their favorite color is green.
With the law of total probability, we can answer the question “What % of people in the room said that their favorite color is green?” Let’s draw this problem in the form of a picture.
Let’s forget about probability theorems for a second. From this picture, how would you figure out how many people said that green was their favorite color? Simple — we can assume there is an arbitrary number of people in the room, find out how many men and women there are (based on the percentages given), find out how many of each sex chose green as their favorite color (based on the percentages given), and add those amounts together.
Assume there are 100 people in the room (so 70 women and 30 men). The number of women that chose green as their favorite color is $70 \cdot 0.4 = 28$ people. Similarly, the number of men that chose green is $30 \cdot 0.75 = 22.5$ people. Adding these together, we get $28 + 22.5 = 50.5$, so 50.5% of the people in the room chose green as their favorite color.
This, in essence, is the law of total probability. Its usefulness is now obvious: originally, we didn’t know the overall probability that a person’s favorite color is green (event $B$). BUT, we did know the probability that a person was male or female (event $A$), and we also knew the probability of $B$ for each outcome of $A$ (the favorite-color percentages of males and females). Thus, we can compute the probability of $B$ for any person by adding together the contribution from each outcome of $A$, as long as we know the answer for every possible outcome of $A$.
Formally, the equation for the law of total probability is:

$$P(B) = \sum_{i} P(B \mid A_i) \, P(A_i)$$
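In code, the room example reduces to a few lines (plain Python; the dict layout is just my bookkeeping):

```python
# P(A): the probability that a random person is a woman or a man.
p_sex = {"woman": 0.70, "man": 0.30}

# P(B | A): the probability of favoring green, given the person's sex.
p_green_given_sex = {"woman": 0.40, "man": 0.75}

# Law of total probability: P(B) = sum over A of P(B | A) * P(A).
p_green = sum(p_green_given_sex[a] * p_sex[a] for a in p_sex)
print(p_green)  # 0.505 -- i.e., 50.5% of the room favors green
```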
Bayes’ theorem is much more powerful — it allows us to reason about an unobserved part of the world based on what we have observed, when we have enough information relating the two. Let’s pick up where we left off in the last problem. We still know everything that was given, and now we know some more information thanks to the law of total probability.
Let’s consider now that we choose a random person from this room, and that the chosen person’s favorite color is green (event $B$). In this example, however, we don’t yet know the sex of the person. With Bayes’ theorem, we can answer the question “Given that a randomly selected person likes green, what is the probability that the person is female?”
Below is an updated figure based on our new knowledge of the situation.
In other words, we know the person was picked within the green circle. What is the probability that the person was picked from the red shaded area? How would you compute this without any notion of probability theory?
My approach would be to take the ratio of the red area with respect to the entire green area — and that is the approach that Bayes’ theorem takes as well. From the previous work, we know that the number of women in the green circle (given 100 people) is $70 \cdot 0.4 = 28$ people. Furthermore, we know that the number of people in the circle is $50.5$. So the probability that a female was picked from the people that like green is $\frac{28}{50.5} \approx 0.55$, or $55\%$.
Formally, Bayes’ theorem is expressed as

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
Bonus: the way I think of Bayes’ theorem is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

which is exactly the ratio we just computed: the red area divided by the whole green circle.
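Continuing the same example in code:

```python
# P(A and B): the probability a person is a woman AND favors green.
p_woman_and_green = 0.70 * 0.40  # 0.28

# P(B): the overall probability of favoring green, from above.
p_green = 0.505

# Bayes' theorem as a ratio of areas: red region over the green circle.
p_woman_given_green = p_woman_and_green / p_green
print(p_woman_given_green)  # ~0.554 -- about a 55% chance she is female
```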
Briefly, Bayes’ theorem is the foundational theorem in the field of Bayesian inference. After establishing a firm method of relating the outcome of a known event to an unknown event, we can observe the relationship between the two events (and vice versa). Using Bayes’ rule, we can update our knowledge about how these two events are related. These ideas belong to a broader school of thought called Bayesian statistics, which helps us build advanced statistical models using techniques like Markov chain Monte Carlo methods and the No-U-Turn sampler. If you would like to try these techniques out, I recommend you use an open source library like PyMC3 instead of coding one up yourself.
There are many other foundational concepts not covered here, like the union bound (Boole’s inequality) and the inclusion-exclusion principle. However, these concepts are mostly useful for building up other theorems (including the ones laid out in this article) and not in many practical applications. With a strong understanding of Bayes’ theorem, you are in a good position to dive into the deeper field of probabilistic programming.