Few topics have given me as much trouble as Bayes’ theorem over the past couple of years.

I graduated with an undergraduate degree in EE (where calculus reigns supreme) and was thrown into probability theory late in my MS coursework. Usually, if I stare at a formula long enough, I can understand what’s going on. But despite probability being much lower-level math than what I did in EE, I just couldn’t seem to get my head around it. This was *especially* true of Bayes’ theorem. I tried many times and could never really grasp the idea.

I’m a visual learner, and most concepts in probability theory are expressed in a multitude of different notations and forms. Not only that, but you have to keep track of many different variables that are sometimes so similar that they are hard to tell apart. For instance, is the numerator the probability of A **given** B or the probability of A **and** B? What’s the difference between those? Sure, if I sat down and thought about it for a while it would become clear; then I would sleep for a night and the concept would become opaque again.

This article aims to clear up some foundational concepts in probability (and, briefly, how they apply to computer science) as quickly as possible.

## Probability Theory

**What?** *Probability theory* is a branch of mathematics concerned with random processes (also known as *stochastic processes*). **Why?** Most phenomena that have yet to happen in the real world can be expressed as *probability distributions*. Therefore, probability theory can be useful in almost any scenario where we would like to *predict* something. **How?** The union of probability theory and computer science is a field called *probabilistic programming*. Through probabilistic programming techniques, we can estimate, within a reasonable doubt, the probability that something happens.

## Relevant Theorems

**Disclaimer**: Statistical junkies would declare this article amiss if I didn’t mention this — the theorems listed here assume that the events we condition on (the $A_n$ below) are mutually exclusive and exhaustive. All this means is that each person falls into exactly one category, and the categories together cover everyone (the vast majority of interesting problems fit that definition).

I’ll introduce a problem to help me illustrate my points better.

Assume you have a room full of men and women. 70% of the people are women and 30% are men. Additionally, we know from polling every person that 40% of the women and 75% of the men say their favorite color is green.

## Law of total probability

With the law of total probability, we can answer the question “What % of people in the room said that their favorite color is green?” Let’s draw this problem in the form of a picture.

Let’s forget about probability theorems for a second. From this picture, how would you figure out how many people said that green was their favorite color? Simple — we can assume an arbitrary number of people in the room, find out how many men and women there are (based on the percentages given), find out how many of each sex chose green as their favorite color (based on the percentages given), and add those amounts together.

Assume there are 100 people in the room (so 70 women and 30 men). The number of women that chose green as their favorite color is $70 \cdot 0.4 = 28$ people. Similarly, we can calculate the number of men that liked green as $30 \cdot 0.75 = 22.5$ people. Adding these together, we get $28 + 22.5 = 50.5$, so 50.5% of the total number of people in the room chose green as their favorite color.

This, in essence, is the law of total probability. The usefulness of the law of total probability is now obvious: Originally, we didn’t know what the overall probability of a person having green as a favorite color (event $B$) was. **BUT**, we *did* know the probability that a person was male or female (event $A$) and we *also* knew the probability of $B$ for each value of $A$ (favorite color percentage of males and females). Thus, we can learn something about the probability of $B$ for any human by adding together the probabilities of each outcome of $A$ **if** we know that answer for every possible outcome of $A$.

Formally, the equation for the law of total probability is:

$P(B) = \sum \limits_{n} P(B | A_n) \cdot P(A_n)$

in this case

$P(\text{person likes green}) = \sum P(\text{a sex is chosen}) \cdot P(\text{that sex likes green})$

or

$P(\text{person likes green}) = 0.7 \cdot 0.4 + 0.3 \cdot 0.75 = 0.505$
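The sum above translates directly into code. Here is a minimal Python sketch of the law of total probability (the function and variable names are mine, not standard):

```python
def total_probability(priors, conditionals):
    """P(B) = sum_n P(B | A_n) * P(A_n), where priors[n] = P(A_n)
    and conditionals[n] = P(B | A_n)."""
    return sum(p_a * p_b_given_a for p_a, p_b_given_a in zip(priors, conditionals))

# P(woman) = 0.7, P(man) = 0.3; P(green | woman) = 0.4, P(green | man) = 0.75
p_green = total_probability([0.7, 0.3], [0.4, 0.75])
print(round(p_green, 3))  # 0.505
```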

## Bayes’ theorem

Bayes’ theorem is much more powerful — it allows us to reason about the unknown world based on what we can observe **when we have enough information relating the two**. Let’s pick up where we left off in the last problem. We still know everything that was given **and** now we know some more information thanks to the law of total probability.

Let’s consider now that we chose a random person from this room, and that the chosen person’s favorite color is green (event $B$). In this example, however, we **don’t** yet know the sex of the person. With Bayes’ theorem, we can answer the question “Given that a randomly selected person likes green, what is the probability that the person is a woman?”

Below is an updated figure based on our new knowledge of the situation.

In other words, we know the person was picked within the green circle. What is the probability that the person was picked from the red shaded area? How would you compute this without any notion of probability theory?

My approach would be to take the ratio of the red area with respect to the entire green area — and that is the approach that Bayes’ theorem takes as well. From the previous work, we know that the number of women in the green circle (given 100 people) is $70 \cdot 0.4 = 28$ people. Furthermore, we know that the number of people in the circle is $50.5$ people. So the probability that a woman was picked from the people that liked green is $\frac{28}{50.5} \approx 0.55$ or $55\%$.
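This counting argument can be checked directly. A quick sketch, assuming the same 100-person room (22.5 is fine as an expected count):

```python
# Expected counts in a room of 100 people
women_green = 70 * 0.40   # women who like green
men_green = 30 * 0.75     # men who like green

# Ratio of the "women" slice of the green circle to the whole circle
p_woman_given_green = women_green / (women_green + men_green)
print(round(p_woman_given_green, 2))  # 0.55
```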

Formally, Bayes’ theorem is expressed as

$P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)}$

However, the key to really understanding Bayes’ theorem is recognizing that the denominator is actually just the law of total probability! So the equation can also be expressed this way

$P(A | B) = \frac{P(B | A) \cdot P(A)}{\sum \limits_{n} P(B | A_n) \cdot P(A_n)}$

Think of Bayes’ formula this way: the numerator is the section in the green circle that we are focused on, and the denominator is all of the pieces of the green circle (including the piece we are looking at) summed together. So Bayes’ theorem is just a ratio.

**Bonus**: the way I think of Bayes’ theorem is

$P(A | B) = \frac{\text{focused piece of circle}}{\text{focused piece of circle} + \text{rest of the circle pieces}}$
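Putting the formula into code makes the ratio explicit. A small sketch (the helper name `bayes` is my own choice):

```python
def bayes(p_a, p_b_given_a, p_b):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# P(woman | green): prior P(woman) = 0.7, likelihood P(green | woman) = 0.4,
# and evidence P(green) = 0.505 from the law of total probability
print(round(bayes(0.7, 0.4, 0.505), 2))  # 0.55
```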

## Application to Computer Science

Briefly, Bayes’ theorem is the foundational theory behind the field of Bayesian inference. After establishing a firm method for relating the outcome of a known event to an unknown event, we can observe one and learn about the other (and vice-versa). Using Bayes’ rule, we can update our knowledge about how these two events are related as new data arrives. These ideas belong to a broader school of thought called Bayesian statistics, which helps us build advanced statistical models using techniques like Markov chain Monte Carlo methods and the No-U-Turn sampler. If you would like to try these techniques out, I recommend using an open-source library like PyMC3 instead of coding one up yourself.
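As a tiny taste of Bayesian updating (a hand-rolled sketch, not PyMC3): if we put a Beta(α, β) prior on a coin’s probability of heads, observing Bernoulli flips keeps the posterior in the Beta family, so the Bayes update reduces to counting:

```python
def update_beta(alpha, beta, flips):
    """Conjugate Bayesian update for a Bernoulli likelihood:
    each heads (1) increments alpha, each tails (0) increments beta."""
    for flip in flips:
        if flip:
            alpha += 1
        else:
            beta += 1
    return alpha, beta

# Start from a flat Beta(1, 1) prior and observe three heads, one tail
alpha, beta = update_beta(1, 1, [1, 1, 0, 1])
posterior_mean = alpha / (alpha + beta)  # posterior estimate of P(heads)
print(alpha, beta, round(posterior_mean, 3))  # 4 2 0.667
```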

## Conclusion

There are many other foundational concepts not covered here, like the union bound (Boole’s inequality) and the inclusion-exclusion principle. However, these concepts are mostly useful for building up the theory (including the theorems laid out in this article) rather than in practical applications. With a strong understanding of Bayes’ theorem, you are in a good position to dive into the deeper field of probabilistic programming.