8 Terms You Should Know About Bayesian Neural Networks

Photo by cyda

Goal

In the last article, we gave an introduction to the Bayesian Neural Network (BNN). If you are new to BNNs, make sure you check out the link below to get familiar with the difference between a Standard Neural Network (SNN) and a BNN.

Today, we will jump to the core and learn the mathematics behind it. In this article, you will learn the BNN-related terms that cover…

  1. How we leverage the concept of Bayesian inference to update the probability distributions of the model weights and outputs.
  2. What loss function we use for a Bayesian Neural Network to optimize the model.
  3. Which techniques and methods are used in real-life scenarios to tackle the unknown-distribution problem.

Bayesian Inference

From the previous article, we know that a Bayesian Neural Network treats the model weights and outputs as random variables. Instead of finding a single set of optimal estimates, we fit probability distributions for them.

But the problem is: “How can we know what their distributions look like?” To answer this, you have to learn what the prior, the posterior, and Bayes’ theorem are. In the following, we will use an example for illustration. Suppose there are two classes, a Science class and an Art class, and each classmate either wears glasses or does not. If we now pick one classmate at random, can you tell the probability that this classmate wears glasses?

Photo by cyda
1. Prior Probability (Prior)

Prior expresses one’s beliefs before considering any evidence. So, without any further information, you may guess that the probability of the classmate wearing glasses is 0.5, since (30 + 20) / (30 + 20 + 15 + 35) = 50/100 = 0.5. Here, we call this 0.5 the prior probability.

2. Posterior Probability (Posterior)

Posterior expresses one’s beliefs after considering some evidence. Let’s continue with the example above. What if I now tell you that the classmate is actually from the Science class? What do you think the probability of that classmate wearing glasses is now? With this extra information, you may change your belief and update the probability, right? This updated probability is what we call the posterior probability.

3. Bayes’ Theorem

Bayes’ theorem is the mathematical formula used to update the prior probability into the posterior probability based on the evidence.

Photo by cyda

A is the event we are interested in, “the classmate wears glasses”, while X is the evidence, “the classmate is in the Science class”.
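
For reference, Bayes’ theorem in this notation reads

    P(A | X) = P(X | A) × P(A) / P(X)

where P(A) is the prior probability, P(X | A) is the likelihood of the evidence given the event, P(X) is the overall probability of the evidence, and P(A | X) is the posterior probability we want.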

Photo by cyda

So now you understand how the posterior is updated based on evidence. For a Bayesian Neural Network, the posterior probability of the weights is computed as

Photo by cyda
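
In the same notation, with w denoting the model weights and D the observed training data, the update takes the familiar form

    P(w | D) = P(D | w) × P(w) / P(D)

Here P(w) is the prior over the weights, P(D | w) is the likelihood of the data given the weights, P(D) is the marginal probability of the data (the evidence), and P(w | D) is the posterior distribution we want to learn.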

Loss Function

You now understand the formula for updating the weights and outputs, but we are missing one important thing: how to evaluate the estimated probability distribution. In the following, we will discuss two key measures that are often used in BNNs.

4. Negative Log-Likelihood

For regression problems, we typically use Mean Squared Error (MSE) as the loss function in an SNN, since we only have a point estimate. However, we do something different in a BNN. Because we have a predicted distribution, we use the negative log-likelihood as the loss function.

Photo by cyda

Okay, let’s break the term down piece by piece.

The likelihood is the joint probability of the observed data, viewed as a function of the predicted distribution. In other words, it tells us how likely it is that the data are distributed just like our predicted distribution. The larger the likelihood, the better our predicted distribution fits the data.

Photo by cyda

As for the log-likelihood, we use it because it makes the calculation easier. By leveraging the properties of logarithms (log ab = log a + log b), we can use summation instead of multiplication.

Last but not least, we add the negative sign to form the negative log-likelihood because, in machine learning, we conventionally optimize by minimizing a cost or loss function rather than maximizing an objective.
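
To put the three pieces together, suppose we have N observations y_1, …, y_N and the model predicts a distribution p(y | x, w). Then (a sketch of the quantities the images above illustrate):

    Likelihood:              L = p(y_1 | x_1, w) × … × p(y_N | x_N, w)
    Log-likelihood:          log L = Σ_i log p(y_i | x_i, w)
    Negative log-likelihood: NLL = − Σ_i log p(y_i | x_i, w)

Since the upcoming articles will use TensorFlow Probability, here is a minimal sketch of how this loss is usually written there. The function and variable names are my own illustration, not code from this series; the only assumption is that the model outputs a distribution object instead of a point estimate.

    import tensorflow_probability as tfp

    tfd = tfp.distributions

    # Negative log-likelihood loss: the model's output is a distribution,
    # so the loss is simply -log p(y_true) under that distribution.
    def negative_log_likelihood(y_true, predicted_distribution):
        return -predicted_distribution.log_prob(y_true)

    # Example: NLL of the observation 0.5 under a predicted Normal(0, 1)
    nll = negative_log_likelihood(0.5, tfd.Normal(loc=0., scale=1.))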

5. Kullback-Leibler Divergence (KL Divergence)

The KL divergence quantifies how different one distribution is from another. Let’s say p is the true distribution while q is the predicted distribution. The KL divergence is simply equal to the cross-entropy between the two distributions minus the entropy of the true distribution p. In other words, it tells us how much further the predicted distribution q can be improved.

Photo by cyda

For those who are unfamiliar with entropy and cross-entropy: simply speaking, entropy is the lower bound on the “cost” of representing the true distribution p, while cross-entropy is the “cost” of representing the true distribution p using the predicted distribution q. It follows that the KL divergence represents how much further the “cost” of the predicted distribution q can be reduced.

Back to today’s focus: p refers to the true distribution of the model weights and outputs, while q is our predicted distribution. We use the KL divergence to measure the difference between the two distributions so that we can update our predicted distribution.
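
For a discrete distribution, the relationship described above can be written as

    KL(p || q) = Σ_x p(x) × log( p(x) / q(x) ) = H(p, q) − H(p)

where H(p, q) = − Σ_x p(x) log q(x) is the cross-entropy and H(p) = − Σ_x p(x) log p(x) is the entropy of the true distribution p. As a small illustration of how this quantity shows up in code, TensorFlow Probability can compute the KL divergence between many distribution pairs analytically; the two Normals below are made-up stand-ins for p and q.

    import tensorflow_probability as tfp

    tfd = tfp.distributions

    p = tfd.Normal(loc=0., scale=1.)    # stand-in for the true distribution
    q = tfd.Normal(loc=1., scale=1.5)   # stand-in for the predicted distribution

    # KL(p || q): how far the predicted distribution q is from the true p
    kl = tfd.kl_divergence(p, q)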


Problem & Solution

Unfortunately, the marginal probability P(D) is in general intractable, as it is hard to find a closed form for the integral below. As a consequence, for a complex system, the posterior P(w | D) is also intractable.

Photo by cyda
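
Written out, the integral marginalizes the likelihood over the prior:

    P(D) = ∫ P(D | w) P(w) dw

For a neural network, w is extremely high-dimensional, so this integral has no closed form in general, and the denominator of Bayes’ theorem cannot be evaluated directly.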

To tackle this problem, statisticians developed a method called Variational Inference, which approximates the true posterior distribution with a surrogate model by maximizing the evidence lower bound.

Don’t worry about these new terms. I will explain them one by one.

6. Surrogate

A surrogate model is a simple model used in place of the complex model we are interested in. It is easy to work with while still being a good approximation of the complex model. Generally speaking, a surrogate model is chosen from a standard family of statistical distributions, so we have analytical solutions for it.
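
A common concrete choice, and a typical assumption for BNNs (though not the only possible one), is a fully factorized, or mean-field, Gaussian surrogate:

    q(w) = Π_i N(w_i ; μ_i, σ_i²)

Each weight w_i then has its own learnable mean μ_i and standard deviation σ_i, and quantities such as the KL divergence against a Gaussian prior can be computed analytically.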

Photo by cyda
7. Variational Inference (VI)

Variational Inference is the idea of using a variational distribution q* to replace the true posterior distribution p(w|D). But there are so many possible surrogate models; how can we ensure q* is good enough to represent p(w|D)? The answer is simple: we can use the KL divergence we just learned.

Among the family of surrogate models Q, we are trying to find the optimal one q* such that

Photo by cyda
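
In symbols, this objective is the standard variational-inference criterion (a reconstruction of the formula in the image, so the exact notation may differ):

    q* = argmin_{q ∈ Q} KL( q(w) || p(w | D) )

that is, the member of the surrogate family Q whose KL divergence to the true posterior is smallest.
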
8. Evidence Lower Bound (ELBO)

However, the same problem still exists, since we do not have the posterior probability distribution. What we can do is rewrite the KL divergence as

Photo by cyda

By considering

Photo by cyda

We can summarize these into the formula below.

Photo by cyda

Given that the KL divergence is non-negative, and that the evidence is a probability between 0 and 1 so its logarithm must be non-positive, we can deduce that L(w) is a lower bound on the log evidence (and hence on the evidence itself). This is why we call it the “Evidence Lower Bound”. In other words, we can now find the optimal q* by optimizing

Photo by cyda
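
Putting the pieces of this section together in standard notation (a sketch of what the formulas in the images express, with E_q denoting an expectation under the surrogate q):

    KL( q(w) || p(w | D) ) = log p(D) − L(w)

    L(w) = E_q[ log p(D | w) ] − KL( q(w) || p(w) )

Since KL( q(w) || p(w | D) ) ≥ 0, we get L(w) ≤ log p(D), which is exactly the lower-bound property described above. Maximizing the ELBO L(w) is therefore equivalent to minimizing the KL divergence between the surrogate and the true posterior:

    q* = argmax_{q ∈ Q} L(w)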

Conclusion

Congratulations if you have finished reading all of the content above! Hopefully you now have a fundamental understanding of the mathematical concepts behind the Bayesian Neural Network. In the upcoming articles, I will focus more on the coding perspective: how to use TensorFlow Probability to build a BNN model. Stay tuned! =)
