Orangele's Blog.

Maximum Likelihood Estimation, Maximum A Posteriori Estimation and Bayesian Estimation

2019/09/07

The basic problem we study in probability:
Given a data generating process, what are the properties of the outcomes?
The basic problem of statistical inference is the inverse of probability:
Given the outcomes, what can we say about the process that generated the data?

The definition above concisely tells the difference between probability and statistics.

In machine learning, we are given samples and their labels, and want to estimate the model and parameters that are most likely to describe the relation between them. Machine learning is therefore a statistical inference problem. For parameter estimation, there are three common methods that often confuse learners:

  • Maximum Likelihood Estimation (MLE)
  • Maximum A Posteriori Estimation (MAP)
  • Bayesian Estimation (BE)

In this article I’m going to give a gentle explanation of these three concepts and show how they differ from each other.

Background

First, let’s see some background knowledge.

Bayes’ Theorem

In probability theory and statistics, Bayes’ theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. It states:

$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

where:

  • $P(A|B)$ is a conditional probability, also called the posterior probability. It means the probability of event A occurring given that B is true.
  • $P(B|A)$ is also a conditional probability: the probability of event B occurring given that A is true.
  • $P(A)$ and $P(B)$ are the probabilities of observing A and B independently of each other. They are also called marginal probabilities; $P(A)$ is the prior probability.
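As a toy numeric illustration (the numbers below are my own, not from the article), consider a test for some condition, where A is "has the condition" and B is "test is positive":

```python
# Hypothetical numbers chosen only for illustration.
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.95      # P(B|A): positive test given the condition
p_b_given_not_a = 0.05  # P(B|not A): false positive rate

# Marginal probability P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))
```

Even with an accurate test, the small prior $P(A)$ keeps the posterior modest, which is exactly the kind of prior influence MAP and BE exploit below.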

Probability Function & Likelihood Function

For the function $P(x|\theta)$, we can view it in two aspects:

  • If $\theta$ is known and invariable and $x$ is variable, $P(x|\theta)$ is a probability function which represents the probability of observing different $x$ under a fixed $\theta$.
  • If $x$ is known and invariable and $\theta$ is variable, then $P(x|\theta)$ is called a likelihood function which represents the probability of observing a specific $x$ under different $\theta$. It can also be written as $L(\theta|x)$.

Conjugate Prior Distributions

In Bayesian probability theory, if the posterior distribution $p(\theta|x)$ is in the same family as the prior probability distribution $p(\theta)$, the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function.

Here are some common conjugate prior distributions:

| Distribution | Parameter | Conjugate prior |
| --- | --- | --- |
| Binomial | Probability | Beta |
| Multinomial | Probability | Dirichlet |
| Poisson | Mean | Gamma |
| Exponential | Inverse of mean | Gamma |
| Gaussian | Mean | Gaussian |
| Gaussian | Variance | Inverse-gamma |

Beta Distribution

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by $\alpha$ and $\beta$. Its probability density function is:

$$f(x;\alpha,\beta)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$$

where $B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is a normalization constant.

A random variable $X$ beta-distributed with parameters $\alpha$ and $\beta$ is denoted by: $X \sim Beta(\alpha,\beta)$
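A small sketch (the function name is mine) evaluating this density with only the standard library, using $B(\alpha,\beta)=\Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$:

```python
from math import gamma

def beta_pdf(x, alpha, beta):
    """Density of Beta(alpha, beta) at x in (0, 1)."""
    b = gamma(alpha) * gamma(beta) / gamma(alpha + beta)  # B(alpha, beta)
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / b

# Beta(3, 3) is symmetric around 0.5, where its density peaks;
# Beta(1, 1) reduces to the uniform distribution on [0, 1].
print(beta_pdf(0.5, 3, 3))
print(beta_pdf(0.2, 1, 1))
```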

A Typical Example

Here is a common and basic example for studying such statistical problems. Let’s assume we have a coin, and we want to estimate the probability $\theta$ of it landing heads up in a coin tossing game. We have done 10 experiments, in which heads showed 6 times and tails showed 4 times. We record the results as $X=\{x_1,x_2,\dots,x_{10}\}$, where $x_i=1$ stands for heads and $x_i=0$ for tails.

Maximum Likelihood Estimation

MLE is a method trying to find the parameters which maximize the probability of the observed data occurring. For an independently and identically distributed sample set, its likelihood function is:

$$L(X|\theta)=P(X|\theta)=\prod_{i=1}^{n}P(x_i|\theta)$$

More specifically, for our coin tossing example, it is:

$$L(X|\theta)=\theta^{6}(1-\theta)^{4}$$

By solving $\frac{\partial L(X|\theta)}{\partial \theta} =0$, we can get the $\hat\theta$ we want.

To make it easier to solve, we can convert the likelihood function to a log-likelihood function:

$$l(X|\theta)=\log L(X|\theta)=6\log\theta+4\log(1-\theta)$$

Then by solving $\frac{\partial l(X|\theta)}{\partial \theta} =0$ we can get $\hat\theta=0.6$. We can use the log because the logarithm is monotonically increasing, so it does not change the location of the maximum.
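As a quick sketch (the variable names are mine, not from the article), we can verify the closed-form result $\hat\theta=6/10$ with a coarse grid search over the log-likelihood:

```python
from math import log

heads, tails = 6, 4

def log_likelihood(theta):
    # l(X|theta) = 6*log(theta) + 4*log(1 - theta)
    return heads * log(theta) + tails * log(1 - theta)

# Closed form: solving 6/theta - 4/(1 - theta) = 0 gives theta = 6/10
theta_mle = heads / (heads + tails)

# Sanity check: maximize the log-likelihood on a grid over (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)
print(theta_mle, theta_grid)
```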

However, you may not believe the result, because we all know that the probability should be 0.5! But wait: we have no evidence that the coin is a perfectly normal one. Maybe its density is not uniform? The only TRUTH we have is the experiment result, but that result conflicts with our common BELIEF. That is why we need MAP.

Maximum A Posteriori Estimation

Instead of $P(X|\theta)$, the cost function MAP wants to maximize is $P(X|\theta)P(\theta)$.
In MAP, $\theta$ is supposed to be a random variable obeying some distribution. This is a prior distribution, which comes from some assumption or our experience. By maximizing $P(X|\theta)P(\theta)$, we not only consider the likelihood function $P(X|\theta)$, but also want to increase the probability of the appearance of $\theta$ itself. It is somewhat similar to the concept of regularization in machine learning. The difference is that in regularization the regularization term is added, but here the prior is multiplied.

If we transform the cost function $P(X|\theta)P(\theta)$ into $\frac{P(X|\theta)P(\theta)}{P(X)}$, where $P(X)$ is a constant value (obtained from the samples), we can find that based on Bayes’ theorem $\frac{P(X|\theta)P(\theta)}{P(X)}$ is actually $P(\theta|X)$, the posterior probability of $\theta$. This is why we call the method Maximum A Posteriori Estimation.

In our coin tossing example, we know the probability that $\theta=0.5$ is very high, thus let’s assume that $P(\theta)$ is a Gaussian density with mean 0.5 and standard deviation 0.1:

$$P(\theta)=\frac{1}{\sqrt{2\pi}\cdot 0.1}e^{-\frac{(\theta-0.5)^2}{2\times 0.1^2}}$$

As $P(X|\theta)=\theta^6(1-\theta)^4$, we can get:

$$P(X|\theta)P(\theta)=\theta^{6}(1-\theta)^{4}\cdot\frac{1}{\sqrt{2\pi}\cdot 0.1}e^{-\frac{(\theta-0.5)^2}{2\times 0.1^2}}$$

Solving $\frac{\partial \log(P(X|\theta)P(\theta))}{\partial\theta}=0$, we can get $\hat\theta \approx 0.529$.
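A minimal numerical check of this result (a sketch assuming 0.1 is the standard deviation of the prior; names are mine): the MAP estimate is the root of the derivative of the log posterior, which we can find by bisection since that derivative is strictly decreasing on (0, 1):

```python
heads, tails = 6, 4
mu, sigma = 0.5, 0.1  # assumed Gaussian prior N(0.5, 0.1)

def d_log_posterior(theta):
    # d/dtheta [6*log(theta) + 4*log(1 - theta) - (theta - mu)^2 / (2*sigma^2)]
    return heads / theta - tails / (1 - theta) - (theta - mu) / sigma ** 2

# Bisect between two points where the derivative has opposite signs
lo, hi = 0.5, 0.7
for _ in range(60):
    mid = (lo + hi) / 2
    if d_log_posterior(mid) > 0:
        lo = mid
    else:
        hi = mid
theta_map = (lo + hi) / 2
print(round(theta_map, 3))
```

The root lands between the MLE (0.6) and the prior mean (0.5), as expected: the prior pulls the estimate toward 0.5.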

Now the result seems better, but we need to notice that if we assume $P(\theta)\sim N(0.6,0.1)$, $\hat\theta$ will be 0.6, and if we assume $P(\theta)\sim Beta(3,3)$, $\hat\theta$ will be approximately 0.57. This shows that in MAP, a reasonable prior distribution is the key factor for a good estimation.

Since introducing the prior distribution greatly increases the uncertainty in the choice of $\theta$, is there any way to estimate the possible distribution of $\theta$ instead of an exact value? Yes: that is what Bayesian estimation does.

Bayesian Estimation

Bayesian estimation is a further extension of MAP, where the distribution of the parameter $\theta$ is estimated. In BE, $P(X)$ cannot be ignored as in MAP. As $X$ is known, the distribution of $\theta$ is $P(\theta|X)$. By Bayes’ formula:

$$P(\theta|X)=\frac{P(X|\theta)P(\theta)}{P(X)}$$

For a continuous random variable $\theta$, $P(X)= \int_{\Theta}P(X|\theta)P(\theta)d\theta$, and Bayes’ formula becomes:

$$P(\theta|X)=\frac{P(X|\theta)P(\theta)}{\int_{\Theta}P(X|\theta)P(\theta)d\theta}$$

The calculation in BE really depends on a suitable choice of $P(\theta)$. For our coin tossing example, we can choose a conjugate prior distribution. The conjugate prior for the binomial distribution is the Beta distribution, so we can assume $P(\theta) \sim Beta(\alpha, \beta)$. The probability density formula for the Beta distribution is:

$$P(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$$

So, Bayes’ formula becomes:

$$P(\theta|X)=\frac{\theta^{6}(1-\theta)^{4}\cdot\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\int_{\Theta}\theta^{6}(1-\theta)^{4}\cdot\theta^{\alpha-1}(1-\theta)^{\beta-1}d\theta}=\frac{\theta^{\alpha+6-1}(1-\theta)^{\beta+4-1}}{B(\alpha+6,\beta+4)}=Beta(\theta|\alpha+6,\beta+4)$$

$Beta(\theta|\alpha+6,\beta+4)$ is the resulting distribution of BE.

If the resulting distribution has a finite mean, we can use its expectation as an estimate for $\theta$. For example, if $\alpha=3, \beta=3$, the posterior is $Beta(\theta|9,7)$, and by the expectation formula for the Beta distribution, $E(\theta)=\frac {\alpha} {\alpha+\beta}$, we can get:

$$\hat\theta=E(\theta)=\frac{3+6}{(3+6)+(3+4)}=\frac{9}{16}\approx 0.563$$
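Thanks to conjugacy, the whole Bayesian update is just parameter addition, so a sketch is trivial (variable names are mine):

```python
# Prior Beta(alpha, beta); observing 6 heads and 4 tails gives
# posterior Beta(alpha + 6, beta + 4) by conjugacy.
alpha, beta = 3, 3
heads, tails = 6, 4

alpha_post = alpha + heads  # 3 + 6 = 9
beta_post = beta + tails    # 3 + 4 = 7

# Posterior mean E(theta) = alpha' / (alpha' + beta')
posterior_mean = alpha_post / (alpha_post + beta_post)
print(posterior_mean)
```

Unlike MLE and MAP, the output here is a full distribution, Beta(9, 7); the mean is only one convenient summary of it.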

Conclusion

From MLE and MAP to BE, the information used increases. MLE and MAP both try to find a certain value of $\theta$ that maximizes their cost function, while BE tries to get a distribution of $\theta$. In MLE, we don’t consider the prior distribution of $\theta$, while in MAP and BE we do. In MAP, we have to assume a certain distribution for $\theta$, but in BE we make use of a conjugate prior distribution.

| Type | MLE | MAP | BE |
| --- | --- | --- | --- |
| $\hat{\theta}$ | certain value | certain value | distribution |
| $f$ | $P(X\mid \theta)$ | $P(X\mid \theta)P(\theta)$ | $\frac {P(X\mid \theta)P(\theta)}{P(X)}$ |
| $P(\theta)$ | uniform distribution | any reasonable prior distribution | conjugate prior distribution |

MLE is simple and works well with a lot of random samples, but it is heavily biased on small sample sets. Solutions for MAP and especially for BE are more complicated, but they are less biased given a good prior assumption.
