Explaining the likelihood, showing its derivation and its use in a demographic context.
In this post, my aim is to explain what the likelihood is and describe how this concept is used in frequentist statistics to obtain an estimator for a parameter. At the end of the post, we will look at a demographic application.
Let’s assume that we are interested in coin flips. Hence, an observation (\(y_i\)) is the outcome of a coin flip and our data \(\{y_1, y_2, \dots, y_N\}\) consists of all \(N\) coin flips. The first thing to do is to assume a distribution for the outcome of a coin flip. The distribution commonly used for a binary observation \(y_i\) taking only the following two values
\[\begin{equation} y_{i} = \begin{cases} 1 & \text{if head}\\ 0 & \text{if tail} \end{cases} \end{equation}\]
is the Bernoulli distribution:
\[ \begin{align} y_i & \sim Bernoulli(p) \\ \Leftrightarrow P(y_i) & = p^{y_i}(1-p)^{1-y_i} = \begin{cases} p & \text{if $y_i$ = 1} \\ (1-p) & \text{if $y_i$ = 0} \end{cases} \end{align} \]
where \(p\) represents the probability of observing head.
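As a quick numerical illustration of the Bernoulli probabilities above, here is a minimal Python sketch (the value of \(p\) is an arbitrary illustrative choice, not taken from the post):

```python
# A minimal sketch of the Bernoulli probability mass function above;
# p = 0.5 is an arbitrary illustrative value.
from scipy.stats import bernoulli

p = 0.5                         # probability of observing a head
print(bernoulli.pmf(1, p))      # P(y_i = 1) = p       -> 0.5
print(bernoulli.pmf(0, p))      # P(y_i = 0) = 1 - p   -> 0.5
```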
Ok, that’s fine, but we are working with a data set and thus \(N\) coin flips. What is the probability of observing the \(N\) coin flips we obtained? We can reasonably say that each coin flip is independent and, since the coin used is always the same, that all flips follow the same Bernoulli distribution (the famous i.i.d. assumption). These two assumptions are critical (hence their frequent appearances in statistics classes) and allow us to express the probability of observing the data as the product of the probabilities of the individual coin flips:
\[ \begin{align} P(data) & = \prod^N_{i=1} P(y_i) \\ & = \prod^N_{i=1} p^{y_i}(1-p)^{1-y_i} \\ & = p^{\sum^N_{i=1} y_i}(1-p)^{\sum^N_{i=1} (1-y_i)} \\ & = p^{Y}(1-p)^{N-Y} \end{align} \] where \(Y = \sum^N_{i=1} y_i\) is the number of heads. We need to multiply the above expression by \(\binom{N}{Y}\) to account for the fact that the \(Y\) heads could have been observed in different orders among the \(N\) coin flips. The resulting quantity is the likelihood:
\[L(p|Y,N) = \binom{N}{Y}p^Y(1-p)^{N-Y}\]
\(L(p|Y,N)\) is a function of the parameter \(p\) (the probability of heads) given \(Y\) and \(N\) (the data). It represents how likely it is to observe the actual data set for the different possible values of \(p\). Let’s check that visually by assuming that our data consists of 22 heads out of 60 flips.
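If you want to reproduce such a curve yourself, a possible sketch in Python is shown below (the grid of \(p\) values and the plotting choices are illustrative assumptions, not taken from the original post):

```python
# A possible way to reproduce the likelihood curve discussed below
# (the grid and plotting details are illustrative choices).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

Y, N = 22, 60                        # 22 heads out of 60 flips
p_grid = np.linspace(0, 1, 500)      # candidate values for p
likelihood = binom.pmf(Y, N, p_grid) # L(p | Y, N) = C(N, Y) p^Y (1-p)^(N-Y)

plt.plot(p_grid, likelihood)
plt.xlabel("p")
plt.ylabel("L(p | Y, N)")
plt.show()
```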
As is clear from the resulting figure, the likelihood is a function of the parameter \(p\) (bounded between 0 and 1 as it is a probability) given the observed data. Although it looks like a probability distribution, it is not one: the area under the likelihood curve does not equal one.
Now suppose, as is commonly the case when performing analyses, that the parameter of interest (here \(p\)) is not known. The way frequentist statistics obtains an estimator for \(p\) is by maximizing the probability of observing the data we have, which, as you now know at this stage of the post, is the likelihood. The resulting estimator is called the maximum likelihood estimator (MLE). Looking at the figure, the highest value of \(L(p|Y,N)\) seems to be reached around \(p=0.37\). In order to obtain its precise value, let’s find the argument (the value of \(p\)) that maximizes \(L(p|Y,N)\):
\[p^{MLE} = \underset{p}{\operatorname{argmax}} \; p^Y(1-p)^{N-Y}\]
The above expression does not contain \(\binom{N}{Y}\) as this term does not depend on \(p\) and hence plays no role in the optimization. To find the maximum of a function, we take its derivative with respect to the parameter of interest and set it to 0. In practice, it is easier to first transform the likelihood into the log-likelihood \(l(p|Y,N)\) (a monotonic transformation) and then find the value of the parameter maximizing it. Let’s do that:
\[ \begin{align} l(p|Y,N) = \log(L(p|Y,N)) & = \log\left(p^Y(1-p)^{N-Y}\right) \\ & = Y\log(p) + (N-Y)\log(1-p) \end{align} \]
Taking the derivative and setting it to zero to find the maximum,
\[ \begin{align} & \frac{\partial l(p|Y,N)}{\partial p} = \frac{Y}{p} - \frac{N-Y}{1-p} = 0 \\ & \Leftrightarrow Y(1-p) - (N-Y)p = 0 \\ & \Rightarrow p^{MLE} = \frac{Y}{N} = \frac{22}{60} \approx 0.367 \end{align} \]
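As a sanity check (not part of the original derivation), we can verify this closed-form result numerically by minimizing the negative log-likelihood with a general-purpose optimizer:

```python
# A quick numerical check of the closed-form result p_MLE = Y / N
# (this verification is an add-on; the derivation above is analytical).
import numpy as np
from scipy.optimize import minimize_scalar

Y, N = 22, 60

def neg_log_likelihood(p):
    # l(p | Y, N) = Y log(p) + (N - Y) log(1 - p), negated for minimization
    return -(Y * np.log(p) + (N - Y) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, Y / N)   # both are approximately 0.367
```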
Given the actual data set and wanting to estimate the probability of obtaining heads, the answer given by the maximum likelihood machinery is \(p^{MLE} \approx 0.367\).
The beauty of this statistical machinery is that it can be applied to more complex examples. Let’s now make the link with models of mortality to conclude this post.
In demography, we commonly assume that deaths at age \(x\) (\(d_x\)) are Poisson distributed
\[d_x \sim Poisson(e_x \mu_x)\]
where \(e_x\) is the number of persons at risk of dying at age \(x\) and \(\mu_x\) is the force of mortality. Hence, from this assumption, the probability that the number of deaths at age \(x\) equals \(d_x\) can be expressed as follows:
\[P(deaths~at~age~x=d_x) = \frac{(e_x \mu_x)^{d_x}e^{-(e_x \mu_x)}}{d_x!}\]
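As a small numerical illustration of this probability, here is a short sketch in which the exposure, force of mortality, and death count are hypothetical values chosen only for the example:

```python
# A tiny numerical illustration of the Poisson probability above; the
# exposure, force of mortality and death count are hypothetical values.
from scipy.stats import poisson

e_x, mu_x = 10_000, 0.02              # hypothetical exposure and force of mortality at age x
d_x = 215                             # hypothetical observed death count
print(poisson.pmf(d_x, e_x * mu_x))   # P(deaths at age x = d_x)
```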
In general, mortality models assume a parametric form for \(\mu_x\).
How does the likelihood help us here? It allows us to estimate the parameters of these mortality models. Let’s repeat exactly the same steps as before. We already defined the distribution of the outcome of interest, deaths at age \(x\). This step corresponds to the point in the coin flip example where we said that the outcome of a coin flip had a Bernoulli distribution. As before, we are interested in all our data (deaths at all ages). We know that the likelihood is the product of the probabilities of the death counts at all ages (again assuming independence across ages). Thus, we can write the likelihood of our data as
\[L(\mu_x|d_x, e_x) = \prod^A_{x=0} \frac{(e_x \mu_x)^{d_x}e^{-(e_x \mu_x)}}{d_x!}\]
where \(A\) is the highest age considered in our data set. Up to this point, we have not defined any mortality model. Let’s do that! In the Gompertz model, the force of mortality is expressed as
\[\mu_x = ae^{bx}\] Hence, the likelihood is obtained by replacing \(\mu_x\) by the above expression in \(L(\mu_x|d_x, e_x)\):
\[L(a, b|d_x, e_x) = \prod^A_{x=0} \frac{(e_x ae^{bx})^{d_x}e^{-(e_x ae^{bx})}}{d_x!}\]
The only thing that remains to be done in order to obtain estimated values for \(a\) and \(b\) is to find the values that maximize the above likelihood. Unlike in the coin flip example, even after taking the logarithm of \(L(a, b|d_x, e_x)\), we cannot obtain an analytical solution for the values of \(a\) and \(b\) that maximize the log-likelihood. In this context, we have to use a numerical method such as the Iteratively Reweighted Least Squares (IRLS) algorithm, but this goes beyond the scope of this post.
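To give a flavour of what such a numerical maximization can look like, here is a minimal sketch that maximizes the Gompertz-Poisson log-likelihood with a general-purpose optimizer (Nelder-Mead) rather than IRLS; the age range, exposures, and "true" parameter values used to simulate the death counts are hypothetical choices, not data from the post:

```python
# A minimal sketch of maximizing the Gompertz-Poisson log-likelihood numerically.
# It uses a general-purpose optimizer instead of IRLS, and the deaths d_x and
# exposures e_x below are simulated, illustrative numbers (not from the post).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
ages = np.arange(0, 100)                     # ages x = 0, ..., A = 99
e_x = np.full(ages.shape, 10_000.0)          # hypothetical exposures at each age
a_true, b_true = 1e-4, 0.09                  # hypothetical Gompertz parameters
d_x = rng.poisson(e_x * a_true * np.exp(b_true * ages))   # simulated death counts

def neg_log_lik(theta):
    # Parameterize a on the log scale so it stays positive during optimization
    a, b = np.exp(theta[0]), theta[1]
    mu_x = a * np.exp(b * ages)              # Gompertz force of mortality
    # Poisson log-likelihood, dropping log(d_x!) which does not depend on (a, b)
    return -np.sum(d_x * np.log(e_x * mu_x) - e_x * mu_x)

res = minimize(neg_log_lik, x0=np.array([np.log(1e-3), 0.05]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x[0]), res.x[1]
print(a_hat, b_hat)                          # should be close to a_true and b_true
```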
The likelihood is a way to obtain estimates of a model’s parameters. Once we have assumed a distribution for our observations (e.g., Poisson) and a mathematical expression for our model (e.g., Gompertz), the likelihood can easily be derived given that the independence assumption is fulfilled. In practice, it frequently happens that no analytical solution is available when trying to maximize the likelihood. In such situations, we apply a numerical method (e.g., IRLS) to the log-likelihood to obtain estimates of the parameters of interest.