Chapter 4 Basics of Bayesian Estimation

In this chapter we introduce the basic concepts of Bayesian statistics. We start by defining Bayesian statistics and then introduce basic computational methods such as the Gibbs sampler and the Metropolis-Hastings algorithm. We use the linear regression model as the leading example for the study of these methods.

4.1 Basics

Consider two random variables \(A\) and \(B\). The rules of probability imply that

\[\begin{equation} p(A,B) = p(A|B)p(B) \end{equation}\] with \(p(A,B)\) as the joint probability of \(A\) and \(B\) occurring. The conditional probability of \(A\) given \(B\) is denoted by \(p(A|B)\). Reversing the roles of \(A\) and \(B\) we get \[\begin{equation} p(A,B)=p(B|A)p(A). \end{equation}\]

Equating these two expressions and rearranging, we obtain Bayes’ rule: \[\begin{equation} p(B|A)=\frac{p(A|B)p(B)}{p(A)}. \end{equation}\]
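As a quick numerical check of Bayes’ rule, suppose (purely illustrative numbers, not from the text) that \(p(A|B)=0.8\), \(p(B)=0.3\) and \(p(A)=0.5\). Then \(p(B|A)=0.8\cdot 0.3/0.5=0.48\), which the short snippet below reproduces.

# Illustrative numbers: plug into Bayes' rule p(B|A) = p(A|B) p(B) / p(A)
p_A_given_B <- 0.8
p_B         <- 0.3
p_A         <- 0.5

p_A_given_B * p_B / p_A
## [1] 0.48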

Bayesian statistics uses Bayes’ rule to infer the parameters of a model. Let \(y\) denote the data and \(\theta\) a vector of parameters for a model that seeks to explain \(y\). Replacing \(B\) with \(\theta\) and \(A\) with \(y\) we obtain

\[\begin{equation} p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)}. \end{equation}\] The fundamental object of interest for Bayesians is the posterior \(p(\theta|y)\): given the data \(y\), what can we say about the parameters \(\theta\)? Since we only want to learn about \(\theta\), we can ignore \(p(y)\), which does not involve \(\theta\). Hence, we can write

\[\begin{equation} p(\theta|y)\propto p(y|\theta)p(\theta). \end{equation}\]

We call \(p(\theta|y)\) the posterior density, \(p(y|\theta)\) the likelihood function and \(p(\theta)\) the prior density.

Hence, a Bayesian statistical model consists of three components: a prior distribution for the model parameters, a likelihood function, and a posterior that is proportional to the prior times the likelihood.

Since the prior is chosen by the researcher, it summarizes the information about \(\theta\) that is available before the data \(y\) are observed.

4.2 Gibbs Sampler

The Gibbs sampler is a powerful tool for posterior simulation which is used in many econometric models. We will motivate the basic idea in a very general context before turning to the Normal linear regression model with independent Normal-Gamma prior. Accordingly, let us temporarily adopt the general notation where \(\theta\) is a \(p\)-vector of parameters and \(p(y|\theta)\), \(p(\theta)\) and \(p(\theta|y)\) are the likelihood, prior and posterior, respectively. In the linear regression model \(p=k+1\) and \(\theta=(\beta^\intercal,h)^\intercal\). Furthermore, let \(\theta\) be partitioned into various blocks as \(\theta=(\theta^\intercal_{(1)},...,\theta^\intercal_{(B)})^\intercal\), where \(\theta_{(j)}\) is a scalar or vector, \(j=1,2,...,B\). In the linear regression model, it is convenient to set \(B=2\) with \(\theta_{(1)}=\beta\) and \(\theta_{(2)}=h\).

Monte Carlo integration involves taking random draws from \(p(\theta|y)\) and then averaging them to produce an estimate of \(E[g(\theta)|y]\) for any function of interest \(g(\theta)\). In many models, including the one discussed in the present chapter, it is not easy to draw directly from \(p(\theta|y)\). However, it often is easy to randomly draw from \(p(\theta_{(1)}|y,\theta_{(2)},...,\theta_{(B)})\), \(p(\theta_{(2)}|y,\theta_{(1)},\theta_{(3)},...,\theta_{(B)})\),…,\(p(\theta_{(B)}|y,\theta_{(1)},...,\theta_{(B-1)})\). The preceding distributions are referred to as full conditional posterior distributions, since they define a posterior for each block conditional on all the other blocks. In the Normal linear regression model with independent Normal-Gamma prior, \(p(\beta|y,h)\) is Normal and \(p(h|y,\beta)\) is Gamma, both of which are simple to draw from. It turns out that drawing from the full conditionals will yield a sequence \(\theta^{(1)},\theta^{(2)},...,\theta^{(s)}\) which can be averaged to produce estimates of \(E[g(\theta)|y]\) in the same way as Monte Carlo integration does.
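To fix ideas, the following minimal sketch illustrates Monte Carlo integration with illustrative numbers: if draws from the posterior were available, say \(\theta|y\sim N(1,2^2)\), and \(g(\theta)=\theta^2\), then averaging the draws of \(g(\theta)\) approximates \(E[g(\theta)|y]=1^2+2^2=5\).

set.seed(42)
# Pretend these are S draws from the posterior p(theta | y); here theta | y ~ N(1, 2^2)
theta_draws <- rnorm(10000, mean = 1, sd = 2)
# Monte Carlo estimate of E[g(theta) | y] for g(theta) = theta^2 (true value: 1^2 + 2^2 = 5)
mean(theta_draws^2)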

The problem with the Gibbs sampler is that it requires an initial draw, \(\theta^{(0)}_{(1)}\) or \(\theta^{(0)}_{(2)}\), to get started. After all, if we knew how to easily take random draws from \(p(\theta_{(2)}|y)\), we could use this and \(p(\theta_{(1)}|y,\theta_{(2)})\) to do Monte Carlo integration and would have no need for Gibbs sampling. However, it can be shown that, subject to weak conditions, the choice of the initial draw \(\theta^{(0)}_{(2)}\) does not matter, in the sense that the Gibbs sampler will converge to a sequence of draws from \(p(\theta|y)\). Hence, it is common to choose \(\theta^{(0)}_{(2)}\) in some manner and then run the Gibbs sampler for \(S\) replications. The first \(S_0\) of these are discarded as so-called burn-in replications and the remaining \(S_1\) are retained for the estimate of \(E[g(\theta)|y]\), where \(S_0+S_1=S\).

The preceding motivation for the Gibbs sampler was written for the case of two blocks, but can be extended in a simple manner to more blocks. Formally, the Gibbs sampler involves the following steps:

Step 0: Choose a starting value, \(\theta^{(0)}\).

For \(s=1,...,S\):

Step 1: Take a random draw, \(\theta^{(s)}_{(1)}\) from \(p\left(\theta_{(1)}|y,\theta^{(s-1)}_{(2)},...,\theta^{(s-1)}_{(B)}\right)\).

Step 2: Take a random draw, \(\theta^{(s)}_{(2)}\) from \(p\left(\theta_{(2)}|y,\theta^{(s)}_{(1)},\theta^{(s-1)}_{(3)}...,\theta^{(s-1)}_{(B)}\right)\).

\(\vdots\)

Step B: Take a random draw, \(\theta^{(s)}_{(B)}\) from \(p\left(\theta_{(B)}|y,\theta^{(s)}_{(1)},...,\theta^{(s)}_{(B-1)}\right)\).

Example: Bivariate-Normal Distribution

Consider the bivariate normal density \[\begin{equation} (X,Y)\sim \mathcal{N}\left(0,\left(\begin{array}{cc}1&\rho\\\rho&1\end{array}\right)\right) \end{equation}\] Now, for some starting value \(y_0\), the Gibbs sampler can be expressed as

  • Generate \(x_1\) using \(x_1|y_0\sim \mathcal{N}(\rho y_0,1-\rho^2)\)

  • Generate \(y_1\) using \(y_1|x_1\sim \mathcal{N}(\rho x_1,1-\rho^2)\)

  • Generate \(x_{n+1}\) using \(x_{n+1}|y_n\sim \mathcal{N}(\rho y_n,1-\rho^2)\)

  • Generate \(y_{n+1}\) using \(y_{n+1}|x_{n+1}\sim \mathcal{N}(\rho x_{n+1},1-\rho^2)\)

Obviously, in this example one could sample directly from the bivariate normal distribution; the Gibbs sampler is used here purely for illustration.
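The scheme above translates directly into R. The snippet below is a minimal sketch; the correlation \(\rho\), the number of draws and the starting value are illustrative choices.

set.seed(123)
rho <- 0.6          # illustrative correlation
S   <- 5000         # illustrative number of Gibbs replications
x <- y <- numeric(S)
y_current <- 0      # starting value y_0

for (s in 1:S) {
  # draw x_s | y_{s-1} ~ N(rho * y_{s-1}, 1 - rho^2)
  x[s] <- rnorm(1, mean = rho * y_current, sd = sqrt(1 - rho^2))
  # draw y_s | x_s ~ N(rho * x_s, 1 - rho^2)
  y[s] <- rnorm(1, mean = rho * x[s], sd = sqrt(1 - rho^2))
  y_current <- y[s]
}

# the sample correlation of the draws should be close to rho
cor(x, y)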

4.3 The Linear Regression Model

Suppose that there is a dependent variable \(y_i\) and \(k\) explanatory variables, \(x_{i1},...,x_{ik}\), for \(i=1,...,N\). Let the first explanatory variable be a constant, i.e. \(x_{i1}=1\) for all \(i=1,...,N\). The linear regression model can then be written as

\[\begin{equation} y_i=\beta_1+\beta_2x_{i2}+\dots+\beta_kx_{ik}+\epsilon_i. \end{equation}\]

These \(N\) equations can be written more compactly in matrix notation as

\[\begin{equation} y=X\beta+\epsilon \end{equation}\] with \[\begin{equation} y=\left[\begin{array}{c}y_1\\y_2\\ \vdots\\y_N\end{array}\right], \end{equation}\] \[\begin{equation} X=\left[\begin{array}{cccc}1&x_{12}&\dots & x_{1k}\\ 1&x_{22}&\dots & x_{2k}\\ \dots&\dots&\dots&\dots\\ 1&x_{N2}&\dots&x_{Nk}\end{array}\right], \end{equation}\] \[\begin{equation} \epsilon = \left[\begin{array}{c}\epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_N\end{array}\right] \end{equation}\] and \[\begin{equation} \beta = \left[\begin{array}{c}\beta_1\\\beta_2\\\vdots\\\beta_k\end{array}\right]. \end{equation}\]
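In R, \(y\) and \(X\) can be stacked in exactly this way. The snippet below is a minimal sketch using simulated data; the variable names and dimensions are illustrative.

set.seed(1)
N  <- 5
df <- data.frame(y = rnorm(N), x2 = rnorm(N), x3 = rnorm(N))

y <- df$y
X <- model.matrix(~ x2 + x3, data = df)  # first column is the constant x_{i1} = 1
X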

4.4 The likelihood function

The form of the likelihood function is determined by assumptions about \(\epsilon\) and \(X\):

  1. \(\epsilon\) has a multivariate normal distribution with mean \(0_N\) and covariance matrix \(\sigma^2I_N\), where \(0_N\) is an \(N\)-vector with all elements equal to \(0\) and \(I_N\) is the \(N\times N\) identity matrix. Defining \(h=\sigma^{-2}\), we can write this as \(N(0_N,h^{-1}I_N)\).

  2. Elements of \(X\) are either fixed or, if they are random variables, they are independent of \(\epsilon\).

The covariance matrix is given by

\[\begin{equation} var(\epsilon) = \left[\begin{array}{cccc}h^{-1}&0&\dots&0\\ 0&h^{-1}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&h^{-1}\end{array}\right] \end{equation}\]

Under these assumptions the likelihood function is

\[\begin{equation} p(y|\beta,h)= \frac{h^{N/2}}{(2\pi)^{N/2}}\left\{\exp\left[-\frac{h}{2}(y-X\beta)'(y-X\beta)\right]\right\} \end{equation}\]
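As a check on this formula, the log of the likelihood can be coded directly. The function below is a minimal sketch and is not needed for the estimation later on; the argument names are illustrative.

log_likelihood <- function(y, X, beta, h) {
  # log p(y | beta, h) = (N/2) log h - (N/2) log(2*pi) - (h/2) (y - X beta)'(y - X beta)
  N     <- length(y)
  resid <- y - X %*% beta
  (N / 2) * log(h) - (N / 2) * log(2 * pi) - (h / 2) * sum(resid^2)
}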

4.5 The prior

Given the likelihood function \(p(y|\beta,h)\) we now need priors for the parameters \(\theta=(\beta,h)\). In particular, we assume that \(p(\beta,h)=p(\beta)p(h)\), i.e. we assume independence between the priors for \(\beta\) and \(h\). Furthermore, we assume that \(p(\beta)\) is Normal and \(p(h)\) is Gamma:

\[\begin{equation} p(\beta)=\frac{1}{(2\pi)^{k/2}}|\underline V|^{-1/2}\exp\left[-\frac{1}{2}(\beta-\underline \beta)'\underline V^{-1}(\beta-\underline\beta)\right] \end{equation}\] and \[\begin{equation} p(h)=c_G^{-1}h^{\frac{\underline v-2}{2}}\exp\left(-\frac{h\underline v}{2\underline s^{-2}}\right) \end{equation}\] where \(c_G\) is the integrating constant for the Gamma pdf. The prior mean for \(\beta\) is \(\underline \beta\) and \(\underline V\) is the prior covariance matrix. For the prior of \(h\), \(\underline s^{-2}\) is the prior mean and \(\underline v\) the prior degrees of freedom.
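Note that a Gamma density written in terms of a mean \(\underline s^{-2}\) and degrees of freedom \(\underline v\), as above, corresponds to R’s rgamma/dgamma parametrization with shape \(\underline v/2\) and rate \(\underline v\,\underline s^{2}/2\). The snippet below checks this with illustrative hyperparameter values, not taken from the text.

set.seed(1)
prior_v  <- 4    # illustrative degrees of freedom (v underline)
prior_s2 <- 1    # illustrative s underline squared, so the prior mean of h is 1 / prior_s2 = 1

h_prior_draws <- rgamma(100000, shape = prior_v / 2, rate = prior_v * prior_s2 / 2)
mean(h_prior_draws)   # should be close to the prior mean s^{-2} = 1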

4.6 The posterior

The posterior \(p(\beta,h|y)\) is now

\[\begin{equation} p(\beta,h|y)\propto\left\{\exp\left[-\frac{1}{2}\left\{h(y-X\beta)'(y-X\beta)+(\beta-\underline\beta)'\underline V^{-1}(\beta-\underline\beta)\right\}\right]\right\}h^{\frac{N+\underline v-2}{2}}\exp\left[-\frac{h\underline v}{2\underline s^{-2}}\right] \end{equation}\]

This form of the posterior does not correspond to any well-known probability distribution.

If we treat this posterior as the joint density of \(\beta\) and \(h\), it does not take a convenient form. If we, however, take conditionals we can find simple forms of the posterior. That is, we can obtain \(p(\beta|y,h)\) from the joint posterior for a fixed value of \(h\). Using some matrix manipulations, we can find the key terms as: \[\begin{eqnarray*} &&h(y-X\beta)^\intercal(y-X\beta)+(\beta-\underline{\beta})^\intercal \underline{V}^{-1}(\beta-\underline{\beta})\\ &=&(\beta-\bar{\beta})^\intercal\bar{V}^{-1}(\beta-\bar{\beta})+Q \end{eqnarray*}\] where \[\begin{eqnarray} \bar{V}&=&(\underline{V}^{-1}+hX^\intercal X)^{-1}\\ \bar{\beta}&=&\bar{V}(\underline{V}^{-1}\underline{\beta}+hX^\intercal y) \end{eqnarray}\]

and \[\begin{equation} Q=hy^\intercal y+(\underline{\beta}^\intercal\underline{V}^{-1}\underline{\beta}-\bar{\beta}^\intercal\bar{V}^{-1}\bar{\beta}) \end{equation}\] Plugging this equation into the joint posterior and ignoring the terms that do not involve \(\beta\) (including \(Q\)), we can write \[\begin{equation} p(\beta|y,h)\propto\exp\left[-\frac{1}{2}(\beta-\bar{\beta})^\intercal\bar{V}^{-1}(\beta-\bar{\beta})\right] \end{equation}\] which is the kernel of a multivariate normal density. In other words, \[\begin{equation} \beta|y,h\sim N(\bar{\beta},\bar{V}) \end{equation}\] \(p(h|y,\beta)\) is obtained by treating the joint posterior as a function of \(h\). It can be seen that \[\begin{equation} p(h|y,\beta)\propto h^{\frac{N+\underline{v}-2}{2}}\exp\left[-\frac{h}{2}\left((y-X\beta)^\intercal(y-X\beta)+\underline{v}\underline{s}^2\right)\right] \end{equation}\]

By comparing this with the definition of the Gamma density it can be verified that \[\begin{equation} h|y,\beta\sim G(\bar{s}^{-2},\bar{v}) \end{equation}\] where \[\begin{equation} \bar{v}=N+\underline{v} \end{equation}\] and \[\begin{equation} \bar{s}^2=\frac{(y-X\beta)^\intercal(y-X\beta)+\underline{v}\underline{s}^2}{\bar{v}} \end{equation}\]
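Using the same shape/rate mapping as for the prior, a draw from \(G(\bar{s}^{-2},\bar{v})\) can be generated with rgamma with shape \(\bar{v}/2\) and rate \(\bar{v}\bar{s}^{2}/2\). A minimal sketch, with illustrative stand-in values for \(\bar{v}\) and \(\bar{s}^{2}\):

post_v  <- 510   # illustrative stand-in for v-bar
post_s2 <- 25    # illustrative stand-in for s-bar^2
h_draw  <- rgamma(1, shape = post_v / 2, rate = post_v * post_s2 / 2)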

Since \(p(\beta,h|y)\neq p(\beta|y,h)p(h|\beta,y)\), the conditional posteriors do not tell us everything about \(p(\beta,h|y)\). Nevertheless, there is a posterior simulator, the Gibbs sampler, which uses the conditional posteriors to produce random draws \(\beta^{(s)}\) and \(h^{(s)}\) for \(s=1,...,S\). These draws can be averaged to produce estimates of posterior properties, just as with Monte Carlo integration.

For this model, the Gibbs sampler therefore alternates between drawing \(\beta\) from \(N(\bar{\beta},\bar{V})\) for the current value of \(h\), and drawing \(h\) from \(G(\bar{s}^{-2},\bar{v})\) for the current value of \(\beta\).

4.7 Example

We now illustrate the Gibbs sampler with an example.

We use the Boston housing data set, which is available in the MASS package.

library(MASS)
data("Boston")

First, we code the two functions that draw from the conditional posteriors \(p(\beta|y,h)\) and \(p(h|y,\beta)\) derived above.

draw_posterior_beta <- function(x, y, prior_mean, prior_variance, post_h){
  
  # V-bar = (V-underline^{-1} + h X'X)^{-1}
  post_V <- solve(solve(prior_variance) + post_h * t(x) %*% x)
  # beta-bar = V-bar (V-underline^{-1} beta-underline + h X'y)
  post_mean <- post_V %*% (solve(prior_variance) %*% prior_mean + post_h * t(x) %*% y)
  
  # draw beta | y, h ~ N(beta-bar, V-bar)
  post_beta <- mvrnorm(1, mu = post_mean, Sigma = post_V)
  
  return(post_beta)
  
}

draw_post_h <- function(x, y, beta, prior_v, prior_s){
  
  # v-bar = N + v-underline
  post_v <- dim(x)[1] + prior_v
  # s-bar^2 = ((y - x beta)'(y - x beta) + v-underline * s-underline^2) / v-bar
  post_s <- as.numeric(t(y - x %*% beta) %*% (y - x %*% beta) + prior_v * prior_s) / post_v
  
  # h | y, beta ~ G(s-bar^{-2}, v-bar), i.e. Gamma with shape v-bar/2 and rate v-bar * s-bar^2 / 2
  post_h <- rgamma(1, shape = post_v / 2, rate = post_v * post_s / 2)
  return(post_h)
  
}

Now we look at a summary of the data.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
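Finally, the two draw functions can be combined into a Gibbs sampler of the form described in Section 4.2. The code below is a minimal sketch: the choice of dependent variable (medv) and regressors (lstat and rm), the prior hyperparameters, the starting value for \(h\) and the number of replications are all illustrative choices, not taken from the text.

# Dependent variable and regressors (illustrative choices) in matrix form
y <- Boston$medv
X <- cbind(1, Boston$lstat, Boston$rm)
k <- ncol(X)

# Illustrative, relatively uninformative prior hyperparameters
prior_mean     <- rep(0, k)      # beta underline
prior_variance <- diag(100, k)   # V underline
prior_v        <- 1              # v underline
prior_s        <- 1              # s underline squared

S  <- 11000   # total number of replications
S0 <- 1000    # burn-in replications to discard
beta_draws <- matrix(NA_real_, nrow = S, ncol = k)
h_draws    <- numeric(S)

set.seed(1)
h_current <- 1   # starting value h^(0)

for (s in 1:S) {
  # draw beta | y, h from N(beta-bar, V-bar), then h | y, beta from G(s-bar^{-2}, v-bar)
  beta_current    <- draw_posterior_beta(X, y, prior_mean, prior_variance, h_current)
  h_current       <- draw_post_h(X, y, beta_current, prior_v, prior_s)
  beta_draws[s, ] <- beta_current
  h_draws[s]      <- h_current
}

# Posterior means based on the retained draws
colMeans(beta_draws[-(1:S0), ])
mean(h_draws[-(1:S0)])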