# Chapter 4 Basics of Bayesian Estimation

In this chapter we introduce the basic concepts of Bayesian statistics. We start by defining Bayesian statistics and then introduce basic computational methods such as the Gibbs sampler and the Metropolis-Hastings algorithm. We use the linear regression model as the leading example for the study of these methods.

## 4.1 Basics

Consider two random variables \(A\) and \(B\). The rules of probability imply that

\[\begin{equation}
p(A,B) = p(A|B)p(B)
\end{equation}\]
with \(p(A,B)\) as the *joint probability* of \(A\) and \(B\) occurring. The conditional probability of \(A\) given \(B\) is denoted by \(p(A|B)\). Reversing the roles of \(A\) and \(B\) we get
\[\begin{equation}
p(A,B)=p(B|A)p(A)
\end{equation}\]

Equating these two expressions and rearranging, we get *Bayes’ rule*:
\[\begin{equation}
p(B|A)=\frac{p(A|B)p(B)}{p(A)}.
\end{equation}\]

Bayesian statistics uses Bayes’ rule to infer the parameters of the model. Let \(y\) be the data of the model and \(\theta\) be a vector of parameters for a model that seeks to explain \(y\). Replacing \(B\) with \(\theta\) and \(A\) with \(y\) we obtain

\[\begin{equation} p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} \end{equation}\] The fundamental object of interest for Bayesians is the posterior \(p(\theta|y)\), i.e. what we can say about the parameters \(\theta\) given the data \(y\). Since we only want to learn about \(\theta\), we can ignore \(p(y)\), because it does not involve \(\theta\). Hence, we can write

\[\begin{equation} p(\theta|y)\propto p(y|\theta)p(\theta). \end{equation}\]

We call \(p(\theta|y)\) the *posterior density*, \(p(y|\theta)\) the *likelihood function* and \(p(\theta)\) the *prior density*.

Hence, a Bayesian statistical model consists of three components: a prior on the model parameters, a likelihood function, and a posterior which is proportional to the prior times the likelihood.
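To make Bayes’ rule concrete, consider a small hypothetical example (not from the chapter): a Bernoulli success probability \(\theta\) restricted to three candidate values with a uniform prior, updated after observing 7 successes in 10 trials. A minimal sketch:

```python
from math import comb

# hypothetical example: theta is a Bernoulli success probability with
# three candidate values and a uniform prior p(theta)
thetas = [0.2, 0.5, 0.8]
prior = {t: 1 / 3 for t in thetas}
y, n = 7, 10                       # observed: 7 successes in 10 trials

def likelihood(theta):
    # binomial likelihood p(y | theta)
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

# posterior is proportional to likelihood times prior;
# dividing by p(y) = sum over theta normalizes it
p_y = sum(likelihood(t) * prior[t] for t in thetas)
posterior = {t: likelihood(t) * prior[t] / p_y for t in thetas}

for t in thetas:
    print(t, round(posterior[t], 3))
```

The data shift the mass toward \(\theta=0.8\), the value under which 7 successes in 10 trials are most likely.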

Since the prior is chosen by the researcher, it is the channel through which non-data information enters the analysis.

## 4.2 Gibbs Sampler

The Gibbs sampler is a powerful tool for posterior simulation which is used in many econometric models. We will motivate the basic idea in a very general context before returning to the Normal linear regression model with independent Normal-Gamma prior. Accordingly, let us temporarily adopt the general notation where \(\theta\) is a \(p\)-vector of parameters and \(p(y|\theta)\), \(p(\theta)\) and \(p(\theta|y)\) are the likelihood, prior and posterior, respectively. In the linear regression model \(p=k+1\) and \(\theta=(\beta^\intercal,h)^\intercal\). Furthermore, let \(\theta\) be partitioned into various *blocks* as \(\theta=(\theta^\intercal_{(1)},...,\theta^\intercal_{(B)})^\intercal\), where \(\theta_{(j)}\) is a scalar or vector, \(j=1,2,...,B\). In the linear regression model, it is convenient to set \(B=2\) with \(\theta_{(1)}=\beta\) and \(\theta_{(2)}=h\).

Monte Carlo integration involves taking random draws from \(p(\theta|y)\) and then averaging them to produce an estimate of \(E[g(\theta)|y]\) for any function of interest \(g(\theta)\). In many models, including the one discussed in the present chapter, it is not easy to draw directly from \(p(\theta|y)\). However, it often is easy to randomly draw from \(p(\theta_{(1)}|y,\theta_{(2)},...,\theta_{(B)})\), \(p(\theta_{(2)}|y,\theta_{(1)},\theta_{(3)},...,\theta_{(B)})\),…,\(p(\theta_{(B)}|y,\theta_{(1)},...,\theta_{(B-1)})\). The preceding distributions are referred to as *full conditional posterior distributions*, since they define a posterior for each block conditional on all the other blocks. In the Normal linear regression model with independent Normal-Gamma prior, \(p(\beta|y,h)\) is Normal and \(p(h|y,\beta)\) is Gamma, both of which are simple to draw from. It turns out that drawing from the full conditionals will yield a sequence \(\theta^{(1)},\theta^{(2)},...,\theta^{(s)}\) which can be averaged to produce estimates of \(E[g(\theta)|y]\) in the same way as Monte Carlo integration.

A practical issue with the Gibbs sampler is the choice of the initial draw, e.g. \(\theta^{(0)}_{(2)}\), which cannot itself come from the posterior. After all, if we knew how to easily take random draws from \(p(\theta_{(2)}|y)\), we could use this and \(p(\theta_{(1)}|y,\theta_{(2)})\) to do Monte Carlo integration and would have no need for Gibbs sampling. However, it can be shown that, subject to weak conditions, the initial draw \(\theta^{(0)}_{(2)}\) does not matter in the sense that the Gibbs sampler will converge to a sequence of draws from \(p(\theta|y)\). Hence, it is common to choose \(\theta^{(0)}_{(2)}\) in some manner, run the Gibbs sampler for \(S\) replications, discard the first \(S_0\) of these as so-called *burn-in replications*, and retain the remaining \(S_1\) for the estimate of \(E[g(\theta)|y]\), where \(S_0+S_1=S\).

The preceding motivation for the Gibbs sampler was written for the case of two blocks, but can be extended in a simple manner to more blocks. Formally, the Gibbs sampler involves the following steps:

*Step 0:* Choose a starting value, \(\theta^{(0)}\).

For \(s=1,...,S\):

*Step 1:* Take a random draw, \(\theta^{(s)}_{(1)}\) from \(p\left(\theta_{(1)}|y,\theta^{(s-1)}_{(2)},...,\theta^{(s-1)}_{(B)}\right)\).

*Step 2:* Take a random draw, \(\theta^{(s)}_{(2)}\) from \(p\left(\theta_{(2)}|y,\theta^{(s)}_{(1)},\theta^{(s-1)}_{(3)}...,\theta^{(s-1)}_{(B)}\right)\).

*Step B:* Take a random draw, \(\theta^{(s)}_{(B)}\) from \(p\left(\theta_{(B)}|y,\theta^{(s)}_{(1)},...,\theta^{(s)}_{(B-1)}\right)\).

**Example: Bivariate-Normal Distribution**

Consider the bivariate normal density \[\begin{equation} (X,Y)\sim \mathcal{N}\left(0,\left(\begin{array}{cc}1&\rho\\\rho&1\end{array}\right)\right) \end{equation}\] Now, for some starting value \(y_0\), the Gibbs sampler can be expressed as follows:

Generate \(x_1\) using \(x_1|y_0\sim \mathcal{N}(\rho y_0,1-\rho^2)\)

Generate \(y_1\) using \(y_1|x_1\sim \mathcal{N}(\rho x_1,1-\rho^2)\)

Generate \(x_{n+1}\) using \(x_{n+1}|y_n\sim \mathcal{N}(\rho y_n,1-\rho^2)\)

Generate \(y_{n+1}\) using \(y_{n+1}|x_{n+1}\sim \mathcal{N}(\rho x_{n+1},1-\rho^2)\)

Obviously, in this simple example one could sample directly from the bivariate normal distribution; the Gibbs sampler merely serves as an illustration.
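The scheme above can be sketched in a few self-contained lines (shown here in Python rather than the R used later in this chapter; all names are ours):

```python
import random

def gibbs_bivariate_normal(rho, n_draws, burn_in, seed=0):
    """Gibbs sampler for (X, Y) ~ N(0, [[1, rho], [rho, 1]])."""
    rng = random.Random(seed)
    sd = (1 - rho ** 2) ** 0.5       # conditional std. dev. sqrt(1 - rho^2)
    x, y = 0.0, 0.0                  # arbitrary starting value y_0
    draws = []
    for s in range(n_draws + burn_in):
        x = rng.gauss(rho * y, sd)   # x_{n+1} | y_n     ~ N(rho*y_n, 1-rho^2)
        y = rng.gauss(rho * x, sd)   # y_{n+1} | x_{n+1} ~ N(rho*x_{n+1}, 1-rho^2)
        if s >= burn_in:             # discard burn-in replications
            draws.append((x, y))
    return draws

draws = gibbs_bivariate_normal(rho=0.6, n_draws=20000, burn_in=1000)
xs, ys = zip(*draws)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
corr = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        / (sum((a - mx) ** 2 for a in xs)
           * sum((b - my) ** 2 for b in ys)) ** 0.5)
print(round(corr, 2))  # close to rho = 0.6
```

Averaging the retained draws recovers the moments of the target distribution: the sample correlation is close to \(\rho\) even though every draw came from a univariate normal.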

## 4.3 The Linear Regression Model

Suppose that there is a dependent variable \(y_i\) and \(k\) explanatory variables, \(x_{i1},...,x_{ik}\), for \(i=1,...,N\). Let the first explanatory variable be a constant, i.e. \(x_{i1}=1\) for all \(i=1,...,N\). The linear regression model can then be written as

\[\begin{equation} y_i=\beta_1+\beta_2x_{i2}+\dots+\beta_kx_{ik}+\epsilon_i. \end{equation}\]

This can be written more compactly in matrix notation as

\[\begin{equation} y=X\beta+\epsilon \end{equation}\] with \[\begin{equation} y=\left[\begin{array}{c}y_1\\y_2\\ \vdots\\y_N\end{array}\right], \end{equation}\] \[\begin{equation} X=\left[\begin{array}{cccc}1&x_{12}&\dots & x_{1k}\\ 1&x_{22}&\dots & x_{2k}\\ \vdots&\vdots&\ddots&\vdots\\ 1&x_{N2}&\dots&x_{Nk}\end{array}\right], \end{equation}\] \[\begin{equation} \epsilon = \left[\begin{array}{c}\epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_N\end{array}\right] \end{equation}\] and \[\begin{equation} \beta = \left[\begin{array}{c}\beta_1\\\beta_2\\\vdots\\\beta_k\end{array}\right]. \end{equation}\]

## 4.4 The likelihood function

The form of the likelihood function is determined by assumptions about \(\epsilon\) and \(X\):

\(\epsilon\) has a multivariate normal distribution with mean \(0_N\) and covariance matrix \(\sigma^2I_N\), where \(0_N\) is an \(N\)-vector with all elements equal to \(0\) and \(I_N\) is the \(N\times N\) identity matrix. Defining \(h=\sigma^{-2}\), we can write this as \(\epsilon\sim N(0_N,h^{-1}I_N)\).

Elements of \(X\) are either fixed or, if they are random variables, they are independent of \(\epsilon\).

The covariance matrix is given by

\[\begin{equation} var(\epsilon) = \left[\begin{array}{cccc}h^{-1}&0&\dots&0\\ 0&h^{-1}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&h^{-1}\end{array}\right] \end{equation}\]

Under these assumptions the likelihood function is

\[\begin{equation} p(y|\beta,h)= \frac{h^{N/2}}{(2\pi)^{N/2}}\left\{\exp\left[-\frac{h}{2}(y-X\beta)'(y-X\beta)\right]\right\} \end{equation}\]
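As a small numerical sketch (names are ours, not part of the chapter's code), the log of this likelihood can be evaluated directly from the formula:

```python
import math

def log_likelihood(y, X, beta, h):
    """Log of p(y | beta, h) for the Normal linear regression model:
    (N/2) log h - (N/2) log(2*pi) - (h/2) (y - X beta)'(y - X beta)."""
    N = len(y)
    residuals = [yi - sum(xij * bj for xij, bj in zip(xi, beta))
                 for yi, xi in zip(y, X)]
    sse = sum(e * e for e in residuals)   # (y - X beta)'(y - X beta)
    return ((N / 2) * math.log(h)
            - (N / 2) * math.log(2 * math.pi)
            - (h / 2) * sse)

# toy data: two observations, an intercept plus one regressor
y = [1.0, 2.0]
X = [[1.0, 0.5], [1.0, 1.5]]
ll = log_likelihood(y, X, beta=[0.2, 1.0], h=4.0)
```

Working in logs avoids the numerical underflow that the product form produces for large \(N\).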

## 4.5 The prior

Given the likelihood function \(p(y|\beta,h)\) we now need priors for the parameters \(\theta=(\beta,h)\). In particular, we assume that \(p(\beta,h)=p(\beta)p(h)\), i.e. we assume independence between the priors for \(\beta\) and \(h\). Furthermore, we assume that \(p(\beta)\) is Normal and \(p(h)\) is Gamma:

\[\begin{equation} p(\beta)=\frac{1}{(2\pi)^{k/2}}|\underline V|^{-1/2}\exp\left[-\frac{1}{2}(\beta-\underline \beta)'\underline V^{-1}(\beta-\underline\beta)\right] \end{equation}\] and \[\begin{equation} p(h)=c_G^{-1}h^{\frac{\underline v-2}{2}}\exp\left(-\frac{h\underline v}{2\underline s^{-2}}\right) \end{equation}\] where \(c_G\) is the integrating constant for the Gamma pdf. Here the prior mean for \(\beta\) is \(\underline \beta\) and \(\underline V\) is the prior covariance matrix. For the prior of \(h\), \(\underline s^{-2}\) is the prior mean and \(\underline v\) the prior degrees of freedom.
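The Gamma prior above is written in mean/degrees-of-freedom form, \(G(\underline s^{-2},\underline v)\). Matching its kernel \(h^{(\underline v-2)/2}\exp(-h\underline v\underline s^2/2)\) to the standard shape/scale parameterization gives shape \(\underline v/2\) and scale \(2\underline s^{-2}/\underline v\). A quick sketch (illustrative values) confirming the mapping:

```python
import random

def draw_gamma_mean_dof(mu, v, rng):
    """Draw from a Gamma in mean/degrees-of-freedom form G(mu, v).

    Standard form: shape = v/2, scale = 2*mu/v, so shape*scale = mu.
    """
    return rng.gammavariate(v / 2, 2 * mu / v)

rng = random.Random(1)
draws = [draw_gamma_mean_dof(mu=2.0, v=10.0, rng=rng) for _ in range(50000)]
mean_draw = sum(draws) / len(draws)
print(round(mean_draw, 1))  # close to mu = 2.0
```

Keeping this mapping straight matters later, since random number generators (e.g. R's `rgamma`) expect the shape/rate or shape/scale form.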

## 4.6 The posterior

The posterior \(p(\beta,h|y)\) is now

\[\begin{equation} p(\beta,h|y)\propto\left\{\exp\left[-\frac{1}{2}\left\{h(y-X\beta)'(y-X\beta)+(\beta-\underline\beta)'\underline V^{-1}(\beta-\underline\beta)\right\}\right]\right\}h^{\frac{N+\underline v-2}{2}}\exp\left[-\frac{h\underline v}{2\underline s^{-2}}\right] \end{equation}\]

This form of the posterior does not relate to any known probability distribution.

If we treat this posterior as the joint density of \(\beta\) and \(h\), it does not take a convenient form. If we, however, take conditionals we can find simple forms of the posterior. That is, we can obtain \(p(\beta|y,h)\) from the joint posterior for a fixed value of \(h\). Using some matrix manipulations, we can find the key terms as: \[\begin{eqnarray*} &&h(y-X\beta)^\intercal(y-X\beta)+(\beta-\underline{\beta})^\intercal \underline{V}^{-1}(\beta-\underline{\beta})\\ &=&(\beta-\bar{\beta})^\intercal\bar{V}^{-1}(\beta-\bar{\beta})+Q \end{eqnarray*}\] where \[\begin{eqnarray} \bar{V}&=&(\underline{V}^{-1}+hX^\intercal X)^{-1}\\ \bar{\beta}&=&\bar{V}(\underline{V}^{-1}\underline{\beta}+hX^\intercal y) \end{eqnarray}\]

and \[\begin{equation} Q=hy^\intercal y+(\underline{\beta}^\intercal\underline{V}^{-1}\underline{\beta}-\bar{\beta}^\intercal\bar{V}^{-1}\bar{\beta}) \end{equation}\] Plugging this equation into the joint posterior and ignoring the terms that do not involve \(\beta\) (including \(Q\)), we can write \[\begin{equation} p(\beta|y,h)\propto\exp\left[-\frac{1}{2}(\beta-\bar{\beta})^\intercal\bar{V}^{-1}(\beta-\bar{\beta})\right] \end{equation}\] which is the kernel of a multivariate normal density. In other words, \[\begin{equation} \beta|y,h\sim N(\bar{\beta},\bar{V}) \end{equation}\] \(p(h|y,\beta)\) is obtained by treating the joint posterior as a function of \(h\). It can be seen that \[\begin{equation} p(h|y,\beta)\propto h^{\frac{N+\underline{v}-2}{2}}\exp\left[-\frac{h}{2}\left((y-X\beta)^\intercal(y-X\beta)+\underline{v}\underline{s}^2\right)\right] \end{equation}\]

By comparing this with the definition of the Gamma density it can be verified that \[\begin{equation} h|y,\beta\sim G(\bar{s}^{-2},\bar{v}) \end{equation}\] where \[\begin{equation} \bar{v}=N+\underline{v} \end{equation}\] and \[\begin{equation} \bar{s}^2=\frac{(y-X\beta)^\intercal(y-X\beta)+\underline{v}\underline{s}^2}{\bar{v}} \end{equation}\]
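The completing-the-square step behind \(p(\beta|y,h)\) can be spot-checked numerically. A sketch with \(k=1\) regressor, so every quantity is a scalar (all values hypothetical):

```python
import random

# spot-check of: h(y-Xb)'(y-Xb) + (b - b_prior)^2 / V_prior
#              = (b - b_bar)^2 / V_bar + Q   for any b
rng = random.Random(42)
h, V_prior, b_prior = 2.0, 4.0, 0.5
y = [rng.gauss(0, 1) for _ in range(5)]
x = [rng.gauss(0, 1) for _ in range(5)]

V_bar = 1 / (1 / V_prior + h * sum(xi * xi for xi in x))               # \bar V
b_bar = V_bar * (b_prior / V_prior
                 + h * sum(xi * yi for xi, yi in zip(x, y)))           # \bar beta
Q = (h * sum(yi * yi for yi in y)
     + b_prior ** 2 / V_prior - b_bar ** 2 / V_bar)

def lhs(beta):
    return (h * sum((yi - xi * beta) ** 2 for xi, yi in zip(x, y))
            + (beta - b_prior) ** 2 / V_prior)

def rhs(beta):
    return (beta - b_bar) ** 2 / V_bar + Q

max_gap = max(abs(lhs(b) - rhs(b)) for b in [-1.0, 0.3, 2.5])
print(max_gap < 1e-9)  # True: both sides agree for any beta
```

Since \(Q\) does not involve \(\beta\), only the first term on the right matters for \(p(\beta|y,h)\), which is why the conditional is Normal.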

Since \(p(\beta,h|y)\neq p(\beta|y,h)p(h|y,\beta)\), the conditional posteriors do not directly give us the joint posterior \(p(\beta,h|y)\). Nevertheless, there is a posterior simulator, called the *Gibbs sampler*, which uses the conditional posteriors to produce random draws \(\beta^{(s)}\) and \(h^{(s)}\) for \(s=1,...,S\); these can be averaged to produce estimates of posterior properties just as with Monte Carlo integration.

**Gibbs Sampler**

For this model the Gibbs sampler alternates between the two full conditionals: given the current draw of \(h\), draw \(\beta\) from \(N(\bar\beta,\bar V)\); given the current draw of \(\beta\), draw \(h\) from \(G(\bar s^{-2},\bar v)\).

## 4.7 Example

We now illustrate the Gibbs sampler with an example based on the Boston housing data set.

First, we code functions that draw from the two full conditional posteriors:

```
library(MASS)  # for mvrnorm

# Draw beta from its full conditional N(bar_beta, bar_V) given h
draw_posterior_beta <- function(x, y, prior_mean, prior_variance, post_h){
  post_V <- solve(solve(prior_variance) + post_h * t(x) %*% x)
  post_mean <- post_V %*% (solve(prior_variance) %*% prior_mean + post_h * t(x) %*% y)
  post_beta <- mvrnorm(1, mu = post_mean, Sigma = post_V)
  return(post_beta)
}

# Draw h from its full conditional G(bar_s^{-2}, bar_v) given beta;
# in R's shape/rate parameterization this is
# shape = bar_v / 2 and rate = bar_v * bar_s^2 / 2
draw_post_h <- function(x, y, beta, prior_v, prior_s2){
  post_v <- dim(x)[1] + prior_v
  post_s2 <- (t(y - x %*% beta) %*% (y - x %*% beta) + prior_v * prior_s2) / post_v
  post_h <- rgamma(1, shape = post_v / 2, rate = post_v * post_s2 / 2)
  return(post_h)
}
```
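The same two-block sampler can also be sketched outside R. The following self-contained Python version (restricted to \(k=2\) regressors so the \(2\times 2\) linear algebra can be hand-coded; data, priors, and names are illustrative) recovers known parameters from simulated data:

```python
import random

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def chol2(m):
    """Lower-triangular Cholesky factor of a 2x2 covariance matrix."""
    l11 = m[0][0] ** 0.5
    l21 = m[1][0] / l11
    return [[l11, 0.0], [l21, (m[1][1] - l21 ** 2) ** 0.5]]

def gibbs_regression(y, X, b_prior, Vinv_prior, v_prior, s2_prior,
                     n_draws=3000, burn_in=500, seed=0):
    """Two-block Gibbs sampler for the Normal linear regression model
    with independent Normal-Gamma prior (k = 2 regressors)."""
    rng = random.Random(seed)
    N = len(y)
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]
    Vinv_b = matvec(Vinv_prior, b_prior)
    h = 1.0                                    # starting value for h
    betas = []
    for s in range(n_draws + burn_in):
        # Step 1: beta | y, h ~ N(bar_beta, bar_V)
        V_bar = inv2([[Vinv_prior[i][j] + h * XtX[i][j] for j in range(2)]
                      for i in range(2)])
        b_bar = matvec(V_bar, [Vinv_b[i] + h * Xty[i] for i in range(2)])
        L = chol2(V_bar)
        z = [rng.gauss(0, 1), rng.gauss(0, 1)]
        beta = [b_bar[i] + L[i][0] * z[0] + L[i][1] * z[1] for i in range(2)]
        # Step 2: h | y, beta ~ G(bar_s^{-2}, bar_v),
        # i.e. shape = bar_v/2, scale = 2/(bar_v * bar_s^2)
        sse = sum((yi - beta[0] * r[0] - beta[1] * r[1]) ** 2
                  for yi, r in zip(y, X))
        v_bar = N + v_prior
        s2_bar = (sse + v_prior * s2_prior) / v_bar
        h = rng.gammavariate(v_bar / 2, 2 / (v_bar * s2_bar))
        if s >= burn_in:                       # discard burn-in replications
            betas.append(beta)
    return betas

# simulate data with known beta = (1, 2) and sigma = 0.5
rng = random.Random(123)
X = [[1.0, rng.uniform(0, 3)] for _ in range(200)]
y = [1.0 + 2.0 * r[1] + rng.gauss(0, 0.5) for r in X]

betas = gibbs_regression(y, X, b_prior=[0.0, 0.0],
                         Vinv_prior=[[0.01, 0.0], [0.0, 0.01]],
                         v_prior=2.0, s2_prior=1.0)
post_mean = [sum(b[i] for b in betas) / len(betas) for i in range(2)]
print([round(m, 1) for m in post_mean])  # close to the true (1, 2)
```

With a weak prior the posterior means of \(\beta\) land close to the values used to simulate the data, which is a useful sanity check before running the sampler on real data.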

Now we look at a summary of the data:

```
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
```