Script 1 - Bernoulli - Beta

Author

Roberto Ascari

Posterior distribution:

In this section, we will explore the Bernoulli-Beta model applied to the cardiac dataset. Specifically, we assume that \(Y_i | \theta \sim Bernoulli(\theta)\) and that \(\theta \sim Beta(a, b)\). With these choices, we showed that the posterior distribution is \[\theta | \textbf{y} \sim Beta\left(a + \sum_{i=1}^n y_i, b + n - \sum_{i=1}^n y_i\right).\]

We will examine four different scenarios, each corresponding to a distinct choice of hyperparameters for the Beta prior.

cardiac <- read.table("data/cardiac.csv", header=T, sep=";")
str(cardiac)
'data.frame':   100 obs. of  2 variables:
 $ Age: int  20 23 24 25 25 26 26 28 28 29 ...
 $ Chd: int  0 0 0 0 1 0 0 0 0 0 ...
y <- cardiac$Chd
table(y)
y
 0  1 
57 43 
sum(y)
[1] 43
cardiac <- read.table("data/cardiac.csv", header=T, sep=";")
str(cardiac)
'data.frame':   100 obs. of  2 variables:
 $ Age: int  20 23 24 25 25 26 26 28 28 29 ...
 $ Chd: int  0 0 0 0 1 0 0 0 0 0 ...
y <- cardiac$Chd
table(y)
y
 0  1 
57 43 
sum(y)
[1] 43
length(y)
[1] 100
sum(y)
[1] 43
length(y)
[1] 100

These are the four scenarios we consider:

  • A: \(\theta \sim Beta(1,1)\);

  • B: \(\theta \sim Beta(10,10)\);

  • C: \(\theta \sim Beta(10,5)\);

  • D: \(\theta \sim Beta(5,10)\).

par(mfrow=c(2,2))
curve(dbeta(x, 1, 1), main="Scenario A", xlab=expression(theta), ylab=expression(pi(theta)))
curve(dbeta(x, 10, 10), main="Scenario B", xlab=expression(theta), ylab=expression(pi(theta)))
curve(dbeta(x, 10, 5), main="Scenario C", xlab=expression(theta), ylab=expression(pi(theta)))
curve(dbeta(x, 5, 10), main="Scenario D", xlab=expression(theta), ylab=expression(pi(theta)))

par(mfrow=c(1,1))

We further define the L_binom function, which returns the value of the likelihood function evaluated in a specific theta point.

L_Binom <- function(y, theta){
  s <- sum(y)
  n <- length(y)
  L <- theta^s * (1-theta)^(n-s)
  return(L)
}

L_Binom(y, theta=0.001)
[1] 9.445671e-130

Obviously, the likelihood function does not depend on the prior distribution we impose on \(\theta\). In this case, the MLE corresponds to the sample mean (blue dashed line).

curve(L_Binom(y, theta=x), main="Likelihood",
      xlab=expression(theta), lwd=2)
abline(v=mean(y), col="#0072B2", lty="dashed")

Scenario A:

In this scenario, the prior distribution coincides with a uniform distribution on the \((0,1)\) interval. Thus, the prior mean (red dashed line) is equal to 0.5.

a1 <- 1; b1 <- 1

curve(dbeta(x, a1, b1), main="Prior A", ylim=c(0,1.4),
      xlab=expression(theta), lwd=2)
abline(v=a1/(a1+b1), col="#D55E00", lty="dashed")

The posterior distribution is \(\theta | \textbf{y} \sim Beta(44, 58)\).

n <- length(y)
a1.post <- a1 + sum(y)
b1.post <- b1 + n - sum(y)

curve(dbeta(x, a1.post, b1.post), main="Scenario A",
      xlab=expression(theta), lwd=2)
curve(dbeta(x, a1, b1), lty="dashed", add=T, lwd=2)
abline(v=a1/(a1+b1), col="#D55E00", lty="dashed", lwd=2)
abline(v=mean(y), col="#009E73", lty="dotted", lwd=2)
abline(v=a1.post/(a1.post+b1.post), col="#0072B2", lty="dashed", lwd=2)
legend(.6,8, c("Prior", "Posterior", "Prior Mean", "MLE", "Post. Mean"), lwd=2,
       col=c("black", "black", "#D55E00", "#009E73", "#0072B2"), lty=c(2,1,2,3,2), bty="n")

The chosen prior distribution does not favor any particular value of \(\theta\). Thus, we selected a non-informative prior. As a result, the posterior distribution is centered around the maximum likelihood estimate (MLE).

Scenario B:

In this scenario, we use a prior distribution for \(\theta\) that is still symmetric (meaning it treats values below and above 0.5 in the same way). However, unlike the uniform prior, it is not flat: it expresses a preference for values closer to 0.5.

a2 <- 10; b2 <- 10

curve(dbeta(x, a2, b2), main="Prior B",
      xlab=expression(theta), lwd=2)
abline(v=a2/(a2+b2), col="#D55E00", lty="dashed")

n <- length(y)
a2.post <- a2 + sum(y)
b2.post <- b2 + n - sum(y)

curve(dbeta(x, a2.post, b2.post), main="Scenario B",
      xlab=expression(theta), lwd=2)
curve(dbeta(x, a2, b2), lty="dashed", add=T, lwd=2)
abline(v=a2/(a2+b2), col="#D55E00", lty="dashed", lwd=2)
abline(v=mean(y), col="#009E73", lty="dotted", lwd=2)
abline(v=a2.post/(a2.post+b2.post), col="#0072B2", lty="dashed", lwd=2)
legend(.6,8, c("Prior", "Posterior", "Prior Mean", "MLE", "Post. Mean"), lwd=2,
       col=c("black", "black", "#D55E00", "#009E73", "#0072B2"), lty=c(2,1,2,3,2), bty="n")

Since we are still considering a symmetric prior, the prior mean is equal to 0.5. However, unlike the uniform prior, it does not assign the same probability density to every value or interval of \(\theta\). As a consequence, the posterior mean becomes a weighted average between the prior mean and the MLE.

Scenario C & D:

Scenarios C and D consider asymmetric prior distributions.

a3 <- 10; b3 <- 5

curve(dbeta(x, a3, b3), main="Prior C",
      xlab=expression(theta), lwd=2)
abline(v=a3/(a3+b3), col="#D55E00", lty="dashed")

n <- length(y)
a3.post <- a3 + sum(y)
b3.post <- b3 + n - sum(y)

curve(dbeta(x, a3.post, b3.post), main="Scenario C",
      xlab=expression(theta), lwd=2)
curve(dbeta(x, a3, b3), lty="dashed", add=T, lwd=2)
abline(v=a3/(a3+b3), col="#D55E00", lty="dashed", lwd=2)
abline(v=mean(y), col="#009E73", lty="dotted", lwd=2)
abline(v=a3.post/(a3.post+b3.post), col="#0072B2", lty="dashed", lwd=2)
legend(.7,8, c("Prior", "Posterior", "Prior Mean", "MLE", "Post. Mean"), lwd=2,
       col=c("black", "black", "#D55E00", "#009E73", "#0072B2"), lty=c(2,1,2,3,2), bty="n")

a4 <- 5; b4 <- 10

curve(dbeta(x, a4, b4), main="Prior D",
      xlab=expression(theta), lwd=2)
abline(v=a4/(a4+b4), col="#D55E00", lty="dashed")

n <- length(y)
a4.post <- a4 + sum(y)
b4.post <- b4 + n - sum(y)

curve(dbeta(x, a4.post, b4.post), main="Scenario D",
      xlab=expression(theta), lwd=2)
curve(dbeta(x, a4, b4), lty="dashed", add=T, lwd=2)
abline(v=a4/(a4+b4), col="#D55E00", lty="dashed", lwd=2)
abline(v=mean(y), col="#009E73", lty="dotted", lwd=2)
abline(v=a4.post/(a4.post+b4.post), col="#0072B2", lty="dashed", lwd=2)
legend(.6,8, c("Prior", "Posterior", "Prior Mean", "MLE", "Post. Mean"), lwd=2,
       col=c("black", "black", "#D55E00", "#009E73", "#0072B2"), lty=c(2,1,2,3,2), bty="n")

Point Estimates:

We may summarize the (prior and) posterior distribution by some measures:

Beta.Exp <- function(a, b) a/(a+b)
Beta.Var <- function(a, b)(a*b)/((a+b)^2*(a+b+1))
Beta.Mode <- function(a, b){
  ifelse(a>1 & b>1, (a-1)/(a+b-2), NA)
}
Beta.Median <- function(a, b) qbeta(.5, a, b)
measures <- matrix(NA, ncol = 6, nrow = 4)
rownames(measures) <- c("A", "B", "C", "D")
colnames(measures) <- c("a.post", "b.post", "Exp", "Mode", "Median", "Var")

measures[,1] <- c(a1.post, a2.post, a3.post, a4.post)
measures[,2] <- c(b1.post, b2.post, b3.post, b4.post)

for(i in 1:4){
  a <- measures[i,1]
  b <- measures[i,2]
  
  measures[i,3] <- Beta.Exp(a, b)
  measures[i,4] <- Beta.Mode(a, b)
  measures[i,5] <- Beta.Median(a, b)
  measures[i,6] <- Beta.Var(a, b)
}

round(measures,4)
  a.post b.post    Exp   Mode Median    Var
A     44     58 0.4314 0.4300 0.4309 0.0024
B     53     67 0.4417 0.4407 0.4413 0.0020
C     53     62 0.4609 0.4602 0.4606 0.0021
D     48     67 0.4174 0.4159 0.4169 0.0021

In this simple setting, even when using different priors, the resulting posterior distributions do not lead to substantially different point estimates. This happens because the empirical evidence (i.e., the information provided by the sample y) is much stronger than the prior beliefs.

To better illustrate the impact of a strong prior, we now consider an additional scenario where we strongly believe that most people do not have cardiovascular disease.

a5 <- 100; b5 <- 1000

curve(dbeta(x, a5, b5), main="Prior E",
      xlab=expression(theta))
abline(v=a5/(a5+b5), col="#D55E00", lty="dashed")

n <- length(y)
a5.post <- a5 + sum(y)
b5.post <- b5 + n - sum(y)

curve(dbeta(x, a5.post, b5.post), main="Scenario E",
      xlab=expression(theta), lwd=2)
curve(dbeta(x, a5, b5), lty="dashed", add=T, lwd=2)
abline(v=a5/(a5+b5), col="#D55E00", lty="dashed", lwd=2)
abline(v=mean(y), col="#009E73", lty="dotted", lwd=2)
abline(v=a5.post/(a5.post+b5.post), col="#0072B2", lty="dashed", lwd=2)
legend(.6,40, c("Prior", "Posterior", "Prior Mean", "MLE", "Post. Mean"), lwd=2,
       col=c("black", "black", "#D55E00", "#009E73", "#0072B2"), lty=c(2,1,2,3,2), bty="n")

measures <- rbind(measures, 
                  c(a5.post, b5.post,
                    Beta.Exp(a5.post, b5.post),
                    Beta.Mode(a5.post, b5.post),
                    Beta.Median(a5.post, b5.post),
                    Beta.Var(a5.post, b5.post)))
rownames(measures)[5] <- "E"
round(measures,4)
  a.post b.post    Exp   Mode Median    Var
A     44     58 0.4314 0.4300 0.4309 0.0024
B     53     67 0.4417 0.4407 0.4413 0.0020
C     53     62 0.4609 0.4602 0.4606 0.0021
D     48     67 0.4174 0.4159 0.4169 0.0021
E    143   1057 0.1192 0.1185 0.1190 0.0001

Interval Estimates:

Credible Sets (CS).

The CS are very easy to compute in this scenario. Indeed, it is sufficient to calculate the appropriate quantiles of the posterior distribution.

Let us consider scenario D:

a <- a4.post
b <- b4.post

curve(dbeta(x, a, b), main="Scenario D",
      xlab=expression(theta), lwd=2)

# CS of level 0.95
CS <- qbeta(c(0.025,0.975), a, b); CS
[1] 0.3291889 0.5083162
abline(v=CS,lty="dashed", col="#0072B2")
legend(0.8,8,c("Posterior", "Credible Set"),
       lty=c(1,2), col=c("black", "#0072B2"), bty="n")

Highest posterior density (HPD):

Let’s start by considering an initial value for \(h\).

h <- 2
curve(dbeta(x, a, b), 
      ylab=expression(paste(pi,"(",theta,"|x)")),
      xlab=expression(theta))
abline(h=h,lty=2)

We need to find the values of theta such that \(P(\theta|\textbf{y}) = h\).

To do this, we progressively lower the posterior density curve by an amount \(h\).

curve(dbeta(x, a, b), 
      ylab=expression(paste(pi,"(",theta,"|x)")),
      xlab=expression(theta))
abline(h=h,lty=2)

translated <- function(x, a, b) dbeta(x,a, b) - h 

curve(translated(x, a, b), add = T, lty = 3, lwd = 2)
abline(h = 0, lty = 3)

The values we are looking for are those where the translated curve —that is, the original posterior density minus \(h\)— becomes zero.

hpd1 <- uniroot(translated,c(.2, .4),a,b)$root; hpd1
[1] 0.3384793
hpd2 <- uniroot(translated,c(.45, .5),a,b)$root; hpd2
[1] 0.4962774
integrate(dbeta, lower=hpd1, upper=hpd2, shape1=a, shape2=b)
0.9152466 with absolute error < 2.5e-14

We have a probability equal to 0.9152. Thus, we have to select a smaller \(h\). Iteratively:

h.grid <- seq(1, 2, by = 0.01)   

res <- matrix(NA, ncol=4, nrow=length(h.grid))
colnames(res) <- c("HPD1", "HPD2", "level", "h")

for(i in 1:length(h.grid)){        
  translated <- function(x, a, b) dbeta(x,a, b) - h.grid[i] 
  
  hpd1<-uniroot(translated,c(.2,.4),a,b)$root
  hpd2<-uniroot(translated,c(.43,.55),a,b)$root
  
  I <- integrate(dbeta, lower=hpd1, upper=hpd2, shape1=a, shape2=b)$value
  
  iter <- i
  res[i,] <- c(hpd1, hpd2, I, h.grid[i])
  if(I <= 0.95) break
}

res[1:iter,]
           HPD1      HPD2     level    h
 [1,] 0.3225742 0.5134903 0.9635447 1.00
 [2,] 0.3228413 0.5132645 0.9630493 1.01
 [3,] 0.3230458 0.5130403 0.9626140 1.02
 [4,] 0.3232488 0.5128178 0.9621776 1.03
 [5,] 0.3234504 0.5125970 0.9617401 1.04
 [6,] 0.3236505 0.5123778 0.9613016 1.05
 [7,] 0.3238493 0.5121602 0.9608621 1.06
 [8,] 0.3240466 0.5119441 0.9604215 1.07
 [9,] 0.3242426 0.5117296 0.9599800 1.08
[10,] 0.3244373 0.5115165 0.9595373 1.09
[11,] 0.3246306 0.5113050 0.9590937 1.10
[12,] 0.3248227 0.5110948 0.9586491 1.11
[13,] 0.3250135 0.5108865 0.9582038 1.12
[14,] 0.3252030 0.5106809 0.9577589 1.13
[15,] 0.3253914 0.5104765 0.9573130 1.14
[16,] 0.3255785 0.5102735 0.9568661 1.15
[17,] 0.3257644 0.5100717 0.9564182 1.16
[18,] 0.3259492 0.5098712 0.9559692 1.17
[19,] 0.3261329 0.5096720 0.9555191 1.18
[20,] 0.3263154 0.5094740 0.9550681 1.19
[21,] 0.3264968 0.5092772 0.9546160 1.20
[22,] 0.3266772 0.5090816 0.9541629 1.21
[23,] 0.3268564 0.5088872 0.9537088 1.22
[24,] 0.3270346 0.5086939 0.9532537 1.23
[25,] 0.3272118 0.5085017 0.9527976 1.24
[26,] 0.3273880 0.5083107 0.9523404 1.25
[27,] 0.3275631 0.5081208 0.9518822 1.26
[28,] 0.3277373 0.5079320 0.9514231 1.27
[29,] 0.3279104 0.5077442 0.9509629 1.28
[30,] 0.3280827 0.5075575 0.9505017 1.29
[31,] 0.3282539 0.5073718 0.9500395 1.30
[32,] 0.3284243 0.5071872 0.9495763 1.31
h.grid[iter]
[1] 1.31
HPD <- res[iter-1,-c(3,4)]; HPD
     HPD1      HPD2 
0.3282539 0.5073718 
CS
[1] 0.3291889 0.5083162

We can compare the two interval estimates in a graphical way:

curve(dbeta(x, a, b), main="Scenario D",
      xlab=expression(theta), lwd=2)

abline(v=CS,lty="dashed", col="#0072B2")
abline(v=HPD,lty="dashed", col="#D55E00")

legend(0.6,8,c("Posterior", "Credible Set", "HPD"),
       lty=c(1,2,2), col=c("black", "#0072B2", "#D55E00"), bty="n")

HPD and CS are very similar, since the posterior is unimodal and almost symmetrical.

Hypothesis testing:

Let consider \(H_0: \theta <= 0.3\) vs \(H_1: \theta > 0.3\).

pbeta(.3, a4, b4)
[1] 0.4157988
pbeta(.3, a4.post, b4.post)
[1] 0.004031738
# ODDS.prior <- pbeta(.3, a4, b4)/(1-pbeta(.3, a4, b4)); ODDS.prior
# ODDS.post <- pbeta(.3, a4.post, b4.post)/(1-pbeta(.3, a4.post, b4.post)); ODDS.post
# 
# Bayes_Factor <- ODDS.post/ODDS.prior; Bayes_Factor

Data do not support \(H_0\): by adding data to the prior belief, our confidence on \(H_0\) strongly decreases

HOMEWORK:

Consider a sample of size \(n = 10\) composed of Bernoulli trials, for which we observed \(s = 3\) successes. Consider the prior \(\theta \sim Beta(2, 8)\).

    1. Compute the posterior distribution of theta.
    1. Compute the MLE and the prior and posterior means.
    1. Compute and compare the 95% CS and HPD.
    1. Let \(H_0 : \theta \in (0.3, 0.5)\) vs. \(H_1 : \theta \in (0, 0.3) \cup (0.5, 1)\). Which hypothesis do you support?