Most ebook files are in PDF format, so you can easily read them using software such as Foxit Reader or directly in the Google Chrome browser.
Some ebook files are released by publishers in other formats such as .azw, .mobi, .epub, .fb2, etc. You may need to install specific software, such as Calibre, to read these formats on mobile or PC.
Please read the tutorial at this link: https://ebooknice.com/page/post?id=faq
We offer FREE conversion to the popular formats you request; however, this may take some time. Therefore, right after payment, please email us and we will provide the converted file as quickly as possible.
For exceptional file formats or broken links (if any), please do not open a dispute. Instead, email us first and we will assist within a maximum of 6 hours.
EbookNice Team
Status: Available
Rating: 0.0 (0 reviews)
ISBN-10: 1439898200
ISBN-13: 9781439898208
Authors: Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin
Winner of the 2016 De Groot Prize from the International Society for Bayesian Analysis. Now in its third edition, this classic book is widely considered the leading text on Bayesian methods, lauded for its accessible, practical approach to analyzing data and solving research problems. Bayesian Data Analysis, Third Edition continues to take an applied approach.
Part I: Fundamentals of Bayesian Inference
Chapter 1 Probability and inference
1.1 The three steps of Bayesian data analysis
1.2 General notation for statistical inference
Parameters, data, and predictions
Observational units and variables
Exchangeability
Explanatory variables
Hierarchical modeling
1.3 Bayesian inference
Probability notation
Bayes’ rule
Prediction
Likelihood
Likelihood and odds ratios
1.4 Discrete probability examples: genetics and spell checking
Inference about a genetic status
Spelling correction
1.5 Probability as a measure of uncertainty
Subjectivity and objectivity
1.6 Example of probability assignment: football point spreads
Figure 1.1 Scatterplot of actual outcome vs. point spread for each of 672 professional football games. The x and y coordinates are jittered by adding uniform random numbers to each point's coordinates (between −0.1 and 0.1 for the x coordinate; between −0.2 and 0.2 for the y coordinate) in order to display multiple values but preserve the discrete-valued nature of the data.
Football point spreads and game outcomes
Assigning probabilities based on observed frequencies
Figure 1.2 (a) Scatterplot of (actual outcome − point spread) vs. point spread for each of 672 professional football games (with uniform random jitter added to the x and y coordinates). (b) Histogram of the differences between the game outcome and the point spread, with the N(0, 14²) density superimposed.
A parametric model for the difference between outcome and point spread
Assigning probabilities using the parametric model
1.7 Example: estimating the accuracy of record linkage
Existing methods for assigning scores to potential matches
Figure 1.3 Histograms of weight scores y for true and false matches in a sample of records from the 1988 test Census. Most of the matches in the sample are true (because a pre-screening process has already picked these as the best potential match for each case), and the two distributions are mostly, but not completely, separated.
Estimating match probabilities empirically
Figure 1.4 Lines show expected false-match rate (and 95% bounds) as a function of the proportion of cases declared matches, based on the mixture model for record linkage. Dots show the actual false-match rate for the data.
External validation of the probabilities using test data
Figure 1.5 Expansion of Figure 1.4 in the region where the estimated and actual match rates change rapidly. In this case, it would seem a good idea to match about 88% of the cases and send the rest to followup.
1.8 Some useful results from probability theory
Modeling using conditional probability
Means and variances of conditional distributions
Transformation of variables
1.9 Computation and software
Summarizing inferences by simulation
Sampling using the inverse cumulative distribution function
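The subsection above refers to the standard inverse-CDF technique: apply the inverse cumulative distribution function to a Uniform(0, 1) draw, and the result has the desired distribution. A minimal sketch of that idea, not code from the book; the exponential distribution and its rate value are chosen purely for illustration:

```python
import numpy as np

def sample_inverse_cdf(inv_cdf, n, rng=None):
    """Draw n samples by applying the inverse CDF to uniform draws."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=n)          # u ~ Uniform(0, 1)
    return inv_cdf(u)                # x = F^{-1}(u) has CDF F

# Example: Exponential(rate) has F(x) = 1 - exp(-rate * x),
# so F^{-1}(u) = -log(1 - u) / rate.
rate = 2.0
draws = sample_inverse_cdf(lambda u: -np.log(1.0 - u) / rate, n=10_000)
print(draws.mean())  # should be close to 1/rate = 0.5
```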
Simulation of posterior and posterior predictive quantities
Table 1.1 Structure of posterior and posterior predictive simulations. The superscripts are indexes, not powers.
1.10 Bayesian inference in applied statistics
1.11 Bibliographic note
1.12 Exercises
Chapter 2 Single-parameter models
2.1 Estimating a probability from binomial data
Example. Estimating the probability of a female birth
Figure 2.1 Unnormalized posterior density for binomial parameter θ, based on uniform prior distribution and y successes out of n trials. Curves displayed for several values of n and y.
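As a rough illustration of the curves in Figure 2.1, the following minimal sketch (not code from the book) evaluates the unnormalized posterior θ^y (1 − θ)^(n−y), which is what a uniform prior gives, on a grid of θ values; the (n, y) pairs are arbitrary examples:

```python
import numpy as np

def unnormalized_binomial_posterior(theta, y, n):
    """p(theta | y) proportional to theta^y * (1 - theta)^(n - y) under a uniform prior."""
    return theta**y * (1.0 - theta)**(n - y)

theta_grid = np.linspace(0.0, 1.0, 1001)
for n, y in [(5, 3), (20, 12), (100, 60)]:   # illustrative (n, y) pairs, not the book's
    density = unnormalized_binomial_posterior(theta_grid, y, n)
    mode = theta_grid[np.argmax(density)]
    print(f"n={n:3d}, y={y:3d}: posterior mode ≈ {mode:.2f}")  # equals y/n for a uniform prior
```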
Historical note: Bayes and Laplace
Prediction
2.2 Posterior as compromise between data and prior information
2.3 Summarizing posterior inference
Figure 2.2 Hypothetical density for which the 95% central interval and 95% highest posterior density region dramatically differ: (a) central posterior interval, (b) highest posterior density region.
Posterior quantiles and intervals
2.4 Informative prior distributions
Binomial example with different prior distributions
Conjugate prior distributions
Nonconjugate prior distributions
Conjugate prior distributions, exponential families, and sufficient statistics
Example. Probability of a girl birth given placenta previa
Figure 2.3 Draws from the posterior distribution of (a) the probability of female birth, θ; (b) the logit transform, logit(θ); (c) the male-to-female sex ratio, φ = (1 − θ)/θ.
Table 2.1 Summaries of the posterior distribution of θ, the probability of a girl birth given placenta previa, under a variety of conjugate prior distributions.
Figure 2.4 (a) Prior density for θ in a nonconjugate analysis of the birth ratio example; (b) histogram of 1000 draws from a discrete approximation to the posterior density. Figures are plotted on different scales.
2.5 Estimating a normal mean with known variance
Likelihood of one data point
Conjugate prior and posterior distributions
Posterior predictive distribution
Normal model with multiple observations
2.6 Other standard single-parameter models
Normal distribution with known mean but unknown variance
Poisson model
Poisson model parameterized in terms of rate and exposure
Estimating a rate from Poisson data: an idealized example
Figure 2.5 Posterior density for θ, the asthma mortality rate in cases per 100,000 persons per year, with a Gamma(3.0, 5.0) prior distribution: (a) given y = 3 deaths out of 200,000 persons; (b) given y = 30 deaths in 10 years for a constant population of 200,000. The histograms appear jagged because they are constructed from only 1000 random draws from the posterior distribution in each case.
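Under the conjugacy used in this example, a Gamma(α, β) prior combined with a Poisson count y and exposure x gives a Gamma(α + y, β + x) posterior, so histograms like those in Figure 2.5 come from direct simulation. A minimal sketch under that standard result (the prior and exposures mirror the caption, but the code is not from the book):

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior_rate_draws(y, exposure, alpha=3.0, beta=5.0, n_draws=1000):
    """Draws of theta (rate per 100,000 person-years) from Gamma(alpha + y, beta + exposure)."""
    # numpy parameterizes the gamma by shape and scale = 1/rate
    return rng.gamma(shape=alpha + y, scale=1.0 / (beta + exposure), size=n_draws)

# (a) y = 3 deaths, exposure 2.0 (200,000 persons in units of 100,000 person-years)
# (b) y = 30 deaths over 10 years, exposure 20.0
for y, x in [(3, 2.0), (30, 20.0)]:
    theta = posterior_rate_draws(y, x)
    print(f"y={y:2d}, exposure={x:4.1f}: posterior mean ≈ {theta.mean():.2f}, "
          f"95% interval ≈ ({np.percentile(theta, 2.5):.2f}, {np.percentile(theta, 97.5):.2f})")
```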
Exponential model
Figure 2.6 The counties of the United States with the highest 10% age-standardized death rates for cancer of kidney/ureter for U.S. white males, 1980–1989. Why are most of the shaded counties in the middle of the country? See Section 2.7 for discussion.
2.7 Example: informative prior distribution for cancer rates
A puzzling pattern in a map
Figure 2.7 The counties of the United States with the lowest 10% age-standardized death rates for cancer of kidney/ureter for U.S. white males, 1980–1989. Surprisingly, the pattern is somewhat similar to the map of the highest rates, shown in Figure 2.6.
Bayesian inference for the cancer death rates
Relative importance of the local data and the prior distribution
Figure 2.8 (a) Kidney cancer death rates yj/(10nj) vs. population size nj. (b) Replotted on the scale of log10 population to see the data more clearly. The patterns come from the discreteness of the data (yj = 0, 1, 2,...).
Figure 2.9 (a) Bayes-estimated posterior mean kidney cancer death rates, vs. logarithm of population size nj, for the 3071 counties in the U.S. (b) Posterior medians and 50% intervals for θj for a sample of 100 counties j. The scales of the y-axes differ from those of the plots in Figure 2.8b.
Constructing a prior distribution
Figure 2.10 Empirical distribution of the age-adjusted kidney cancer death rates, for the 3071 counties in the U.S., along with the Gamma(20, 430,000) prior distribution for the underlying cancer rates θj.
2.8 Noninformative prior distributions
Proper and improper prior distributions
Improper prior distributions can lead to proper posterior distributions
Jeffreys’ invariance principle
Various noninformative prior distributions for the binomial parameter
Pivotal quantities
Difficulties with noninformative prior distributions
2.9 Weakly informative prior distributions
Constructing a weakly informative prior distribution
2.10 Bibliographic note
2.11 Exercises
Table 2.2 Worldwide airline fatalities, 1976–1985. Death rate is passenger deaths per 100 million passenger miles. Source: Statistical Abstract of the United States.
Chapter 3 Introduction to multiparameter models
3.1 Averaging over ‘nuisance parameters’
3.2 Normal data with a noninformative prior distribution
A noninformative prior distribution
The conditional posterior distribution, p(μ|σ2, y)
The marginal posterior distribution, p(σ2|y)
Sampling from the joint posterior distribution
Analytic form of the marginal posterior distribution of μ
Posterior predictive distribution for a future observation
Example. Estimating the speed of light
Figure 3.1 Histogram of Simon Newcomb's measurements for estimating the speed of light, from Stigler (1977). The data are recorded as deviations from 24,800 nanoseconds.
3.3 Normal data with a conjugate prior distribution
A family of conjugate prior distributions
The joint posterior distribution, p(μ, σ2|y)
The conditional posterior distribution, p(μ|σ2, y)
The marginal posterior distribution, p(σ2|y)
Sampling from the joint posterior distribution
Analytic form of the marginal posterior distribution of μ
3.4 Multinomial model for categorical data
Example. Pre-election polling
Figure 3.2 Histogram of values of (θ1 − θ2) for 1000 simulations from the posterior distribution for the election polling example.
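For the multinomial model of Section 3.4 with a Dirichlet prior, the posterior is again Dirichlet, so a histogram like Figure 3.2 comes from direct simulation. A minimal sketch assuming a Dirichlet(1, 1, 1) prior; the survey counts below are illustrative and may not match the book's exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative counts for (candidate 1, candidate 2, other/no opinion).
counts = np.array([727, 583, 137])
alpha_prior = np.ones(3)                                  # Dirichlet(1, 1, 1) prior

theta = rng.dirichlet(alpha_prior + counts, size=1000)    # 1000 posterior draws of (θ1, θ2, θ3)
diff = theta[:, 0] - theta[:, 1]                          # θ1 − θ2 for each draw
print("Pr(θ1 > θ2) ≈", (diff > 0).mean())
print("posterior median of θ1 − θ2 ≈", np.median(diff))
```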
3.5 Multivariate normal model with known variance
Multivariate normal likelihood
Conjugate analysis
3.6 Multivariate normal with unknown mean and variance
Conjugate inverse-Wishart family of prior distributions
Different noninformative prior distributions
Table 3.1: Bioassay data from Racine et al. (1986).
Scaled inverse-Wishart model
3.7 Example: analysis of a bioassay experiment
The scientific problem and the data
Modeling the dose-response relation
The likelihood
The prior distribution
Figure 3.3 (a) Contour plot for the posterior density of the parameters in the bioassay example. Contour lines are at 0.05, 0.15,..., 0.95 times the density at the mode. (b) Scatterplot of 1000 draws from the posterior distribution.
A rough estimate of the parameters
Obtaining a contour plot of the joint posterior density
Sampling from the joint posterior distribution
Figure 3.4 Histogram of the draws from the posterior distribution of the LD50 (on the scale of log dose in g/ml) in the bioassay example, conditional on the parameter β being positive.
The posterior distribution of the LD50
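In the logistic dose-response model of this section, logit(Pr(death)) = α + βx with x the log dose, the LD50 is the dose at which the death probability is one half, x = −α/β, so the histogram in Figure 3.4 is obtained by transforming posterior draws of (α, β) that have β > 0. A minimal sketch; the normal posterior draws below (with made-up mean and covariance) are only a stand-in for the grid-based posterior draws used in the chapter:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in posterior draws of (alpha, beta); in the book these would come from
# a grid approximation to the joint posterior, not from a normal distribution.
mean = np.array([1.0, 10.0])                     # hypothetical posterior means
cov = np.array([[1.0, 3.0], [3.0, 25.0]])        # hypothetical posterior covariance
draws = rng.multivariate_normal(mean, cov, size=1000)
alpha, beta = draws[:, 0], draws[:, 1]

keep = beta > 0                                  # LD50 is only meaningful for beta > 0
ld50 = -alpha[keep] / beta[keep]                 # log dose at which Pr(death) = 0.5
print(f"kept {keep.sum()} of 1000 draws; LD50 median ≈ {np.median(ld50):.2f} (log g/ml)")
```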
3.8 Summary of elementary modeling and computation
3.9 Bibliographic note
Table 3.2 Number of respondents in each preference category from ABC News pre- and post-debate surveys in 1988.
3.10 Exercises
Table 3.3 Counts of bicycles and other vehicles in one hour in each of 10 city blocks in each of six categories. (The data for two of the residential blocks were lost.) For example, the first block had 16 bicycles and 58 other vehicles, the second had 9 bicycles and 90 other vehicles, and so on. Streets were classified as ‘residential,’ ‘fairly busy,’ or ‘busy’ before the data were gathered.
Chapter 4 Asymptotics and connections to non-Bayesian approaches
4.1 Normal approximations to the posterior distribution
Normal approximation to the joint posterior distribution
Example. Normal distribution with unknown mean and variance
Interpretation of the posterior density function relative to its maximum
Summarizing posterior distributions by point estimates and standard errors
Data reduction and summary statistics
Lower-dimensional normal approximations
Figure 4.1 (a) Contour plot of the normal approximation to the posterior distribution of the parameters in the bioassay example. Contour lines are at 0.05, 0.15,..., 0.95 times the density at the mode. Compare to Figure 3.3a. (b) Scatterplot of 1000 draws from the normal approximation to the posterior distribution. Compare to Figure 3.3b.
Example. Bioassay experiment (continued)
Figure 4.2 (a) Histogram of the simulations of LD50, conditional on β > 0, in the bioassay example based on the normal approximation to p(α, β|y). The wide tails of the histogram correspond to values of β close to 0. Omitted from this histogram are five simulation draws with values of LD50 less than −2 and four draws with values greater than 2; the extreme tails are truncated to make the histogram visible. The values of LD50 for the 950 simulation draws corresponding to β > 0 had a range of [−12.4, 5.4]. Compare to Figure 3.4. (b) Histogram of the central 95% of the distribution.
4.2 Large-sample theory
Notation and mathematical setup
Asymptotic normality and consistency
Likelihood dominating the prior distribution
4.3 Counterexamples to the theorems
4.4 Frequency evaluations of Bayesian inferences
Large-sample correspondence
Point estimation, consistency, and efficiency
Confidence coverage
4.5 Bayesian interpretations of other statistical methods
Maximum likelihood and other point estimates
Unbiased estimates
Example. Prediction using regression
Confidence intervals
Hypothesis testing
Multiple comparisons and multilevel modeling
Nonparametric methods, permutation tests, jackknife, bootstrap
Example. The Wilcoxon rank test
4.6 Bibliographic note
4.7 Exercises
Chapter 5 Hierarchical models
Table 5.1 Tumor incidence in historical control groups and current group of rats, from Tarone (1982). The table displays the values of (number of rats with tumors)/(total number of rats).
5.1 Constructing a parameterized prior distribution
Analyzing a single experiment in the context of historical data
Example. Estimating the risk of tumor in a group of rats
Figure 5.1: Structure of the hierarchical model for the rat tumor example.
Logic of combining information
5.2 Exchangeability and setting up hierarchical models
Exchangeability
Example. Exchangeability and sampling
Exchangeability when additional information is available on the units
Objections to exchangeable models
The full Bayesian treatment of the hierarchical model
The hyperprior distribution
Posterior predictive distributions
5.3 Fully Bayesian analysis of conjugate hierarchical models
Analytic derivation of conditional and marginal distributions
Drawing simulations from the posterior distribution
Application to the model for rat tumors
Figure 5.2 First try at a contour plot of the marginal posterior density of (log(α/β), log(α+β)) for the rat tumor example. Contour lines are at 0.05, 0.15,..., 0.95 times the density at the mode.
Figure 5.3 (a) Contour plot of the marginal posterior density of (log(α/β), log(α+β)) for the rat tumor example. Contour lines are at 0.05, 0.15,..., 0.95 times the density at the mode. (b) Scatterplot of 1000 draws from the numerically computed marginal posterior density.
Figure 5.4 Posterior medians and 95% intervals of rat tumor rates, θj (plotted vs. observed tumor rates yj/nj), based on simulations from the joint posterior distribution. The 45° line corresponds to the unpooled estimates, yj/nj. The horizontal positions have been jittered to reduce overlap.
5.4 Estimating exchangeable parameters from a normal model
The data structure
Constructing a prior distribution from pragmatic considerations
The hierarchical model
The joint posterior distribution
The conditional posterior distribution of the normal means, given the hyperparameters
The marginal posterior distribution of the hyperparameters
Computation
Posterior predictive distributions
Difficulty with a natural non-Bayesian estimate of the hyperparameters
5.5 Example: parallel experiments in eight schools
Inferences based on nonhierarchical models and their problems
Table 5.2 Observed effects of special preparation on SAT-V scores in eight randomized experiments. Estimates are based on separate analyses for the eight experiments.
Figure 5.5 Marginal posterior density, p(τ|y), for standard deviation of the population of school effects θj in the educational testing example.
Posterior simulation under the hierarchical model
Results
Figure 5.6 Conditional posterior means of treatment effects, E(θj|τ,y), as functions of the between-school standard deviation τ, for the educational testing example. The line for school C crosses the lines for E and F because C has a higher measurement error (see Table 5.2) and its estimate is therefore shrunk more strongly toward the overall mean in the Bayesian analysis.
Figure 5.7 Conditional posterior standard deviations of treatment effects, sd(θj|τ, y), as functions of the between-school standard deviation τ, for the educational testing example.
Discussion
Table 5.3: Summary of 200 simulations of the treatment effects in the eight schools.
Figure 5.8 Histograms of two quantities of interest computed from the 200 simulation draws: (a) the effect in school A, θ1; (b) the largest effect, max{θj}. The jaggedness of the histograms is just an artifact caused by sampling variability from using only 200 random draws.
Table 5.4 Results of 22 clinical trials of beta-blockers for reducing mortality after myocardial infarction, with empirical log-odds and approximate sampling variances. Data from Yusuf et al. (1985). Posterior quantiles of treatment effects are based on 5000 draws from a Bayesian hierarchical model described here. Negative effects correspond to reduced probability of death under the treatment.
5.6 Hierarchical modeling applied to a meta-analysis
Defining a parameter for each study
A normal approximation to the likelihood
Goals of inference in meta-analysis
What if exchangeability is inappropriate?
A hierarchical normal model
Table 5.5 Summary of posterior inference for the overall mean and standard deviation of study effects, and for the predicted effect in a hypothetical future study, from the meta-analysis of the beta-blocker trials in Table 5.4. All effects are on the log-odds scale.
Results of the analysis and comparison to simpler methods
5.7 Weakly informative priors for hierarchical variance parameters
Concepts relating to the choice of prior distribution
Classes of noninformative and weakly informative prior distributions for hierarchical variance parameters
Application to the 8-schools example
Figure 5.9 Histograms of posterior simulations of the between-school standard deviation, τ, from models with three different prior distributions: (a) uniform prior distribution on τ, (b) inverse-gamma(1, 1) prior distribution on τ², (c) inverse-gamma(0.001, 0.001) prior distribution on τ². Overlain on each is the corresponding prior density function for τ. (For models (b) and (c), the density for τ is calculated using the gamma density function multiplied by the Jacobian of the 1/τ² transformation.) In models (b) and (c), posterior inferences are strongly constrained by the prior distribution.
Weakly informative prior distribution for the 3-schools problem
Figure 5.10 Histograms of posterior simulations of the between-school standard deviation, τ, from models for the 3-schools data with two different prior distributions on τ: (a) uniform (0, ∞), (b) half-Cauchy with scale 25, set as a weakly informative prior distribution given that τ was expected to be well below 100. The histograms are not on the same scales. Overlain on each histogram is the corresponding prior density function. With only J = 3 groups, the noninformative uniform prior distribution is too weak, and the proper Cauchy distribution works better, without appearing to distort inferences in the area of high likelihood.
5.8 Bibliographic note
5.9 Exercises
Part II: Fundamentals of Bayesian Data Analysis
Chapter 6 Model checking
6.1 The place of model checking in applied Bayesian statistics
Sensitivity analysis and model improvement
Judging model flaws by their practical implications
6.2 Do the inferences from the model make sense?
Example. Evaluating election predictions by comparing to substantive political knowledge
External validation
Figure 6.1 Summary of a forecast of the 1992 U.S. presidential election performed one month before the election. For each state, the proportion of the box that is shaded represents the estimated probability of Clinton winning the state; the width of the box is proportional to the number of electoral votes for the state.
Choices in defining the predictive quantities
6.3 Posterior predictive checking
Example. Comparing Newcomb's speed of light measurements to the posterior predictive distribution
Figure 6.2 Twenty replications, yrep, of the speed of light data from the posterior predictive distribution, p(yrep|y); compare to observed data, y, in Figure 3.1. Each histogram displays the result of drawing 66 independent values from a common normal distribution with mean and variance (μ, σ2) drawn from the posterior distribution, p(μ, σ2|y), under the normal model.
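The replications in Figure 6.2 can be generated directly once posterior draws of (μ, σ²) are available: for each draw, simulate 66 new observations from N(μ, σ²). A minimal sketch under the noninformative prior of Section 3.2 (so σ² is drawn from its scaled inverse-χ² posterior and μ from its conditional normal); the data vector here is a random placeholder, not Newcomb's actual measurements:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(26.2, 10.8, size=66)     # placeholder for Newcomb's 66 measurements

n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

def replicate_once():
    """One posterior predictive replication under the normal model with noninformative prior."""
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)        # sigma^2 | y ~ scaled inverse-chi^2
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))          # mu | sigma^2, y ~ N(ybar, sigma^2/n)
    return rng.normal(mu, np.sqrt(sigma2), size=n)      # y_rep | mu, sigma^2

reps = [replicate_once() for _ in range(20)]
print("smallest value in each replication:", np.round([r.min() for r in reps], 1))
print("smallest observed value:", round(y.min(), 1))
```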
Figure 6.3 Smallest observation of Newcomb's speed of light data (the vertical line at the left of the graph), compared to the smallest observations from each of the 20 posterior predictive simulated datasets displayed in Figure 6.2.
Notation for replications
Test quantities
Tail-area probabilities
Figure 6.4 Realized vs. posterior predictive distributions for two more test quantities in the speed of light example: (a) Sample variance (vertical line at 115.5), compared to 200 simulations from the posterior predictive distribution of the sample variance. (b) Scatterplot showing prior and posterior simulations of a test quantity: T(y, θ) = |y(61) − θ| − |y(6) − θ| (horizontal axis) vs. T(yrep, θ) = |yrep(61) − θ| − |yrep(6) − θ| (vertical axis) based on 200 simulations from the posterior distribution of (θ, yrep). The p-value is computed as the proportion of points in the upper-left half of the scatterplot.
Choosing test quantities
Example. Checking the assumption of independence in binomial trials
Figure 6.5 Observed number of switches (vertical line at T(y) = 3), compared to 10,000 simulations from the posterior predictive distribution of the number of switches, T(yrep).
Example. Checking the fit of hierarchical regression models for adolescent smoking
Figure 6.6 Prevalence of regular (daily) smoking among participants responding at each wave in the study of Australian adolescents (who were on average 15 years old at wave 1).
Table 6.1 Summary of posterior predictive checks for three test statistics for two models fit to the adolescent smoking data: (1) hierarchical logistic regression, and (2) hierarchical logistic regression with a mixture component for never-smokers. The second model better fits the percentages of never- and always-smokers, but still has a problem with the percentage of ‘incident smokers,’ who are defined as persons who report incidents of nonsmoking followed by incidents of smoking.
Multiple comparisons
Interpreting posterior predictive p-values
Limitations of posterior tests
P-values and u-values
Model checking and the likelihood principle
Marginal predictive checks
6.4 Graphical posterior predictive checks
Figure 6.7 Left column displays observed data y (a 15 × 23 array of binary responses from each of 6 persons); right columns display seven replicated datasets yrep from a fitted logistic regression model. A misfit of model to data is apparent: the data show strong row and column patterns for individual persons (for example, the nearly white row near the middle of the last person's data) that do not appear in the replicates. (To make such patterns clearer, the indexes of the observed and each replicated dataset have been arranged in increasing order of average response.)
Direct data display
Figure 6.8 Redisplay of Figure 6.7 without ordering the rows, columns, and persons in order of increasing response. Once again, the left column shows the observed data and the right columns show replicated datasets from the model. Without the ordering, it is difficult to notice the discrepancies between data and model, which are easily apparent in Figure 6.7.
Displaying summary statistics or inferences
Figure 6.9 Histograms of (a) 90 patient parameters and (b) 69 symptom parameters, from a single draw from the posterior distribution of a psychometric model. These histograms of posterior estimates contradict the assumed Beta(2, 2) prior densities (overlain on the histograms) for each batch of parameters, and motivated us to switch to mixture prior distributions. This implicit comparison to the values under the prior distribution can be viewed as a posterior predictive check in which a new set of patients and a new set of symptoms are simulated.
Figure 6.10 Histograms of (a) 90 patient parameters and (b) 69 symptom parameters, as estimated from an expanded psychometric model. The mixture prior densities (overlain on the histograms) are not perfect, but they approximate the corresponding histograms much better than the Beta(2, 2) densities in Figure 6.9.
Residual plots and binned residual plots
Figure 6.11 (a) Residuals (observed − expected) vs. expected values for a model of pain relief scores (0 = no pain relief, ..., 5 = complete pain relief). (b) Average residuals vs. expected pain scores, with measurements divided into 20 equally sized bins defined by ranges of expected pain scores. The average prediction errors are relatively small (note the scale of the y-axis), but with a consistent pattern that low predictions are too low and high predictions are too high. Dotted lines show 95% bounds under the model.
General interpretation of graphs as model checks
6.5 Model checking for the educational testing example
Assumptions of the model
Comparing posterior inferences to substantive knowledge
Posterior predictive checking
Sensitivity analysis
Figure 6.12 Posterior predictive distribution, observed result, and p-value for each of four test statistics for the educational testing example.
6.6 Bibliographic note
6.7 Exercises
Chapter 7 Evaluating, comparing, and expanding models
Example. Forecasting presidential elections
Figure 7.1 Douglas Hibbs's ‘bread and peace’ model of voting and the economy. Presidential elections since 1952 are listed in order of the economic performance at the end of the preceding administration (as measured by inflation-adjusted growth in average personal income). The better the economy, the better the incumbent party's candidate generally does, with the biggest exceptions being 1952 (Korean War) and 1968 (Vietnam War).
7.1 Measures of predictive accuracy
Predictive accuracy for a single data point
Averaging over the distribution of future data
Evaluating predictive accuracy for a fitted model
Choices in defining the likelihood and predictive quantities
7.2 Information criteria and cross-validation
Estimating out-of-sample predictive accuracy using available data
Log predictive density asymptotically, or for normal linear models
Figure 7.2 Posterior distribution of the log predictive density log p(y|θ) for the election forecasting example. The variation comes from posterior uncertainty in θ. The maximum value of the distribution, −40.3, is the log predictive density when θ is at the maximum likelihood estimate. The mean of the distribution is −42.0, and the difference between the mean and the maximum is 1.7, which is close to the value of 3/2 that would be predicted from asymptotic theory, given that we are estimating 3 parameters (two coefficients and a residual variance).
Example. Fit of the election forecasting model: Bayesian inference
Akaike information criterion (AIC)
Deviance information criterion (DIC) and effective number of parameters
Watanabe-Akaike or widely available information criterion (WAIC)
Effective number of parameters as a random variable
‘Bayesian’ information criterion (BIC)
Leave-one-out cross-validation
Comparing different estimates of out-of-sample prediction accuracy
Example. Predictive error in the election forecasting model
7.3 Model comparison based on predictive performance
Example. Expected predictive accuracy of models for the eight schools
Table 7.1 Deviance (−2 times log predictive density) and corrections for parameter fitting using AIC, DIC, WAIC (using the correction pWAIC 2), and leave-one-out cross-validation for each of three models fitted to the data in Table 5.2. Lower values of AIC/DIC/WAIC imply higher predictive accuracy. Blank cells in the table correspond to measures that are undefined: AIC is defined relative to the maximum likelihood estimate and so is inappropriate for the hierarchical model; cross-validation requires prediction for the held-out case, which is impossible under the no-pooling model. The no-pooling model has the best raw fit to data, but after correcting for fitted parameters, the complete-pooling model has lowest estimated expected predictive error under the different measures. In general, we would expect the hierarchical model to win, but in this particular case, setting τ = 0 (that is, the complete-pooling model) happens to give the best average predictive performance.
Evaluating predictive error comparisons
Bias induced by model selection
Challenges
7.4 Model comparison using Bayes factors
Example. A discrete example in which Bayes factors are helpful
Example. A continuous example where Bayes factors are a distraction
7.5 Continuous model expansion
Sensitivity analysis
Adding parameters to a model
Accounting for model choice in data analysis
Selection of predictors and combining information
Alternative model formulations
Practical advice for model checking and expansion
7.6 Implicit assumptions and model expansion: an example
Table 7.2 Summary statistics for populations of municipalities in New York State in 1960 (New York City was represented by its five boroughs); all 804 municipalities and two independent simple random samples of 100. From Rubin (1983a).
Example. Estimating a population total under simple random sampling using transformed normal models
7.7 Bibliographic note
7.8 Exercises
Table 7.3 Short-term measurements of radon concentration (in picoCuries/liter) in a sample of houses in three counties in Minnesota. All measurements were recorded on the basement level of the houses, except for those indicated with asterisks, which were recorded on the first floor.
Chapter 8 Modeling accounting for data collection
8.1 Bayesian inference requires a model for data collection
Generality of the observed- and missing-data paradigm
Table 8.1: Use of observed- and missing-data terminology for various data structures.
8.2 Data-collection models and ignorability
Notation for observed and missing data
Stability assumption
Fully observed covariates
Data model, inclusion model, and complete and observed data likelihood
Joint posterior distribution of parameters θ from the sampling model and φ from the missing-data model
Finite-population and superpopulation inference
Ignorability
‘Missing at random’ and ‘distinct parameters’
Ignorability and Bayesian inference under different data-collection schemes
Propensity scores
Unintentional missing data
8.3 Sample surveys
Simple random sampling of a finite population
Stratified sampling
Table 8.2 Results of a CBS News survey of 1447 adults in the United States, divided into 16 strata. The sampling is assumed to be proportional, so that the population proportions, Nj/N, are approximately equal to the sampling proportions, nj/n.
Example. Stratified sampling in pre-election polling
Figure 8.1 Values of (θ1 − θ2) for 1000 simulations from the posterior distribution for the election polling example, based on (a) the simple nonhierarchical model and (b) the hierarchical model. Compare to Figure 3.2.
Table 8.3 Summary of posterior inference for the hierarchical analysis of the CBS survey in Table 8.2. The posterior distributions for the α1j's vary from stratum to stratum much less than the raw counts do. The inference for α2,16 for stratum 16 is included above as a representative of the 16 parameters α2j. The parameters μ1 and μ2 are transformed to the inverse-logit scale so they can be more directly interpreted.
Cluster sampling
Example. A survey of Australian schoolchildren
Unequal probabilities of selection
Example. Sampling of Alcoholics Anonymous groups
8.4 Designed experiments
Completely randomized experiments
Table 8.4 Yields of plots of millet arranged in a Latin square. Treatments A, B, C, D, E correspond to spacings of width 2, 4, 6, 8, 10 inches, respectively. Yields are in grams per inch of spacing. From Snedecor and Cochran (1989).
Randomized blocks, Latin squares, etc.
Example. Latin square experiment
Sequential designs
Including additional predictors beyond the minimally adequate summary
Example. An experiment with treatment assignments based on observed covariates
8.5 Sensitivity and the role of randomization
Complete randomization
Randomization given covariates
Designs that ‘cheat’
Bayesian analysis of nonrandomized studies
8.6 Observational studies
Comparison to experiments
Figure 8.2 Hypothetical-data illustrations of sensitivity analysis for observational studies. In each graph, circles and dots represent treated and control units, respectively. (a) The first plot shows balanced data, as from a randomized experiment, and the difference between the two lines shows the estimated treatment effect from a simple linear regression. (b, c) The second and third plots show unbalanced data, as from a poorly conducted observational study, with two different models fit to the data. The estimated treatment effect for the unbalanced data in (b) and (c) is highly sensitive to the form of the fitted model, even when the treatment assignment is ignorable.
Bayesian inference for observational studies
Table 8.5 Summary statistics from an experiment on vitamin A supplements, where the vitamin was available (but optional) only to those assigned the treatment. The table shows number of units in each assignment/exposure/outcome condition. From Sommer and Zeger (1991).
Causal inference and principal stratification
Example. A randomized experiment with noncompliance
Complier average causal effects and instrumental variables
Bayesian causal inference with noncompliance
8.7 Censoring and truncation
1. Data missing completely at random
2. Data missing completely at random with unknown probability of missingness
3. Censored data
4. Censored data with unknown censoring point
5. Truncated data
6. Truncated data with unknown truncation point
More complicated patterns of missing data
8.8 Discussion
8.9 Bibliographic note
8.10 Exercises
Table 8.6 Yields of penicillin produced by four manufacturing processes (treatments), each applied in five different conditions (blocks). Four runs were made within each block, with the treatments assigned to the runs at random. From Box, Hunter, and Hunter (1978), who adjusted the data so that the averages are integers, a complication we ignore in our analysis.
Table 8.7 Respondents to the CBS telephone survey classified by opinion and number of residential telephone lines (category ‘?’ indicates no response to the number of phone lines question).
Table 8.8 Respondents to the CBS telephone survey classified by opinion, number of residential telephone lines (category ‘?’ indicates no response to the number of phone lines question), and number of adults in the household (category ‘?’ includes all responses greater than 8 as well as nonresponses).
Chapter 9 Decision analysis
9.1 Bayesian decision theory in different contexts
Bayesian inference and decision trees
Summarizing inference and model selection
9.2 Using regression predictions: incentives for telephone surveys
Background on survey incentives
Figure 9.1 Observed increase zi in response rate vs. the increased dollar value of incentive compared to the control condition, for experimental data from 39 surveys. Prepaid and postpaid incentives are indicated by closed and open circles, respectively. (The graphs show more than 39 points because many surveys had multiple treatment conditions.) The lines show expected increases for prepaid (solid lines) and postpaid (dashed lines) cash incentives as estimated from a hierarchical regression.
Data from 39 experiments
Setting up a Bayesian meta-analysis
Inferences from the model
Figure 9.2 Residuals of response rate meta-analysis data plotted vs. predicted values. Residuals for telephone and face-to-face surveys are shown separately. As in Figure 9.1, solid and open circles indicate surveys with prepaid and postpaid incentives, respectively.
Inferences about costs and response rates for the Social Indicators Survey
Figure 9.3 Expected increase in response rate vs. net added cost per respondent, for prepaid (solid lines) and postpaid (dotted lines) incentives, for surveys of individuals and caregivers. On each plot, heavy lines correspond to the estimated effects, with light lines showing ±1 standard error bounds. The numbers on the lines indicate incentive payments. At zero incentive payments, estimated effects and costs are nonzero because the models have nonzero intercepts (corresponding to the effect of making any contact at all) and we are assuming a $1.25 mailing and processing cost per incentive.
Loose ends
9.3 Multistage decision making: medical screening
Example with a single decision point
Adding a second decision point
9.4 Hierarchical decision analysis for radon measurement
Figure 9.4 Lifetime added risk of lung cancer, as a function of average radon exposure in picoCuries per liter (pCi/L). The median and mean radon levels in ground-contact houses in the U.S. are 0.67 and 1.3 pCi/L, respectively, and over 50,000 homes have levels above 20 pCi/L.
Background
The individual decision problem
Decision-making under certainty
Bayesian inference for county radon levels
Bayesian inference for the radon level in an individual house
Decision analysis for individual homeowners
Figure 9.5 Recommended radon remediation/measurement decision as a function of the perfect-information action level Raction and the prior geometric mean radon level eM, under the simplifying assumption that eS = 2.3. You can read off your recommended decision from this graph and, if the recommendation is ‘take a measurement,’ you can do so and then perform the calculations to determine whether to remediate, given your measurement. The horizontal axis of this figure begins at 2 pCi/L because remediation is assumed to reduce home radon level to 2 pCi/L, so it makes no sense for Raction to be lower than that value. Wiggles in the lines are due to simulation variability.
Figure 9.6 Maps showing (a) fraction of houses in each county for which measurement is recommended, given the perfect-information action level of Raction = 4 pCi/L; (b) expected fraction of houses in each county for which remediation will be recommended, once the measurement y has been taken. For the present radon model, within any county the recommendations on whether to measure and whether to remediate depend only on the house type: whether the house has a basement and whether the basement is used as living space. Apparent discontinuities across the boundaries of Utah and South Carolina arise from irregularities in the radon measurements from the radon surveys conducted by those states, an issue we ignore here.
Aggregate consequences of individual decisions
Figure 9.7 Expected lives saved vs. expected cost for various radon measurement/remediation strategies. Numbers indicate values of Raction. The solid line is for the recommended strategy of measuring only certain homes; the others assume that all homes are measured. All results are estimated totals for the U.S. over a 30-year period.
9.5 Personal vs. institutional decision analysis
9.6 Bibliographic note
9.7 Exercises
Part III: Advanced Computation
Chapter 10 Introduction to Bayesian computation
Normalized and unnormalized densities
Log densities
10.1 Numerical integration
Simulation methods
Deterministic methods
10.2 Distributional approximations
Crude estimation by ignoring some information
10.3 Direct simulation and rejection sampling
Direct approximation by calculating at a grid of points
Figure 10.1 Illustration of rejection sampling. The top curve is an approximation function, Mg(θ), and the bottom curve is the target density, p(θ|y). As required, Mg(θ) ≥ p(θ|y) for all θ. The vertical line indicates a single random draw θ from the density proportional to g. The probability that a sampled draw θ is accepted is the ratio of the height of the lower curve to the height of the higher curve at the value θ.
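The caption describes the generic accept/reject step: draw θ from the approximating density g and accept it with probability p(θ|y)/(M g(θ)). A minimal sketch of that logic with a toy unnormalized target and a uniform proposal, neither taken from the book:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy target: unnormalized density proportional to a Beta(3, 5) shape on (0, 1).
def target(theta):
    return theta**2 * (1.0 - theta)**4

M = 0.035   # chosen so that M * g(theta) >= target(theta) for the Uniform(0,1) proposal g
assert M >= target(np.linspace(0, 1, 10001)).max()

def rejection_sample(n_draws):
    """Draw from g = Uniform(0,1); accept with probability target(theta) / (M * g(theta))."""
    accepted = []
    while len(accepted) < n_draws:
        theta = rng.uniform()                     # proposal draw; g(theta) = 1 on (0, 1)
        if rng.uniform() < target(theta) / (M * 1.0):
            accepted.append(theta)
    return np.array(accepted)

draws = rejection_sample(2000)
print("sample mean ≈", draws.mean(), "(Beta(3, 5) mean is 3/8 = 0.375)")
```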
Simulating from predictive distributions
Rejection sampling
10.4 Importance sampling
Accuracy and efficiency of importance sampling estimates
Importance resampling
Uses of importance sampling in Bayesian computation
10.5 How many simulation draws are needed?
Example. Educational testing experiments
10.6 Computing environments
The Bugs family of programs
Stan
Other Bayesian software
10.7 Debugging Bayesian computing
Debugging using fake data
Model checking and convergence checking as debugging
10.8 Bibliographic note
10.9 Exercises
Chapter 11 Basics of Markov chain simulation
Figure 11.1 Five independent sequences of a Markov chain simulation for the bivariate unit normal distribution, with overdispersed starting points indicated by solid squares. (a) After 50 iterations, the sequences are still far from convergence. (b) After 1000 iterations, the sequences are nearer to convergence. Figure (c) shows the iterates from the second halves of the sequences; these represent a set of (correlated) draws from the target distribution. The points in Figure (c) have been jittered so that steps in which the random walks stood still are not hidden. The simulation is a Metropolis algorithm described in the example on page 278, with a jumping rule that has purposely been chosen to be inefficient so that the chains will move slowly and their random-walk-like aspect will be apparent.
11.1 Gibbs sampler
Figure 11.2 Four independent sequences of the Gibbs sampler for a bivariate normal distribution with correlation ρ = 0.8, with overdispersed starting points indicated by solid squares. (a) First 10 iterations, showing the componentwise updating of the Gibbs iterations. (b) After 500 iterations, the sequences have reached approximate convergence. Figure (c) shows the points from the second halves of the sequences, representing a set of correlated draws from the target distribution.
Example. Bivariate normal distribution
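For a bivariate normal target with correlation ρ, each full conditional is itself normal, θ1 | θ2 ~ N(ρθ2, 1 − ρ²) and symmetrically for θ2, which is what makes the Gibbs sampler of Figure 11.2 straightforward. A minimal sketch of that componentwise updating; the starting point and chain length are arbitrary choices, not the book's settings:

```python
import numpy as np

rng = np.random.default_rng(5)
rho = 0.8

def gibbs_bivariate_normal(n_iter, start=(2.5, -2.5)):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    theta = np.array(start, dtype=float)
    draws = np.empty((n_iter, 2))
    cond_sd = np.sqrt(1.0 - rho**2)
    for t in range(n_iter):
        theta[0] = rng.normal(rho * theta[1], cond_sd)   # theta1 | theta2
        theta[1] = rng.normal(rho * theta[0], cond_sd)   # theta2 | theta1
        draws[t] = theta
    return draws

draws = gibbs_bivariate_normal(1000)
second_half = draws[500:]
print("empirical correlation ≈", np.corrcoef(second_half.T)[0, 1])   # should be near 0.8
```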
11.2 Metropolis and Metropolis-Hastings algorithms
The Metropolis algorithm
Example. Bivariate unit normal density with normal jumping kernel
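A minimal sketch of random-walk Metropolis for the bivariate unit normal target, with a symmetric normal jumping kernel. The small jumping scale of 0.2 is chosen here in the spirit of the deliberately inefficient chains of Figure 11.1; the book's exact settings may differ:

```python
import numpy as np

rng = np.random.default_rng(6)

def log_target(theta):
    """Log density of the bivariate unit normal, up to an additive constant."""
    return -0.5 * np.dot(theta, theta)

def metropolis(n_iter, start, jump_scale=0.2):
    """Random-walk Metropolis with N(0, jump_scale^2 I) proposals."""
    theta = np.array(start, dtype=float)
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        proposal = theta + rng.normal(scale=jump_scale, size=2)
        log_ratio = log_target(proposal) - log_target(theta)
        if np.log(rng.uniform()) < log_ratio:     # accept with probability min(1, ratio)
            theta = proposal
        draws[t] = theta
    return draws

# Five chains from overdispersed starting points, as in Figure 11.1.
chains = [metropolis(1000, start) for start in [(-4, -4), (-4, 4), (0, 0), (4, -4), (4, 4)]]
print("pooled mean of second halves:", np.mean([c[500:] for c in chains], axis=(0, 1)))
```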
Relation to optimization
Why does the Metropolis algorithm work?
The Metropolis-Hastings algorithm
Relation between the jumping rule and efficiency of simulations
11.3 Using Gibbs and Metropolis as building blocks
Interpretation of the Gibbs sampler as a special case of the Metropolis-Hastings algorithm
Gibbs sampler with approximations
11.4 Inference and assessing convergence
Difficulties of inference from iterative simulation
Discarding early iterations of the simulation runs
Dependence of the iterations in each sequence
Figure 11.3 Examples of two challenges in assessing convergence of iterative simulations. (a) In the left plot, either sequence alone looks stable, but the juxtaposition makes it clear that they have not converged to a common distribution. (b) In the right plot, the two sequences happen to cover a common distribution but neither sequence appears stationary. These graphs demonstrate the need to use between-sequence and also within-sequence information when assessing convergence.
Multiple sequences with overdispersed starting points
Monitoring scalar estimands
Challenges of monitoring convergence: mixing and stationarity
Splitting each saved sequence into two parts
Assessing mixing using between- and within-sequence variances
Table 11.1 95% central intervals and estimated potential scale reduction factors for three scalar summaries of the bivariate normal distribution simulated using a Metropolis algorithm. (For demonstration purposes, the jumping scale of the Metropolis algorithm was purposely set to be inefficient; see Figure 11.1.) Displayed are inferences from the second halves of five parallel sequences, stopping after 50, 500, 2000, and 5000 iterations. The intervals for n = ∞ are taken from the known marginal distributions of these summaries in the target distribution.
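The potential scale reduction factor reported in Table 11.1 compares between- and within-sequence variances, commonly as Rhat = sqrt(((n − 1)/n · W + B/n) / W). A minimal sketch of that computation for m sequences of length n; conventions such as splitting each saved sequence in half are omitted here:

```python
import numpy as np

def potential_scale_reduction(chains):
    """R-hat from an (m, n) array: m sequences of length n for one scalar estimand."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)              # between-sequence variance
    W = chains.var(axis=1, ddof=1).mean()        # within-sequence variance
    var_plus = (n - 1) / n * W + B / n           # marginal posterior variance estimate
    return np.sqrt(var_plus / W)

# Usage: here the "chains" are independent normal draws, so R-hat should be close to 1.
rng = np.random.default_rng(7)
fake_chains = rng.normal(size=(5, 500))
print("R-hat ≈", round(potential_scale_reduction(fake_chains), 3))
```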
Example. Bivariate unit normal density with bivariate normal jumping kernel (continued)
11.5 Effective number of simulation draws
Bounded or long-tailed distributions
Stopping the simulations
Table 11.2 Coagulation time in seconds for blood drawn from 24 animals randomly allocated to four different diets. Different treatments have different numbers of observations because the randomization was unrestricted. From Box, Hunter, and Hunter (1978), who adjusted the data so that the averages are integers, a complication we ignore in our analysis.
11.6 Example: hierarchical normal model
Data from a small experiment
The model
Starting points
Gibbs sampler
Table 11.3 Summary of inference for the coagulation example. Posterior quantiles and estimated potential scale reductions are computed from the second halves of ten Gibbs sampler sequences, each of length 100. Potential scale reductions for σ and τ are computed on the log scale. The hierarchical standard deviation, τ, is estimated less precisely than the unit-level standard deviation, σ, as is typical in hierarchical modeling with a small number of batches.
Numerical results with the coagulation data
The Metropolis algorithm
Metropolis results with the coagulation data
11.7 Bibliographic note
11.8 Exercises
Table 11.4: Quality control measurements from 6 machines in a factory.
Chapter 12 Computationally efficient Markov chain simulation
12.1 Efficient Gibbs samplers
Transformations and reparameterization
Auxiliary variables
Example. Modeling the t distribution as a mixture of normals
Parameter expansion
Example. Fitting the t model (continued)
12.2 Efficient Metropolis jumping rules
Adaptive algorithms
12.3 Further extensions to Gibbs and Metropolis
Slice sampling
Reversible jump sampling for moving between spaces of differing dimensions
Example. Testing a variance component in a logistic regression
Simulated tempering and parallel tempering
Particle filtering, weighting, and genetic algorithms
12.4 Hamiltonian Monte Carlo
The momentum distribution, p(φ)
The three steps of an HMC iteration
Restricted parameters and areas of zero posterior density
Setting the tuning parameters
Varying the tuning parameters during the run
Locally adaptive HMC
Combining HMC with Gibbs sampling
12.5 Hamiltonian dynamics for a simple hierarchical model
Transforming to log τ
12.6 Stan: developing a computing environment
Entering the data and model
Setting tuning parameters in the warm-up phase
No-U-turn sampler
Inferences and postprocessing
12.7 Bibliographic note
12.8 Exercises
Chapter 13 Modal and distributional approximations
13.1 Finding posterior modes
Conditional maximization
Newton's method
Quasi-Newton and conjugate gradient methods
Numerical computation of derivatives
13.2 Boundary-avoiding priors for modal summaries
Posterior modes on the boundary of parameter space
Figure 13.1 Marginal posterior density, p(τ|y), for the standard deviation of the population of school effects θj in the educational testing example. If we were to choose to summarize this distribution by its mode, we would be in the uncomfortable position of setting an estimate on the boundary of parameter space.
Figure 13.2 From a simple one-dimensional hierarchical model with scale parameter 0.5 and data in 10 groups: (a) Sampling distribution of the marginal posterior mode of τ under a uniform prior distribution, based on 1000 simulations of data from the model. (b) 100 simulations of the marginal likelihood, p(y|τ). In this example, the point estimate is noisy and the likelihood function is not very informative about τ.
Figure 13.3 Various possible zero-avoiding prior densities for τ, the group-level standard deviation parameter in the 8 schools example. We prefer the gamma with 2 degrees of freedom, which hits zero at τ = 0 (thus ensuring a nonzero posterior mode) but clears zero for any positive τ. In contrast, the lognormal and inverse-gamma priors effectively shut off τ in some positive region near zero, or rule out high values of τ. These are behaviors we do not want in a default prior distribution. All these priors are intended for use in constructing penalized likelihood (posterior mode) estimates; if we were doing full Bayes and averaging over the posterior distribution of τ, we would be happy with a uniform or half-Cauchy prior density, as discussed in Section 5.7.
Zero-avoiding prior distribution for a group-level variance parameter
Boundary-avoiding prior distribution for a correlation parameter
Figure 13.4 From a simulated varying-intercept, varying-slope hierarchical regression with identity group-level covariance matrix: (a) Sampling distribution of the maximum marginal likelihood estimate of the group-level correlation parameter, based on 1000 simulations of data from the model. (b) 100 simulations of the marginal profile likelihood, Lprofile(ρ|y) = maxτ1,τ2 p(y|τ1,τ2, ρ). In this example, the maximum marginal likelihood estimate is extremely variable and the likelihood function is not very informative about ρ. (In some cases, the profile likelihood for ρ is flat in some places; this occurs when the corresponding estimate of one of the variance parameters (τ1 or τ2) is zero, in which case ρ is not identified.)
Degeneracy-avoiding prior distribution for a covariance matrix
13.3 Normal and related mixture approximations
Fitting multivariate normal densities based on the curvature at the modes
Laplace's method for analytic approximation of integrals
Mixture approximation for multimodal densities
Multivariate t approximation instead of the normal
Sampling from the approximate posterior distributions
13.4 Finding marginal posterior modes using EM
Derivation of the EM and generalized EM algorithms
Implementation of the EM algorithm
Example. Normal distribution with unknown mean and variance and partially conjugate prior distribution
Extensions of the EM algorithm
Supplemented EM and ECM algorithms
Parameter-expanded EM (PX-EM)
13.5 Approximating conditional and marginal posterior densities
Approximating the conditional posterior density, p(γ|φ, y)
Approximating the marginal posterior density, p(φ|y), using an analytic approximation to p(γ|φ, y)
13.6 Example: hierarchical normal model (continued)
Table 13.1 Convergence of stepwise ascent to a joint posterior mode for the coagulation example. The joint posterior density increases at each conditional maximization step, as it should. The posterior mode is in terms of log σ and log τ, but these values are transformed back to the original scale for display in the table.
Crude initial parameter estimates
Conditional maximization to find the joint mode of p(θ, μ, log σ, log τ|y)
Factoring into conditional and marginal posterior densities
Finding the marginal posterior mode of p(μ, log σ, log τ|y) using EM
Table 13.2 Convergence of the EM algorithm to the marginal posterior mode of (μ, log σ, log τ) for the coagulation example. The marginal posterior density increases at each EM iteration, as it should. The posterior mode is in terms of log σ and log τ, but these values are transformed back to the original scale for display in the table.
Table 13.3 Summary of posterior simulations for the coagulation example, based on draws from the normal approximation to p(μ, log σ, log τ|y) and the exact conditional posterior distribution, p(θ|μ, log σ, log τ, y). Compare to joint and marginal modes in Tables 13.1 and 13.2.
Constructing an approximation to the joint posterior distribution
Comparison to other computations
13.7 Variational inference
Minimization of Kullback-Leibler divergence
The class of approximate distributions
The variational Bayes algorithm
Example. Educational testing experiments
Figure 13.5 Progress of variational Bayes for the parameters governing the variational approximation for the hierarchical model for the 8 schools. After a random starting point, the parameters require about 50 iterations to reach approximate convergence. The lower-right graph shows the Kullback-Leibler divergence KL(g||p) (calculated up to an arbitrary additive constant); KL(g||p) is guaranteed to uniformly decrease if the variational algorithm is programmed correctly.
Figure 13.6 Progress of inferences for the effects in schools A, B, and C, for 100 iterations of variational Bayes. The lines and shaded regions show the median, 50% interval, and 90% interval for the variational distribution. Shown to the right of each graph are the corresponding quantiles for the full Bayes inference as computed via simulation.
Proof that each step of variational Bayes decreases the Kullback-Leibler divergence
Model checking
Variational Bayes followed by importance sampling or particle filtering
EM as a special case of variational Bayes
More general forms of variational Bayes
13.8 Expectation propagation
Expectation propagation for logistic regression
Example. Bioassay logistic regression with two coefficients
Figure 13.7 Progress of expectation propagation for a simple logistic regression with intercept and slope parameters. The bivariate normal approximating distribution is characterized by a mean and standard deviation in each dimension and a correlation. The algorithm reached approximate convergence after 4 iterations.
Figure 13.8 (a) Progress of the normal approximating distribution during the iterations of expectation propagation. The small ellipse at the bottom (which is actually a circle if x and y axes are placed on a common scale) is the starting distribution; after a few iterations the algorithm converges. (b) Comparison of the approximating distribution from EP (solid ellipse) to the simple approximation based on the curvature at the posterior mode (dotted ellipse) and the exact posterior density (dashed oval). The exact distribution is not normal so the EP approximation is not perfect, but it is closer than the mode-based approximation. All curves show contour lines for the density at 0.05 times the mode (which for the normal distribution contains approximately 95% of the probability mass; see discussion on page 85).
Extensions of expectation propagation
13.9 Other approximations
Integrated nested Laplace approximation (INLA)
Central composite design integration (CCD)
Approximate Bayesian computation (ABC)
13.10 Unknown normalizing factors
Posterior computations involving an unknown normalizing factor
Bridge and path sampling
13.11 Bibliographic note
13.12 Exercises
Part IV: Regression Models
Tags: Bayesian, data analysis, Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, Donald Rubin