class: center, middle, inverse, title-slide

.title[
# Classical statistical inference: Applications
]
.author[
###
MACS 33000
University of Chicago
]

---

`$$\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\se}{\text{se}} \newcommand{\sd}{\text{sd}} \newcommand{\Cor}{\mathrm{Cor}} \newcommand{\Lagr}{\mathcal{L}} \newcommand{\lagr}{\mathcal{l}}$$`

# Misc:

* Exam TOMORROW!
* All calculators claimed
* Rooms assigned! Have emailed assignment in Canvas!
* Accommodations: I need to hear from YOU if you need accommodations -- I have talked with all the people who I know have accommodations

---
class: middle, inverse

# Review: theoretical side

---

# Understanding expected values of estimators

KEY for homework:

`$$E[\bar X] = \frac{1}{n} \sum_{i=1}^n E[X_i]$$`

--

Translation: *the expected value of an average of random variables is the average of their expected values*

---

# Answering statistical questions

## Descriptive statistics

* describe your data
* don't come to any conclusions
* generally give you a sense of what's there

--

## Inferential statistics

* test a hypothesis
* come to a conclusion
* deal with evidence (sufficient / insufficient) and information

---

# Types of errors: table

Truth / Decision | Reject | Fail to Reject
--------|---------|---------
`\(H_0\)` True | WRONG! (type I error) | YES!
`\(H_0\)` False | YES! | WRONG! (type II error)

---

# Statistical tests:

* One mean
* Two means
* Group mean
* Proportions
* Counts ( `\(\chi^2\)` )
* Multiple means (ANOVA)

---

# What is at stake?

## HOW WEIRD IS IT?

Just need the information and framing of our question.

To answer, we need the sample size to help figure out the appropriate test.

---

# Cool, what test do I use?

* **Counts**: if only two choices, use the z test for proportions; if lots of choices or a table, use `\(\chi^2\)` (easy to calc, somewhat problematic) or Fisher's exact (better but less frequently used)
* **Means**: if two groups and a small sample, t; if two groups and a large-ish sample, z; if multiple groups (and a decent sample size), ANOVA

---

# OK SO HOW DO I DO IT????

--

General format: [(observed mean) - (expected value)] / SE

--

Steps:

* Find sample mean (AKA POINT ESTIMATE)

--

* Find expected value

--

* Find SE

--

* Calculate in formula above

--

* Evaluate / make decision

--

* PROFIT???

---

# Evaluating a test statistic

* Could use tables
* Could use statistical software
* Could use critical value approach: reject when the calculated `\(|z| > z^*\)`

--

| `\(z^*\)` | `\(1-\alpha\)` | `\(\alpha/2\)` | `\(\alpha\)` |
|---|---|---|---|
| `\(\pm 1.65\)` | 90% | 0.05 | 0.10 |
| `\(\pm 1.96\)` | 95% | 0.025 | 0.05 |
| `\(\pm 2.58\)` | 99% | 0.005 | 0.01 |

* `\(z^*\)` is the critical value while `\(z\)` is the test statistic we calculate!
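---

# Critical values and p-values in software

The "statistical software" option from the previous slide, as a minimal sketch in base R (assuming the standard Normal approximation applies; the test statistic `z <- 2.1` is just an illustrative value, not from an example in these slides):

```r
# two-sided critical values z* for common alpha levels
alpha <- c(0.10, 0.05, 0.01)
qnorm(1 - alpha / 2)     # 1.645, 1.96, 2.58 -- matches the table

# two-sided p-value for a calculated test statistic
z <- 2.1
2 * pnorm(-abs(z))       # ~0.036, so reject at alpha = 0.05
```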
---

# Example: cholesterol data

Consider a set of 371 individuals in a health study examining cholesterol levels (in mg/dl). 320 individuals have narrowing of the arteries, while 51 patients have no evidence of heart disease.

**Is the mean cholesterol different in the two groups?**

--

Let the estimated mean cholesterol levels for the first group be `\(\bar{X} = 216.2\)` and for the second group `\(\bar{Y} = 195.3\)`.

Let the estimated standard error for each group be `\(\widehat{\se}(\hat{\mu}_1) = 5.0\)` and `\(\widehat{\se}(\hat{\mu}_2) = 2.4\)`

--

The Wald test statistic is

`$$W = \frac{\hat{\delta} - 0}{\widehat{\se}} = \frac{\bar{X} - \bar{Y}}{\sqrt{\widehat{\se}_1^2 + \widehat{\se}_2^2}} = \frac{216.2 - 195.3}{\sqrt{5^2 + 2.4^2}} = 3.78$$`

--

To compute the `\(p\)`-value, let `\(Z \sim N(0,1)\)` denote a standard Normal random variable. Then

`$$\text{p-value} = \Pr (|Z| > 3.78) = 2 \Pr(Z < -3.78) = 0.0002$$`

---

# Confidence intervals

We can conduct hypothesis testing with confidence intervals as well. From our cholesterol example before:

Let the estimated mean cholesterol levels for the first group be `\(\bar{X} = 216.2\)` and for the second group `\(\bar{Y} = 195.3\)`. Let the estimated standard error for each group be `\(\widehat{\se}(\hat{\mu}_1) = 5.0\)` and `\(\widehat{\se}(\hat{\mu}_2) = 2.4\)`

--

`$$(216.2-195.3) \pm 1.96 \times \sqrt{5^2 + 2.4^2}$$`

--

`$$(10.0, 31.8)$$`

---

# PROPORTIONS!

We won't get into this deeply, but you effectively will always use z for proportions.

The procedure is the same, BUT you may not receive the SE for a distribution.

... WHY??

--

Because we can think of these as a series of Bernoulli trials with a mean of `\(\pi\)` and var of `\(\pi(1-\pi)\)`. We can use this to determine the standard error!

---

# Proportions: example

Suppose we want to test whether support for a ballot measure differs between two groups.

--

Group 1 (`\(n_1 = 100\)`) has `\(\hat{p}_1 = 0.55\)` in favor.

--

Group 2 (`\(n_2 = 120\)`) has `\(\hat{p}_2 = 0.47\)` in favor.

--

Null hypothesis: `\(H_0: p_1 - p_2 = 0\)`

Alternative: `\(H_a: p_1 - p_2 \neq 0\)`

--

Pooled proportion: `\(\hat{p} = \dfrac{55 + 56}{220} \approx 0.505\)` (successes: `\(0.55 \times 100 = 55\)` and `\(0.47 \times 120 \approx 56\)`)

--

Standard error: `\(SE = \sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}\)` `\(= \sqrt{0.505(0.495)\left(\dfrac{1}{100} + \dfrac{1}{120}\right)} \approx 0.068\)`

---

## Ex: cont'd

Standard error: `\(SE = \sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}\)` `\(= \sqrt{0.505(0.495)\left(\dfrac{1}{100} + \dfrac{1}{120}\right)} \approx 0.068\)`

Test statistic: `\(z = \dfrac{0.55 - 0.47}{0.068} \approx 1.18\)`

--

`\(p\)`-value = `\(2\Pr(Z < -1.18) \approx 0.24\)`

**Note: this is kind of a funky way to write this -- I just prefer doing the left-hand side of things in z-tables.**

Conclusion: Not statistically significant at `\(\alpha = 0.05\)`.
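---

# Proportions example in R

A minimal sketch in R of the calculation above. The counts `55` and `56` passed to `prop.test()` are the approximate whole-number counts implied by the stated proportions, so the built-in test differs slightly from the hand calculation:

```r
p1 <- 0.55; n1 <- 100
p2 <- 0.47; n2 <- 120

# pooled proportion under H0: p1 = p2
p_pool <- (p1 * n1 + p2 * n2) / (n1 + n2)

# standard error of the difference under the null
se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z <- (p1 - p2) / se    # ~1.18
2 * pnorm(-abs(z))     # two-sided p-value, ~0.24

# built-in equivalent, without continuity correction
prop.test(x = c(55, 56), n = c(100, 120), correct = FALSE)
```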
---

# Pearson's `\(\chi^2\)` test for multinomial data

* Used for multinomial data
* If `\(X = (X_1, \ldots, X_k)\)` has a multinomial `\((n,p)\)` distribution, then the MLE of `\(p\)` is `\(\hat{p} = (\hat{p}_1, \ldots, \hat{p}_k) = (x_1 / n, \ldots, x_k / n)\)`.

--

Let `\(p_0 = (p_{01}, \ldots, p_{0k})\)` be some fixed vector and suppose we want to test

`$$H_0: p = p_0 \quad \text{versus} \quad H_1: p \neq p_0$$`

--

Pearson's `\(\chi^2\)` statistic is

`$$T = \sum_{j=1}^k \frac{(X_j - np_{0j})^2}{np_{0j}} = \sum_{j=1}^k \frac{(X_j - \E[X_j])^2}{\E[X_j]}$$`

where `\(\E[X_j] = np_{0j}\)` is the expected value under `\(H_0\)`.

---

# Degrees of freedom and t tests

When dealing with small samples, things get a little more complicated, since the reference distribution is no longer *universal* -- it depends on the sample size. Specifically, it depends on the degrees of freedom, which here is `\(n-1\)`, so each sample could have a different number of degrees of freedom.

Otherwise, the process / procedure is exactly the same as before.

---

# Attitudes towards tea

* `\(H_A\)` - In a comparison of individuals, MAPSS students are more likely to favor coffee than tea relative to MACSS and CIR students
* `\(H_0\)` - There is no difference in support between these groups for coffee vs tea. Any difference is the result of random sampling error.

---

# Attitudes towards tea: If program did not matter

| Tea > coffee | MAPSS | MACSS | CIR | Total |
|-------------------|----------|----------|--------------|--------|
| No | 40.8% | 40.8% | 40.8% | 40.8% |
| | (206.45) | (289.68) | (271.32) | (768) |
| Yes | 59.2% | 59.2% | 59.2% | 59.2% |
| | (299.55) | (420.32) | (393.68) | (1113) |
| Total | 26.9% | 37.7% | 35.4% | 100% |
| | (506) | (710) | (665) | (1881) |

---

# Attitudes towards tea: Observed data

| Tea > coffee | MAPSS | MACSS | CIR | Total |
|-------------------|---------|----------|--------------|--------|
| No | 62.6% | 36.6% | 28.7% | 40.8% |
| | (317) | (260) | (191) | (768) |
| Yes | 37.4% | 63.4% | 71.3% | 59.2% |
| | (189) | (450) | (474) | (1113) |
| Total | 26.9% | 37.7% | 35.4% | 100% |
| | (506) | (710) | (665) | (1881) |

---

# Attitudes towards tea

| Tea > coffee | | MAPSS | MACSS | CIR |
|-------------------|---------------|---------|----------|--------------|
| No | Obs Frequency `\(X_j\)` | 317.0 | 260.0 | 191.0 |
| | Exp Frequency `\(\E[X_j]\)` | 206.6 | 289.9 | 271.5 |
| | `\(X_j - \E[X_j]\)` | 110.4 | -29.9 | -80.5 |
| | `\((X_j - \E[X_j])^2\)` | 12188.9 | 893.3 | 6482.7 |
| | `\(\frac{(X_j - \E[X_j])^2}{\E[X_j]}\)` | **59.0** | **3.1** | **23.9** |
| Yes | Obs Frequency `\(X_j\)` | 189.0 | 450.0 | 474.0 |
| | Exp Frequency `\(\E[X_j]\)` | 299.4 | 420.1 | 393.5 |
| | `\(X_j - \E[X_j]\)` | -110.4 | 29.9 | 80.5 |
| | `\((X_j - \E[X_j])^2\)` | 12188.9 | 893.3 | 6482.7 |
| | `\(\frac{(X_j - \E[X_j])^2}{\E[X_j]}\)` | **40.7** | **2.1** | **16.5** |

---

# Attitudes towards tea

Calculating test statistic

* `\(\chi^2=\sum{\frac{(X_j - \E[X_j])^2}{\E[X_j]}}=145.27\)`
* Here, that is the sum of the bolded cells above: `\(59.0+3.1+23.9+40.7+2.1+16.5 \approx 145.3\)`
* `\(\text{Degrees of freedom} = (\text{number of rows}-1)(\text{number of columns}-1)=2\)`

--

Calculating `\(p\)`-value

* `\(\text{p-value} = \Pr (\chi_2^2 > 145.27) \approx 0\)`

--

Thus, we can reject the null that there's no association between our rows and columns.

---

# A brief aside on degrees of freedom

We can think of degrees of freedom as the independent pieces of information we have. If we want to estimate a particular thing, like a mean, then we can let all of our observations vary except the last one, which is pinned down so that we still get the correct mean. We're using up one piece of information for the mean.

This isn't fully technically correct, but it is the correct intuition.
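---

# Tea vs coffee in R

A minimal sketch of the same test in R, using the observed counts from the table (the `chisq.test()` output should match the hand calculation up to rounding):

```r
# observed counts: rows = tea > coffee (No / Yes), columns = program
tea <- matrix(c(317, 260, 191,
                189, 450, 474),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("No", "Yes"), c("MAPSS", "MACSS", "CIR")))

chisq.test(tea)            # X-squared ~145, df = 2, p-value ~0
chisq.test(tea)$expected   # expected counts under independence
```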
---
class: middle, inverse

# Extended example: one-sample, two-sample, and CIs

---

Suppose you are interested in how many hours people spend online per week. In 2014 the average was 11.6 hours in a sample of 1,399 with a sample SD of 15.02. In 2004, the average time was 5.9 with a sample SD of 8.86 and an n of 1,574. In 2012 the average was 10.5 with a sample SD of 14.5 and an n of 1,019. In each case, we can either a) check whether the value from another year falls in the CI for the 2014 value, or b) conduct a hypothesis test to determine whether the difference in values is statistically significant.

--

* First: CI for 2014 (point estimate `\(\pm z^* \times \text{se}\)`)
* Next: Difference of means test ((difference - expected) / se)
  * Look at 2014 vs 2004
  * Look at 2014 vs 2012
* Third: Explore the differences in what we see

---

# CI for 2014:

Suppose you are interested in how many hours people spend online per week. **In 2014 the average was 11.6 hours in a sample of 1,399 with a sample SD of 15.02**. In **2004, the average time was 5.9 with a sample SD of 8.86 and an n of 1,574**. In **2012 the average was 10.5 with a sample SD of 14.5 and an n of 1,019**. In each case, we can either a) check whether the value from another year falls in the CI for the 2014 value, or b) conduct a hypothesis test to determine whether the difference in values is statistically significant.

`$$11.6 \pm 1.96 \frac{15.02}{\sqrt{1399}}$$`

--

`$$(10.8, 12.4)$$`

--

We are 95% confident that the true value in 2014 lies within this interval.

--

Are 5.9 or 10.5 in the interval? What does this tell us?

---

# Test for 2014 vs 2004:

Suppose you are interested in how many hours people spend online per week. In 2014 the average was 11.6 hours in a sample of 1,399 with a sample SD of 15.02. In 2004, the average time was 5.9 with a sample SD of 8.86 and an n of 1,574. In 2012 the average was 10.5 with a sample SD of 14.5 and an n of 1,019. In each case, we can either a) check whether the value from another year falls in the CI for the 2014 value, or b) conduct a hypothesis test to determine whether the difference in values is statistically significant.

--

`$$z = \frac{(\bar x_1 - \bar x_2)-0}{\text{SE}}$$`

where `\(\text{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)` and `\(s_1, s_2\)` are the sample SDs.

--

`$$z = \frac{( 11.6 - 5.9)-0}{\sqrt{\frac{15.02^2}{1399} + \frac{8.86^2}{1574}}}$$`

--

`$$z = 12.4$$`

... this is a lot! So, yeah, we can reject the null.

---

# Test for 2014 vs 2012:

Suppose you are interested in how many hours people spend online per week. In 2014 the average was 11.6 hours in a sample of 1,399 with a sample SD of 15.02. In 2004, the average time was 5.9 with a sample SD of 8.86 and an n of 1,574. In 2012 the average was 10.5 with a sample SD of 14.5 and an n of 1,019. In each case, we can either a) check whether the value from another year falls in the CI for the 2014 value, or b) conduct a hypothesis test to determine whether the difference in values is statistically significant.

--

`$$z = \frac{(\bar x_1 - \bar x_2)-0}{\text{SE}}$$`

where `\(\text{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)` and `\(s_1, s_2\)` are the sample SDs.

--

`$$z = \frac{( 11.6 - 10.5)-0}{\sqrt{\frac{15.02^2}{1399} + \frac{14.5^2}{1019}}}$$`

--

`$$z = 1.81$$`

BUT HOW? WHY is this value smaller than the critical value we would need to reject, even though 10.5 falls outside the 2014 CI?
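---

# Hours online in R

A minimal sketch in R of the 2014 confidence interval and the 2014 vs 2012 test, using the same Normal approximation as the slides (the p-value line is extra, for reference):

```r
# summary statistics from the slides
xbar_14 <- 11.6; s_14 <- 15.02; n_14 <- 1399
xbar_12 <- 10.5; s_12 <- 14.5;  n_12 <- 1019

# 95% CI for the 2014 mean
xbar_14 + c(-1, 1) * qnorm(0.975) * s_14 / sqrt(n_14)   # (10.8, 12.4)

# two-sample z test, 2014 vs 2012
se <- sqrt(s_14^2 / n_14 + s_12^2 / n_12)
z  <- (xbar_14 - xbar_12) / se                          # ~1.81
2 * pnorm(-abs(z))                                      # ~0.07, fail to reject at 0.05
```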
---
class: middle, inverse

# A TL;DR of Bayesian v Frequentist statistics

---

<img src="https://imgs.xkcd.com/comics/frequentists_vs_bayesians_2x.png" width="390px" style="display: block; margin: auto;" />

---

# Bayesians vs Frequentists

* Frequentists assume the truth is OUT THERE and we are trying to measure it
* Bayesians use the data as our source of truth

--

* Frequentists live in the land of numbers (calculating probabilities, etc.)
* Bayesians rely more on distributions (e.g. what is the data generating process, rather than focusing only on probabilities)

---

# Bayesian Inference

* Works how you probably *thought* frequentist statistics works

--

* Uses information to provide context for how to understand data

--

* We can make probability statements about parameters, even though they are fixed constants

--

* We make inferences about a parameter `\(\theta\)` by producing a probability distribution for `\(\theta\)`

--

Deals with prior beliefs and how our confidence changes as we gain more information (can think of it as how we deal with uncertainty)

---

# Bayesian inference

1. Choose a prior distribution `\(f(\theta)\)`
1. Choose a statistical model `\(f(x|\theta)\)`
    * `\(f(x|\theta) \neq f(x; \theta)\)`
1. Calculate the posterior distribution `\(f(\theta | X_1, \ldots, X_n)\)`

---

# Bayesian inference

* Suppose that `\(\theta\)` is discrete and that there is a single, discrete observation `\(X\)`
* `\(\Theta\)` is a random variable

--

##### Discrete random variable

$$
`\begin{align}
\Pr(\Theta = \theta | X = x) &= \frac{\Pr(X = x, \Theta = \theta)}{\Pr(X = x)} \\
&= \frac{\Pr(X = x | \Theta = \theta) \Pr(\Theta = \theta)}{\sum_\theta \Pr (X = x| \Theta = \theta) \Pr (\Theta = \theta)}
\end{align}`
$$

##### Continuous random variable

`$$f(\theta | x) = \frac{f(x | \theta) f(\theta)}{\int f(x | \theta) f(\theta) d\theta}$$`

---

# Critique of Bayesian inference

1. The subjective prior is subjective

--

1. Probabilities on hypotheses are wrong. There is only one outcome

--

1. For many parametric models with large samples, Bayesian and frequentist methods give approximately the same inferences

--

1. Bayesian inference depends entirely on the likelihood function

---

# Defense of Bayesian inference

1. The probability of hypotheses is exactly what we need to make decisions

--

1. Bayes' theorem is logically rigorous (once we obtain a prior)

--

1. By testing different priors we can see how sensitive our results are to the choice of prior

--

1. It is easy to communicate a result framed in terms of probabilities of hypotheses

--

1. Priors can be defended based on the assumptions made to arrive at them

--

1. Evidence derived from the data is independent of notions about "data more extreme" that depend on the exact experimental setup

--

1. Data can be used as it comes in. We don't have to wait for every contingency to be planned for ahead of time

---

# Recap

* Estimators: generate our point estimate
* Point estimates: best guess
* Testing: HOW WEIRD IS IT
* Evaluation:
  * Find sample mean (AKA POINT ESTIMATE)
  * Find expected value
  * Find SE
  * Calculate in formula above
  * Evaluate / make decision
* Bayes: different approach to statistics vs frequentist

--

### EXAM TOMORROW! GOOD LUCK!!