class: center, middle, inverse, title-slide .title[ # Classical statistical inference ] .author[ ###
MACS 33000
University of Chicago ] --- `$$\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\se}{\text{se}} \newcommand{\sd}{\text{sd}} \newcommand{\Cor}{\mathrm{Cor}} \newcommand{\Lagr}{\mathcal{L}} \newcommand{\lagr}{\mathcal{l}}$$` # Misc * ALMOST THERE! * MPCS placement exam * Tomorrow: Review! * TA session(s) * Name, program, pronouns, fave thing about Chicago so far! --- # Learning objectives * Define classical statistical inference * Summarize core concepts of point estimates * Define maximum likelihood estimation (MLE) * Review the properties of the maximum likelihood estimator * Demonstrate MLE for basic estimators * Identify confidence intervals * Define hypothesis testing and `\(p\)`-value * Explain the Wald test * Summarize the `\(\chi^2\)` test of significance --- # Statistical inference Process of using data to infer the probability distribution/random variable that generated the data -- Given a sample `\(X_1, \ldots, X_n \sim F\)`, how do we infer `\(F\)`? --- class: middle # Parametric models and functional form: * Parametric models make assumptions about parameters / the underlying functional form (e.g. a linear relationship between variables) --- # Parametric models * Statistical model `\(\xi\)` * Parametric model is a finite set `\(\xi\)` * Examples of parametric models `$$\xi \equiv f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left[ -\frac{1}{2\sigma^2} (x - \mu)^2 \right], \quad \mu \in \Re, \sigma > 0$$` * General form `$$\xi \equiv f(x; \theta) : \theta \in \Theta$$` * Nuisance parameters --- # Examples of parametric models #### One-dimensional parametric estimation Let `\(X_1, \ldots, X_n\)` be independent observations drawn from a Bernoulli random variable with probability `\(\pi\)` of success -- The problem is to estimate the parameter `\(\pi\)` -- #### Two-dimensional parametric estimation Suppose that `\(X_1, \ldots, X_n \sim F\)` and we assume that the PDF `\(f \in \xi\)` where `$$\xi \equiv f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left[ -\frac{1}{2\sigma^2} (x - \mu)^2 \right], \quad \mu \in \Re, \sigma > 0$$` -- * Estimate `\(\mu, \sigma^2\)` * Estimate `\(\mu\)` alone --- # Point estimates A single "best guess" of some quantity of interest * Parameter in a parametric model * CDF `\(F\)` * PDF `\(f\)` * Regression function `\(r\)` * Prediction for a future value `\(Y\)` of some random variable -- Denote a point estimate of `\(\theta\)` by `\(\hat{\theta}\)` or `\(\hat{\theta}_n\)` -- * `\(\theta\)` is a fixed, unknown quantity * `\(\hat{\theta}\)` is a random variable -- Let `\(X_1, \ldots, X_n\)` be `\(n\)` IID data points from some distribution `\(F\)`. A point estimator `\(\hat{\theta}_n\)` of a parameter `\(\theta\)` is some function of `\(X_1, \ldots, X_n\)`: `$$\hat{\theta}_n = g(X_1, \ldots, X_n)$$` --- # Properties of point estimates <img src="https://www.bluey.tv/wp-content/themes/bbc-bluey/assets/images/characters/bluey-star@2x.png" width="300px" style="display: block; margin: auto;" /> --- # Properties of point estimates: BLUE (OLS) -- ## Best -- ## Linear -- ## Unbiased -- ## Estimate --- # Properties of point estimates ##### Bias: does it give you what you expect? `$$\text{bias}(\hat{\theta}_n) = \E_\theta (\hat{\theta_n}) - \theta$$` -- * `\(\E (\hat{\theta_n}) - \theta = 0\)` * Importance of unbiased estimators --- # Properties of point estimates ##### Consistency: does it get better with sample size? As the number of observations `\(n\)` increases, the estimator converges towards the true parameter `\(\theta\)` -- ##### Efficiency: does it have small variance? --- # Estimators: best guesses Think of an estimator for WHAT your best will be. Here are some options (below) and how well they fare on our top 3 qualities we desire. -- | Estimator | Unbiased | Consistent | Efficient | |------- |-------------|-------------|------------| | `\(\bar{x}\)` | `\(\checkmark\)` | `\(\checkmark\)` | `\(\checkmark\)` | | `\(x_i\)` | `\(\checkmark\)` | `\(\times\)` | `\(\checkmark\)` | | `\(4\)` | `\(\times\)` | `\(\times\)` | `\(\checkmark\)` | | median (normal dist) | `\(\checkmark\)` | `\(\checkmark\)` | `\(\times\)`| --- # Properties of point estimates ##### Sampling distribution * Distribution of `\(\hat{\theta}_n\)` * Standard error of `\(\hat{\theta}_n\)` `$$\se = \sd(\hat{\theta}_n) = \sqrt{\Var (\hat{\theta}_n)}$$` * Estimating the standard error `\(\widehat{\se}\)` --- # Mean squared error $$ `\begin{align} \text{MSE} &= \E_\theta [(\hat{\theta}_n - \theta)^2] \\ &= \text{bias}^2(\hat{\theta}_n) + \Var_\theta (\hat{\theta}_n) \end{align}` $$ * `\(\E_\theta (\cdot)\)` with respect to the distribution `\(f(x_1, \ldots, x_n; \theta)\)` -- `$$\frac{\hat{\theta}_n - \theta}{\se} \leadsto N(0,1)$$` --- # Understanding expected values of estimators KEY for homework: `$$E[\bar X] = \frac{1}{n} \sum_{i=1}^n E[X_i]$$` -- * Translation: *the expected value of an average of a random variable ( `\(\bar{X}\)` ) equals the average of the sum of the expected values for our random variable* --- # Bernoulli random variable Bernoulli refresher: one-shot event with probability `\(\pi\)` of success. -- Let `\(X_1, \ldots, X_n ~ \text{Bernoulli}(\pi)\)` and let `\(\hat{\pi}_n = \frac{1}{n} \sum_{i=1}^n X_i\)`. Then `$$\E(\hat{\pi}_n) = \frac{1}{n} \sum_{i=1}^n \E(X_i) = \pi$$` We can then compare our expected value from the distribution `\(\pi\)`, to what we obtained here, to verify they are equal. -- We are adding up `\(n\)` `\(\hat{\pi}\)` and averaging them, where each of these *is* our `\(\pi\)`. -- They are and thus the two are equal; so `\(\hat{\pi}_n\)` is unbiased --- # Bernoulli random variable The standard error is `$$\se = \sqrt{\Var (\hat{\pi}_n)}$$` -- We can think of this as: `\(\frac{Var(\pi_n)}{n}\)` HOW?? Recall `\(E(\hat{\pi_n}) = \frac{1}{n} \sum E(X_i)\)`, so `\(Var(\hat{\pi_n})=Var(\frac{1}{n} \sum E(X_i))\)`. Then, `\(Var(\hat{\pi_n})=(\frac{1}{n})^2 \text{ }Var( \sum E(X_i))\)`. -- So, we can simplify to get `\(Var(\hat{\pi_n})=(\frac{1}{n})^2 \text{ } \sum Var(X_i)\)`. We can re-write this as `\(\frac{n}{n^2} Var(X_i)=\frac{Var(X_i)}{n} = \frac{\pi(1-\pi)}{n}\)`. -- We are adding up all the variances across our random variables and averaging them. -- Therefore, we get: `\(\se= \sqrt{\frac{\pi (1 - \pi)}{n}}\)` -- `$$\widehat{\se} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}$$` -- $$ `\begin{align} \text{bias}(\hat{\pi}_n) &= \E_\pi (\hat{\pi}) - \pi \\ &= \pi - \pi \\ &= 0 \end{align}` $$ and `$$\se = \sqrt{\frac{\pi (1 - \pi)}{n}} \rightarrow 0$$` -- `\(\hat{\pi}_n\)` is a consistent estimator of `\(\pi\)` --- ## Estimators: recap * Estimators are WHAT/HOW we come to have a best guess of our thing we care about -- * We need a way to think about / evaluate if they are good or not -- * That way is called 'BLUE' (best, linear, unbiased) -- * Using this is how we decided to use the sample mean as the best guess of the population mean (and other similar decisions) --- class: center, middle, inverse # Maximum likelihood --- # Maximum likelihood Let `\(X_1, \ldots, X_n\)` be IID with PDF `\(f(x; \theta)\)`. The **likelihood function** is defined by `$$\Lagr_n(\theta) = \prod_{i=1}^n f(X_i; \theta)$$` -- The **log-likelihood function** is defined by `\(\lagr_n (\theta) = \log \Lagr_n(\theta)\)` -- * Likelihood function is the joint density of the data * Not the same as a PDF - it is a likelihood function -- The **maximum likelihood estimator** `\(\hat{\theta}_n\)` is the value of `\(\theta\)` that maximizes `\(\Lagr_n(\theta)\)` -- `$$\max (\Lagr_n(\theta)) \equiv \max (\lagr_n(\theta))$$` --- # Bernoulli random variable Suppose that `\(X_1, \ldots, X_n \sim \text{Bernoulli} (\pi)\)`. The probability function is `$$f(x; \pi) = \pi^x (1 - \pi)^{1 - x}$$` for `\(x = 0,1\)`. The unknown parameter is `\(\pi\)`. -- $$ `\begin{align} \Lagr_n(\pi) &= \prod_{i=1}^n f(X_i; \pi) \\ &= \prod_{i=1}^n \pi^{X_i} (1 - \pi)^{1 - X_i} \\ &= \pi^S (1 - \pi)^{n - S} \end{align}` $$ where `\(S = \sum_{i} X_i\)`. -- The log-likelihood function is therefore `$$\lagr_n (\pi) = S \log(\pi) + (n - S) \log(1 - \pi)$$` -- 1. Take the derivative of `\(\lagr_n (\pi)\)` 1. Set it equal to 0 1. Solve to find `\(\hat{\pi}_n = \frac{S}{n}\)` --- # Bernoulli random variable We're going to solve for the 'best' estimator, `\(\hat{\pi}\)` `$$\lagr_n (\pi) = S \log(\pi) + (n - S) \log(1 - \pi)$$` -- `$$\frac{d\lagr_n (\hat{\pi})}{d \hat{\pi}} = \frac{S}{\hat{\pi}} - \frac{(n - S)}{1 - \hat{\pi}} = 0$$` -- `$$\frac{S}{\hat{\pi}} = \frac{(n - S)}{1 - \hat{\pi}}$$` -- `\(S(1-\hat{\pi}) = \hat{\pi}(n - S)\)` and `\(S-S\hat{\pi} = \hat{\pi}(n - S)\)` We can continue simplifying to get: -- `\(\hat{\pi} n - S \hat{\pi} + S\hat{\pi} = S\)` so -- `$$\hat{\pi} = \frac{S}{n}$$` --- # Bernoulli random variable <img src="13-frequentist-inference_files/figure-html/loglik-bern-1.png" width="864" style="display: block; margin: auto;" /> --- # Log-likelihood close-up <img src="13-frequentist-inference_files/figure-html/unnamed-chunk-3-1.png" width="864" style="display: block; margin: auto;" /> --- # Properties of maximum likelihood estimators 1. Consistency 1. Equivariant 1. Asymptotically Normal 1. Asymptotically optimal or efficient -- * Large sample sizes * Smooth conditions for `\(f(x; \theta)\)` --- # Properties of maximum likelihood estimators ##### Consistency `$$\hat{\theta}_n \xrightarrow{P} \theta_*$$` Meaning: it converges in probability to the true value -- ##### Equivariance If `\(\hat{\theta}_n\)` is the MLE of `\(\theta\)`, then `\(g(\hat{\theta}_n)\)` is the MLE of `\(g(\theta)\)` -- ##### Asymptotic normality Let `\(\se = \sqrt{\Var (\hat{\sigma}_n)}\)` `$$\frac{\hat{\theta}_n - \theta_*}{\se} \leadsto N(0,1)$$` -- `$$\frac{\hat{\theta}_n - \theta_*}{\widehat{\se}} \leadsto N(0,1)$$` --- # Properties of maximum likelihood estimators ##### Optimality Suppose that `\(X_1, \ldots, X_n \sim N(\theta, \sigma^2)\)`. The MLE is `\(\hat{\sigma}_n = \bar{X}_n\)` -- Another reasonable estimator of `\(\theta\)` is the sample median `\(\tilde{\theta}_n\)`. The MLE satisfies `$$\sqrt{n} (\hat{\theta}_n - \theta) \leadsto N(0, \sigma^2)$$` -- Median satisfies `$$\sqrt{n} (\tilde{\theta}_n - \theta) \leadsto N \left(0, \sigma^2 \frac{\pi}{2} \right)$$` -- * The MLE has the smallest (asymptotic) variance * No other estimator produces smaller variance --- # MLE of the mean of the Normal variable `$$\Pr(X_i = x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[\frac{(X_i - \mu)^2}{2\sigma^2}\right]$$` <img src="13-frequentist-inference_files/figure-html/plot-normal-1.png" width="864" style="display: block; margin: auto;" /> --- # MLE of the mean of the Normal variable $$ `\begin{align} \lagr_n(\mu, \sigma^2 | X) &= \log \prod_{i = 1}^{N}{\frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[\frac{(X_i - \mu)^2}{2\sigma^2}\right]} \\ &= \sum_{i=1}^{N}{\log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[\frac{(X_i - \mu)^2}{2\sigma^2}\right]\right)} \\ &= -\frac{N}{2} \log(2\pi) - \left[ \sum_{i = 1}^{N} \log{\sigma^2 - \frac{1}{2\sigma^2}} (X_i - \mu)^2 \right] \end{align}` $$ --- # MLE of the mean of the Normal variable <table> <caption>Salaries of assistant professors</caption> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> salary </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 60 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 55 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 65 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 50 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 70 </td> </tr> </tbody> </table> --- # MLE of the mean of the Normal variable <img src="13-frequentist-inference_files/figure-html/loglik-normal-1.png" width="864" style="display: block; margin: auto;" /> --- class: middle # Confidence: Precision vs accuracy --- # Confidence sets A ( `\(1 - \alpha\)` ) **confidence interval** for a parameter `\(\theta\)` is an interval `\(C_n = (a,b)\)` where `\(a = a(X_1, \ldots, X_n)\)` and `\(b = b(X_1, \ldots, X_n)\)` are functions of the data such that `$$\Pr_{\theta} (\theta \in C_n) \geq 1 - \alpha, \quad \forall \, \theta \in \Theta$$` -- `\((a,b)\)` traps `\(\theta\)` with probability `\(1- \alpha\)` -- Call `\(1 - \alpha\)` the **coverage** of the confidence interval --- # Caution interpreting CIs * `\(C_n\)` is random and `\(\theta\)` is fixed * Common value of `\(\alpha = 0.05\)` * A confidence interval is not a probability statement about `\(\theta\)` * `\(\theta\)` is a fixed quantity, not a random variable * `\(\theta\)` is or is not in the interval with probability `\(1\)` -- > On day 1, you collect data and construct a 95% confidence interval for a parameter `\(\theta_1\)`. On day 2, you collect new data and construct a 95% confidence interval for a parameter `\(\theta_2\)`. You continue this way constructing confidence intervals for a sequence of unrelated parameters `\(\theta_1, \theta_2, \ldots\)`. Then 95% of your intervals will trap the true parameter value. --- # Constructing confidence intervals Suppose that `\(\hat{\theta}_n \approx N(\theta, \widehat{\se}^2)\)`. Let `\(\Phi\)` be the CDF of a standard Normal distribution and let `$$z_{\frac{\alpha}{2}} = \Phi^{-1} \left(1 - \frac{\alpha}{2} \right)$$` -- `$$\Pr (Z > \frac{\alpha}{2}) = \frac{\alpha}{2}$$` and `$$\Pr (-z_{\frac{\alpha}{2}} \leq Z \leq z_{\frac{\alpha}{2}}) = 1 - \alpha$$` where `\(Z \sim N(0,1)\)` -- `$$C_n = (\hat{\theta}_n - z_{\frac{\alpha}{2}} \widehat{\se}, \hat{\theta}_n + z_{\frac{\alpha}{2}} \widehat{\se})$$` -- $$ `\begin{align} \Pr_\theta (\theta \in C_n) &= \Pr_\theta (\hat{\theta}_n - z_{\frac{\alpha}{2}} \widehat{\se} < \theta < \hat{\theta}_n + z_{\frac{\alpha}{2}} \widehat{\se}) \\ &= \Pr_\theta (- z_{\frac{\alpha}{2}} < \frac{\hat{\theta}_n - \theta}{\widehat{\se}} < z_{\frac{\alpha}{2}}) \\ &\rightarrow \Pr ( - z_{\frac{\alpha}{2}} < Z < z_{\frac{\alpha}{2}}) \\ &= 1 - \alpha \end{align}` $$ --- # CI: when to use Confidence intervals give you information about your point estimate. You KNOW your point estimate is likely wrong. (very precise but not very accurate). You can give a RANGE around your CI to improve your accuracy, at the cost of precision. -- There is a delicate balance between accuracy and precision: for accuracy, you want to be as accurate as possible, but this means you have to have a wide range. We have conventionally compromised at 95% for the 'right' balance between the two. --- class: middle # Hypothesis testing: ## Am I right or am I right? --- # Hypothesis testing * Null hypothesis * Ask if the data **provide sufficient evidence** to reject the theory -- Formally, partition the parameter space `\(\Theta\)` into two disjoint sets `\(\Theta_0\)` and `\(\Theta_1\)` and that we wish to test `$$H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \in \Theta_1$$` * `\(H_0\)` * `\(H_1\)` -- Let `\(X\)` be a random variable and let `\(\chi\)` be the range of `\(X\)`. Test a hypothesis by finding an appropriate subset of outcomes `\(R \subset \chi\)` called the **rejection region** -- If `\(X \subset R\)` we reject the null hypothesis, otherwise we do not reject the null hypothesis `$$R = \left\{ x: T(x) > c \right\}$$` -- * `\(T\)` * `\(c\)` --- # Types of errors: table Truth / Decision | Reject | Fail to Reject --------|---------|--------- `\(H_0\)` True | ? | ? `\(H_0\)` False | ? | ? --- # Types of errors: table Truth / Decision | Reject | Fail to Reject --------|---------|--------- `\(H_0\)` True | ? | YES! `\(H_0\)` False | YES! | ? --- # Types of errors: table Truth / Decision | Reject | Fail to Reject --------|---------|--------- `\(H_0\)` True | WRONG! | YES! | type I error | `\(H_0\)` False | YES! | WRONG! || type II error --- # Positives and negative Example: Suppose our null is that people support a six-hour exam. We need to understand if we have sufficient evidence against this or not. -- Truth / Decision | Reject | Fail to Reject --------|---------|--------- `\(H_0\)` T: YAY EXAM | WRONG! | YES! | type I error | `\(H_0\)` F: BOO EXAM | YES! | WRONG! || type II error --- # Types of errors: visual <img src="https://i.stack.imgur.com/Kq0OH.jpg" style="display: block; margin: auto;" /> --- # Power function Power - probability of correctly rejecting the null hypothesis. Another way to consider this is to imagine you want to know if you COULD reject the null if it were false. You might now have enough POWER (either your sample size is too small and/or your null is slightly wrong). -- The **power function** of a test with rejection region `\(R\)` is defined by `$$\beta(\theta) = \Pr_\theta (X \in R)$$` -- The size of a test is defined to be `$$\alpha = \text{sup}_{\theta \in \Theta_0} \beta(\theta)$$` -- A test is said to have **level** `\(\alpha\)` if its size is less than or equal to `\(\alpha\)` --- class: center, middle ## Setting up hypothesis testing --- # Sided tests You are positing 'where' the true value is (larger or smaller) vs 'not here' -- ### Two-sided test `$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0$$` ### One-sided test `$$H_0: \theta \leq \theta_0 \quad \text{versus} \quad H_1: \theta > \theta_0$$` or `$$H_0: \theta \geq \theta_0 \quad \text{versus} \quad H_1: \theta < \theta_0$$` --- # Example hypothesis test Let `\(X_1, \ldots, X_n \sim N(\mu, \sigma^2)\)` where `\(\sigma\)` is known Test `\(H_0: \mu \leq 0\)` versus `\(H_1: \mu > 0\)` `$$\Theta_0 = (-\infty, 0], \quad\Theta_1 = (0, \infty]$$` -- Consider the test `$$\text{reject } H_0 \text{ if } T>c$$` where `\(T = \bar{X}\)`. The rejection region is `$$R = \left\{(x_1, \ldots, x_n): T(x_1, \ldots, x_n) > c \right\}$$` --- # Example hypothesis test Let `\(Z\)` denote the standard Normal random variable. The power function is $$ `\begin{align} \beta(\mu) &= \Pr_\mu (\bar{X} > c) \\ &= \Pr_\mu \left(\frac{\sqrt{n} (\bar{X} - \mu)}{\sigma} > \frac{\sqrt{n} (c - \mu)}{\sigma} \right) \\ &= \Pr_\mu \left(Z > \frac{\sqrt{n} (c - \mu)}{\sigma} \right) \\ &= 1 - \Phi \left( \frac{\sqrt{n} (c - \mu)}{\sigma} \right) \end{align}` $$ --- # Example hypothesis test <img src="13-frequentist-inference_files/figure-html/normal-cdf-1.png" width="864" style="display: block; margin: auto;" /> --- # Example hypothesis test `$$\alpha = \text{sup}_{\mu \leq 0} \beta(\mu) = \beta(0) = 1 - \Phi \left( \frac{\sqrt{n} (c)}{\sigma} \right)$$` For a size `\(\alpha\)` test, we set this equal to `\(\alpha\)` and solve for `\(c\)` to get `$$c = \frac{\sigma \Phi^{-1} (1 - \alpha)}{\sqrt{n}}$$` -- We reject `\(H_0\)` when `$$\bar{X} > \frac{\sigma \Phi^{-1} (1 - \alpha)}{\sqrt{n}}$$` -- Equivalently, we reject when `$$\frac{\sqrt{n}(\bar{X} - 0)}{\sigma} > z_\alpha$$` where `\(z_\alpha = \Phi^{-1} (1 - \alpha)\)`. --- # Wald test Let `\(\theta\)` be a scalar parameter, let `\(\hat{\theta}\)` be an estimate of `\(\theta\)`, and let `\(\widehat{\se}\)` be the estimated standard error of `\(\hat{\theta}\)`. Consider testing `$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0$$` -- Assume that `\(\hat{\theta}\)` is asymptotically Normal: `$$\frac{\hat{\theta} - \theta_0}{\widehat{\se}} \leadsto N(0,1)$$` -- The size `\(\alpha\)` Wald test is: reject `\(H_0\)` when `\(|W| > z_{\alpha / 2}\)` where `$$W = \frac{\hat{\theta} - \theta_0}{\widehat{\se}}$$` --- # Power of the Wald test Suppose the true value of `\(\theta\)` is `\(\theta_* \neq \theta_0\)`. The power `\(\beta(\theta_*)\)` is given (approximately) by `$$1 - \Phi \left( \frac{\hat{\theta} - \theta_0}{\widehat{\se}} + z_{\alpha/2} \right) + \Phi \left( \frac{\hat{\theta} - \theta_0}{\widehat{\se}} - z_{\alpha/2} \right)$$` -- * The power is large if `\(\theta_*\)` is far from `\(\theta_0\)` * The power is large if the sample size is large --- class: center, middle, inverse ## Details: hypothesis testing --- # Example: comparing two means Let `\(X_1, \ldots, X_m\)` and `\(Y_1, \ldots, Y_n\)` be two independent samples from populations with means `\(\mu_1, \mu_2\)` respectively. Test the null hypothesis that `\(\mu_1 = \mu_2\)` `$$H_0: \delta = 0 \quad \text{versus} \quad H_1: \delta \neq 0$$` where `\(\delta = \mu_1 - \mu_2\)` -- The estimate of `\(\delta\)` is `\(\hat{\delta} = \bar{X} - \bar{Y}\)` with estimated standard error `$$\widehat{\se} = \sqrt{\frac{s_1^2}{m} + \frac{s_2^2}{n}}$$` where `\(s_1^2\)` and `\(s_2^2\)` are the sample variances -- The size `\(\alpha\)` Wald test rejects `\(H_0\)` when `\(|W| > z_{\alpha / 2}\)` where `$$W = \frac{\hat{\delta} - 0}{\widehat{\se}} = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_1^2}{m} + \frac{s_2^2}{n}}}$$` --- # `\(t\)`-test <img src="13-frequentist-inference_files/figure-html/t-dist-1.gif" style="display: block; margin: auto;" /> --- # Statistical vs. scientific significance <img src="http://www.azquotes.com/picture-quotes/quote-the-absence-of-evidence-is-not-the-evidence-of-absence-carl-sagan-43-51-12.jpg" style="display: block; margin: auto;" /> --- # Understanding `\(p\)` -values Generally, if the test rejects at level `\(\alpha\)` it will also reject at level `\(\alpha' > \alpha\)` Smallest `\(\alpha\)` at which the test rejects - the `\(p\)`-value -- The smaller the `\(p\)`-value, the stronger the evidence against `\(H_0\)` --- # Interpreting `\(p\)`-values `\(p\)`-value | evidence -----------|----------------------------------------- `\(< 0.01\)` | very strong evidence against `\(H_0\)` `\(0.01 - 0.05\)`| strong evidence against `\(H_0\)` `\(0.05 - 0.10\)`| weak evidence against `\(H_0\)` `\(> 0.1\)` | little or no evidence against `\(H_0\)` -- <img src="https://imgs.xkcd.com/comics/p_values_2x.png" width="220px" style="display: block; margin: auto;" /> --- # Calculating `\(p\)`-values Suppose that the size `\(\alpha\)` test is of the form `$$\text{reject } H_0 \text{ if and only if } T(X_n) \geq c_\alpha$$` -- Then, `$$\text{p-value} = \text{sup}_{\theta \in \Theta_0} \Pr_\theta (T(X^n) \geq T(x^n))$$` where `\(x^n\)` is the observed value of `\(X^n\)`. -- If `\(\Theta_0 = \{ \theta_0 \}\)` then `$$\text{p-value} = \Pr_{\theta_0} (T(X^n) \geq T(x^n))$$` --- # `\(p\)`-value for Wald test Let `$$w = \frac{\hat{\theta} - \theta_0}{\widehat{\se}}$$` denote the observed value of the Wald statistic `\(W\)`. The `\(p\)`-value is given by `$$\text{p-value} = \Pr_{\theta_0} (|W| > |w|) \approx \Pr (|Z| > |w| = 2 \Phi(-|w|)$$` where `\(Z \sim N(0,1)\)` --- # `\(p\)`-value for Wald test <img src="13-frequentist-inference_files/figure-html/wald-p-val-1.png" width="864" style="display: block; margin: auto;" /> --- # Recap * Estimators: generate our point estimate * Point estimates: best guess * Testing: HOW WEIRD IS IT * Evaluation: * Find sample mean (AKA POINT ESTIMATE) * Find expected value * Find SE * Calculate in formula above * Evaluate / make decision * PREVIEW: Bayes: different approach to statistics vs frequentist