Classical statistical inference

class: center, middle, inverse, title-slide

.title[
# Classical statistical inference
]
.author[
### <a href="https://jmclip.github.io/MACSS_math_camp/">MACS 33000</a> <br /> University of Chicago
]

---

`$$\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\se}{\text{se}} \newcommand{\sd}{\text{sd}} \newcommand{\Cor}{\mathrm{Cor}} \newcommand{\Lagr}{\mathcal{L}} \newcommand{\lagr}{\mathcal{l}}$$`
# Misc

* ALMOST THERE!
* Tomorrow: Review!
  * TA session(s)
* Calculators!
* Recordings!

---

# Learning objectives

* Define classical statistical inference
* Summarize core concepts of point estimates
* Define estimators
* Identify confidence intervals
* Define hypothesis testing and `$p$`-value
* Explain the Wald test
* Summarize the `$\chi^2$` test of significance

---

# Statistical inference

Process of using data to infer the probability distribution/random variable that generated the data

Given a sample `$X_1, \ldots, X_n \sim F$`, how do we infer `$F$`?

---
class: middle

# Parametric models and functional form:

* Parametric models make assumptions about parameters / the underlying functional form (e.g. a linear relationship between variables)

---

# Parametric models

* Statistical model `$\xi$`
* Parametric model: a set `$\xi$` that can be indexed by a finite number of parameters
* Examples of parametric models

`$$\xi \equiv f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left[ -\frac{1}{2\sigma^2} (x - \mu)^2 \right], \quad \mu \in \Re, \sigma > 0$$`

* General form

`$$\xi \equiv f(x; \theta) : \theta \in \Theta$$`

* Nuisance parameters

---

# Examples of parametric models

#### One-dimensional parametric estimation

Let `$X_1, \ldots, X_n$` be independent observations drawn from a Bernoulli random variable with probability `$\pi$` of success

The problem is to estimate the parameter `$\pi$`

#### Two-dimensional parametric estimation

Suppose that `$X_1, \ldots, X_n \sim F$` and we assume that the PDF `$f \in \xi$` where

`$$\xi \equiv f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left[ -\frac{1}{2\sigma^2} (x - \mu)^2 \right], \quad \mu \in \Re, \sigma > 0$$`

* Estimate `$\mu, \sigma^2$`
* Estimate `$\mu$` alone

---

# Point estimates from point estimatORS

A single "best guess" of some quantity of interest

* Parameter in a parametric model
* CDF `$F$`
* PDF `$f$`
* Regression function `$r$`
* Prediction for a future value `$Y$` of some random variable

Denote a point estimate of `$\theta$` by `$\hat{\theta}$` or `$\hat{\theta}_n$`

* `$\theta$` is a fixed, unknown quantity
* `$\hat{\theta}$` is a random variable

Let `$X_1, \ldots, X_n$` be `$n$` IID data points from some distribution `$F$`. A point estimator `$\hat{\theta}_n$` of a parameter `$\theta$` is some function of `$X_1, \ldots, X_n$`:

`$$\hat{\theta}_n = g(X_1, \ldots, X_n)$$`

---

# Properties of point estimators

---

# Properties of point estimators: BLUE (OLS)
--

## Best

## Linear

--
## Unbiased

--
## Estimator

---

# Properties of point estimators

##### Bias: does it give you what you expect?

`$$\text{bias}(\hat{\theta}_n) = \E_\theta (\hat{\theta}_n) - \theta$$`

* Unbiased if `$\E (\hat{\theta}_n) - \theta = 0$`
* Importance of unbiased estimators

---

# Properties of point estimators 
##### Consistency: does it get better with sample size?

As the number of observations `$n$` increases, the estimator converges towards the true parameter `$\theta$`

##### Efficiency: does it have small variance?

---
# Estimators: best guesses

Think of an estimator as the rule that produces your best guess. Here are some options (below) and how well they fare on the 3 qualities we want most.

| Estimator | Unbiased    | Consistent | Efficient |
|-------    |-------------|-------------|------------|
| `$\bar{x}$`  | `$\checkmark$` | `$\checkmark$`  | `$\checkmark$` |
| `$x_i$`      | `$\checkmark$`  | `$\times$` | `$\times$` |
| `$4$`        | `$\times$` | `$\times$` | `$\checkmark$`  |
| median  (normal dist) | `$\checkmark$` | `$\checkmark$`  | `$\times$`|
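
A rough simulation can make the table concrete. The sketch below assumes a Normal population with mean 5 and standard deviation 2 (values chosen only for illustration) and records each of the four estimators across many repeated samples, then reports the simulated bias and variance of each.

```python
import random
import statistics

def simulate(n=25, reps=2000, mu=5.0, sigma=2.0, seed=1):
    """Draw `reps` samples of size n and record four estimators of mu."""
    rng = random.Random(seed)
    draws = {"mean": [], "first obs": [], "constant 4": [], "median": []}
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        draws["mean"].append(statistics.fmean(x))
        draws["first obs"].append(x[0])
        draws["constant 4"].append(4.0)
        draws["median"].append(statistics.median(x))
    # Bias = average estimate minus truth; variance = spread across samples.
    return {
        name: (statistics.fmean(v) - mu, statistics.pvariance(v))
        for name, v in draws.items()
    }

results = simulate()
for name, (bias, var) in results.items():
    print(f"{name:>10}: bias = {bias:+.3f}, variance = {var:.3f}")
```

Rerunning with a larger `n` should shrink the variance of the mean and the median, while the single observation and the constant do not improve.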

---

# Properties of point estimators

##### Sampling distribution

* Distribution of `$\hat{\theta}_n$`
* Standard error of `$\hat{\theta}_n$`

`$$\se = \sd(\hat{\theta}_n) = \sqrt{\Var (\hat{\theta}_n)}$$`

* Estimating the standard error `$\widehat{\se}$`

---

# Mean squared error

$$
`\begin{align}
\text{MSE} &= \E_\theta [(\hat{\theta}_n - \theta)^2] \\
&= \text{bias}^2(\hat{\theta}_n) + \Var_\theta (\hat{\theta}_n)
\end{align}`
$$

* `$\E_\theta (\cdot)$` with respect to the distribution `$f(x_1, \ldots, x_n; \theta)$`

`$$\frac{\hat{\theta}_n - \theta}{\se} \leadsto N(0,1)$$`
---
# Understanding expected values of estimators

KEY for homework:

`$$E[\bar X] = \frac{1}{n} \sum_{i=1}^n E[X_i]$$`
--

* Translation: *the expected value of the sample average ( `$\bar{X}$` ) equals the average of the expected values of the individual observations*

---

# Bernoulli random variable: refresher

Bernoulli refresher: one-shot event with probability `$\pi$` of success.

Let `$X_1, \ldots, X_n \sim \text{Bernoulli}(\pi)$` and let `$\hat{\pi}_n = \frac{1}{n} \sum_{i=1}^n X_i$`. Then

`$$\E(\hat{\pi}_n) = \frac{1}{n} \sum_{i=1}^n \E(X_i) = \pi$$`

We can then compare this expected value to the true parameter `$\pi$` to check whether they are equal.

We are adding up the `$n$` expected values `$\E(X_i)$` and averaging them, and each of these *is* our `$\pi$`.

The two are equal, so `$\hat{\pi}_n$` is unbiased

---

# Bernoulli random variable: SE

The standard error is

`$$\se = \sqrt{\Var (\hat{\pi}_n)}$$`
--
We can think of the variance inside the square root as `$\frac{\Var(X_i)}{n}$`

HOW?? Recall `$\hat{\pi}_n = \frac{1}{n} \sum_{i=1}^n X_i$`, so `$\Var(\hat{\pi}_n) = \Var \left( \frac{1}{n} \sum X_i \right)$`. A constant comes out of a variance squared, so `$\Var(\hat{\pi}_n) = \left( \frac{1}{n} \right)^2 \Var \left( \sum X_i \right)$`.

Because the `$X_i$` are independent, the variance of the sum is the sum of the variances: `$\Var(\hat{\pi}_n) = \left( \frac{1}{n} \right)^2 \sum \Var(X_i)$`. Each `$X_i$` has the same variance, so we can re-write this as `$\frac{n}{n^2} \Var(X_i) = \frac{\Var(X_i)}{n} = \frac{\pi(1-\pi)}{n}$`.

--
We are adding up the `$n$` identical variances and dividing by `$n^2$`.

---

Therefore, we get: `$\se= \sqrt{\frac{\pi (1 - \pi)}{n}}$`

`$$\widehat{\se} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}$$`

$$
`\begin{align}
\text{bias}(\hat{\pi}_n) &= \E_\pi (\hat{\pi}) - \pi \\
&= \pi - \pi \\
&= 0
\end{align}`
$$

and

`$$\se = \sqrt{\frac{\pi (1 - \pi)}{n}} \rightarrow 0 \quad \text{as } n \rightarrow \infty$$`

so `$\hat{\pi}_n$` is a consistent estimator of `$\pi$`
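
As a small numeric sketch of the plug-in standard error, the code below uses hypothetical data (60 successes in 100 trials, not taken from the slides) and also checks that quadrupling the sample size halves the standard error.

```python
import math

def bernoulli_se(x):
    """Return (pi_hat, estimated standard error) for a list of 0/1 outcomes."""
    n = len(x)
    pi_hat = sum(x) / n
    se_hat = math.sqrt(pi_hat * (1 - pi_hat) / n)
    return pi_hat, se_hat

# Hypothetical data: 60 successes in 100 trials.
sample = [1] * 60 + [0] * 40
pi_hat, se_hat = bernoulli_se(sample)
print(pi_hat, round(se_hat, 4))

# The standard error shrinks like 1 / sqrt(n): quadrupling n halves it.
_, se_big = bernoulli_se(sample * 4)
print(round(se_hat / se_big, 2))
```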

---

## Estimators: recap

* Estimators are WHAT/HOW we come to have a best guess of our thing we care about
--

* We need a way to think about / evaluate if they are good or not
--

* That way is called 'BLUE' (best linear unbiased estimator)
--

* Using this is how we decided to use the sample mean as the best guess of the population mean (and other similar decisions)

---
class: middle, center, inverse

# Confidence: Precision vs accuracy

---

# Confidence sets

A ( `$1 - \alpha$` ) **confidence interval** for a parameter `$\theta$` is an interval `$C_n = (a,b)$` where `$a = a(X_1, \ldots, X_n)$` and `$b = b(X_1, \ldots, X_n)$` are functions of the data such that

`$$\Pr_{\theta} (\theta \in C_n) \geq 1 - \alpha, \quad \forall \, \theta \in \Theta$$` where `$\Theta$` is the parameter space (the set of all possible values of `$\theta$`).

`$(a,b)$` traps `$\theta$` with probability `$1- \alpha$`

Call `$1 - \alpha$` the **coverage** of the confidence interval

---

# Caution interpreting CIs

* `$C_n$` is random and `$\theta$` is fixed
* Common value of `$\alpha = 0.05$`
* A confidence interval is not a probability statement about `$\theta$`
    * `$\theta$` is a fixed quantity, not a random variable
    * `$\theta$` is or is not in the interval with probability `$1$`

> On day 1, you collect data and construct a 95% confidence interval for a parameter `$\theta_1$`. On day 2, you collect new data and construct a 95% confidence interval for a parameter `$\theta_2$`. You continue this way constructing confidence intervals for a sequence of unrelated parameters `$\theta_1, \theta_2, \ldots$`. Then 95% of your intervals will trap the true parameter value.

---

# Constructing confidence intervals

Suppose that `$\hat{\theta}_n \approx N(\theta, \widehat{\se}^2)$`. Let `$\Phi$` be the CDF of a standard Normal distribution and let

`$$z_{\frac{\alpha}{2}} = \Phi^{-1} \left(1 - \frac{\alpha}{2} \right)$$`

`$$\Pr (Z > z_{\frac{\alpha}{2}}) = \frac{\alpha}{2}$$`
and

`$$\Pr (-z_{\frac{\alpha}{2}} \leq Z \leq z_{\frac{\alpha}{2}}) = 1 - \alpha$$`

where `$Z \sim N(0,1)$`

`$$C_n = (\hat{\theta}_n - z_{\frac{\alpha}{2}} \widehat{\se}, \hat{\theta}_n + z_{\frac{\alpha}{2}} \widehat{\se})$$`

---

## Confidence intervals: calculation

$$
`\begin{align}
\Pr_\theta (\theta \in C_n) &= \Pr_\theta (\hat{\theta}_n - z_{\frac{\alpha}{2}} \widehat{\se} < \theta < \hat{\theta}_n + z_{\frac{\alpha}{2}} \widehat{\se}) \\
&= \Pr_\theta (- z_{\frac{\alpha}{2}} < \frac{\hat{\theta}_n - \theta}{\widehat{\se}} < z_{\frac{\alpha}{2}}) \\
&\rightarrow \Pr ( - z_{\frac{\alpha}{2}} < Z < z_{\frac{\alpha}{2}}) \\
&= 1 - \alpha
\end{align}`
$$

Thus, we have a rejection region of size `$\alpha$` and a confidence interval that contains or captures `$1-\alpha$` in mass.

---
### Confidence interval formula

`$$C_n = (\hat{\theta}_n - z_{\frac{\alpha}{2}} \widehat{\se}, \hat{\theta}_n + z_{\frac{\alpha}{2}} \widehat{\se})$$`

For a sample mean `$\bar x$`:

`$$C_n = (\bar x - z_{\frac{\alpha}{2}} \widehat{\se}, \bar x + z_{\frac{\alpha}{2}} \widehat{\se})$$`
or

`$$(\bar x \pm z_{\frac{\alpha}{2}} \widehat{\se})$$`
where `$z_{\frac{\alpha}{2}}$` is our cutoff point from a z-table

---

# Z tables: how to read

* We are looking for the `$z$` value that leaves an area of `$\frac{\alpha}{2}$` in the tail of the curve
* Rows are integer values and first decimal (tenths)
* Columns are second decimal (hundredths)

---

# Z tables: Example

---

# Common z values needed for CI:

CI level | `$\frac{\alpha}{2}$` | z
--------|-----------|---------
90% | 0.05 | 1.645
95% | 0.025 | 1.96
99% | 0.005 | 2.58
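
If no z-table is at hand, the same cutoffs can be computed from the inverse Normal CDF. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `z_cutoff` is just an illustrative choice.

```python
from statistics import NormalDist

def z_cutoff(level):
    """Two-sided critical value z_{alpha/2} for a given confidence level."""
    alpha = 1 - level
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} CI: z = {z_cutoff(level):.3f}")
```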

---
# CI: when to use

Confidence intervals give you information about the uncertainty in your point estimate. You KNOW your point estimate is likely wrong (very precise but not very accurate).

You can give a RANGE around your point estimate to improve your accuracy, at the cost of precision.

There is a delicate balance between accuracy and precision: the more confident you want to be that the range contains the truth, the wider the range has to be. We have conventionally compromised at 95% as the 'right' balance between the two.

---

## CI: Example

Suppose you have a point estimate of 0.65 with an SE of 0.03. Find a 90%, 95%, and 99% CI.

CI level | `$\frac{\alpha}{2}$` | z
--------|-----------|---------
90% | 0.05 | 1.645
95% | 0.025 | 1.96
99% | 0.005 | 2.58

`$$(\bar x \pm z_{\frac{\alpha}{2}} \widehat{\se})$$`
--

- **90% CI:** `$0.65 \pm 1.645 \times 0.03 = 0.65 \pm 0.049 = (0.601,\ 0.699)$`  
- **95% CI:** `$0.65 \pm 1.96 \times 0.03 = 0.65 \pm 0.059 = (0.591,\ 0.709)$`  
- **99% CI:** `$0.65 \pm 2.58 \times 0.03 = 0.65 \pm 0.077 = (0.573,\ 0.727)$`

#### NOTE: Our POINT ESTIMATE is in the center of the CI
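
The three intervals can be reproduced with a few lines of code. This sketch assumes the Normal approximation used throughout these slides; `wald_ci` is an illustrative helper name, and it uses the exact Normal quantile rather than the rounded table value.

```python
from statistics import NormalDist

def wald_ci(estimate, se, level=0.95):
    """Normal-based interval: estimate +/- z_{alpha/2} * se."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z * se, estimate + z * se

for level in (0.90, 0.95, 0.99):
    lo, hi = wald_ci(0.65, 0.03, level)
    print(f"{level:.0%} CI: ({lo:.3f}, {hi:.3f})")
```

Notice that the interval widens as the confidence level rises, while its center stays at the point estimate.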

---
class: middle, inverse

# Hypothesis testing: 
## Am I right or am I right?

---

# Hypothesis testing

* Null hypothesis
* Ask if the data **provide sufficient evidence** to reject the theory

Formally, suppose that we partition the parameter space `$\Theta$` into two disjoint sets `$\Theta_0$` and `$\Theta_1$` and that we wish to test

`$$H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \in \Theta_1$$`

* `$H_0$`
* `$H_1$`

Let `$X$` be a random variable and let `$\chi$` be the range of `$X$`. Test a hypothesis by finding an appropriate subset of outcomes `$R \subset \chi$` called the **rejection region**

If `$X \in R$` we reject the null hypothesis, otherwise we do not reject the null hypothesis

`$$R = \left\{ x: T(x) > c \right\}$$`

---

# Types of errors: table

Truth / Decision | Reject | Fail to Reject
--------|---------|---------
`$H_0$` True | ? |  ?
`$H_0$` False | ? | ?

---

# Types of errors: table

Truth / Decision | Reject | Fail to Reject
--------|---------|---------
`$H_0$` True | ? | YES! 
`$H_0$` False | YES! |  ?

---

# Types of errors: table

Truth / Decision | Reject | Fail to Reject
--------|---------|---------
`$H_0$` True | WRONG! (type I error) | YES!
`$H_0$` False | YES! | WRONG! (type II error)

---
# Positives and negatives

Example: Suppose our null is that people support a six-hour exam. We need to understand if we have sufficient evidence against this or not.

Truth / Decision | Reject | Fail to Reject
--------|---------|---------
`$H_0$` T: YAY EXAM | WRONG! (type I error) | YES!
`$H_0$` F: BOO EXAM | YES! | WRONG! (type II error)

---

# Types of errors: visual

---

# Power function

Power - probability of correctly rejecting the null hypothesis when it is false. Another way to consider this is to ask whether you COULD reject the null if it were false. You might not have enough POWER (your sample size is too small and/or the null is only slightly wrong).

The **power function** of a test with rejection region `$R$` is defined by

`$$\beta(\theta) = \Pr_\theta (X \in R)$$`

The size of a test is defined to be

`$$\alpha = \text{sup}_{\theta \in \Theta_0} \beta(\theta)$$`

A test is said to have **level** `$\alpha$` if its size is less than or equal to `$\alpha$`

---
class: center, middle, inverse

## Setting up hypothesis testing

---
# Sided tests
You are positing 'where' the true value is (larger or smaller) vs 'not here'

### Two-sided test

`$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0$$`

### One-sided test

`$$H_0: \theta \leq \theta_0 \quad \text{versus} \quad H_1: \theta > \theta_0$$`

`$$H_0: \theta \geq \theta_0 \quad \text{versus} \quad H_1: \theta < \theta_0$$`

---

# Example hypothesis test

Let `$X_1, \ldots, X_n \sim N(\mu, \sigma^2)$` where `$\sigma$` is known

Test `$H_0: \mu \leq 0$` versus `$H_1: \mu > 0$`

`$$\Theta_0 = (-\infty, 0], \quad\Theta_1 = (0, \infty)$$`

Consider the test

`$$\text{reject } H_0 \text{ if } T>c$$`

where `$T = \bar{X}$`. The rejection region is

`$$R = \left\{(x_1, \ldots, x_n): T(x_1, \ldots, x_n) > c \right\}$$`

---

# Example hypothesis test: power formula

Let `$Z$` denote the standard Normal random variable. The power function is

$$
`\begin{align}
\beta(\mu) &= \Pr_\mu (\bar{X} > c) \\
&= \Pr_\mu \left(\frac{\sqrt{n} (\bar{X} - \mu)}{\sigma} > \frac{\sqrt{n} (c - \mu)}{\sigma} \right) \\
&= \Pr_\mu \left(Z > \frac{\sqrt{n} (c - \mu)}{\sigma} \right) \\
&= 1 - \Phi \left( \frac{\sqrt{n} (c - \mu)}{\sigma} \right)
\end{align}`
$$

Think of this as the alternative universe where there is a different truth.

---

# Example hypothesis test: cdf

---

# Example hypothesis test: formula

`$$\alpha = \text{sup}_{\mu \leq 0} \beta(\mu) = \beta(0) = 1 - \Phi \left( \frac{\sqrt{n} (c)}{\sigma} \right)$$`

For a size `$\alpha$` test, we set this equal to `$\alpha$` and solve for `$c$` to get

`$$c = \frac{\sigma \Phi^{-1} (1 - \alpha)}{\sqrt{n}}$$`

We reject `$H_0$` when

`$$\bar{X} > \frac{\sigma \Phi^{-1} (1 - \alpha)}{\sqrt{n}}$$`

Equivalently, we reject when

`$$\frac{\sqrt{n}(\bar{X} - 0)}{\sigma} > z_\alpha$$`

where `$z_\alpha = \Phi^{-1} (1 - \alpha)$`.
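
To see the cutoff and the power function working together, here is a minimal sketch for the one-sided test of `$H_0: \mu \leq 0$`. The values `sigma = 1`, `n = 25`, and `alpha = 0.05` are assumptions chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def critical_value(alpha, sigma, n):
    """c such that the test 'reject if X-bar > c' has size alpha."""
    return sigma * STD_NORMAL.inv_cdf(1 - alpha) / sqrt(n)

def power(mu, c, sigma, n):
    """beta(mu) = 1 - Phi(sqrt(n) * (c - mu) / sigma)."""
    return 1 - STD_NORMAL.cdf(sqrt(n) * (c - mu) / sigma)

# Illustrative values only: sigma = 1, n = 25, alpha = 0.05.
c = critical_value(0.05, sigma=1.0, n=25)
print(round(c, 3))                       # cutoff for X-bar
print(round(power(0.0, c, 1.0, 25), 3))  # size: power at the boundary mu = 0
print(round(power(0.5, c, 1.0, 25), 3))  # power when the true mean is 0.5
```

The power at the boundary `mu = 0` equals the size of the test, and it grows as the true mean moves further into the alternative.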

---

# Example hypothesis test: formula, simplified

Two ways to approach:

* Cutoff: calculate your z and compare it to your threshold (the critical value)
* Calculation: calculate your z, look up its p-value, and compare that to your alpha

Example:

`$$z=\frac{\bar x - \mu}{\sigma/\sqrt{n}}$$`

Note: unless told otherwise, you can assume two defaults:

`$\alpha$` is 0.05 and the (two-sided) `$z$` cutoff is 1.96.

---

# Example hypothesis test: formula, simplified ex
Example:

`$$z=\frac{\bar x - \mu}{\sigma/\sqrt{n}}$$`

Try: `$\bar x = 4.3$`, `$\mu = 4.1$`, `$\sigma = 0.89$`, `$n = 103$`.

`$z = \frac{4.3 - 4.1}{0.89 / \sqrt{103}} = \frac{0.2}{0.0877} \approx 2.28$`

--
So, reject because 2.28 is larger than our (default) cutoff of 1.96, OR calculate the two-sided p-value for a `$z$` of 2.28 or more extreme using a z-table (approx. 0.0226). This is less than our (default) `$\alpha$` of 0.05, so we reject the null (too weird!).
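
Both approaches (cutoff and p-value) can be checked in code. This sketch plugs in the numbers from the example; `z_test` is an illustrative helper name, and the p-value it returns is two-sided.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu_0, sigma, n):
    """Return the z statistic and its two-sided p-value."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

z, p = z_test(x_bar=4.3, mu_0=4.1, sigma=0.89, n=103)
print(f"z = {z:.2f}, p = {p:.4f}")
print("reject H0" if abs(z) > 1.96 else "fail to reject H0")
```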

---

# Wald test

Let `$\theta$` be a scalar parameter, let `$\hat{\theta}$` be an estimate of `$\theta$`, and let `$\widehat{\se}$` be the estimated standard error of `$\hat{\theta}$`. Consider testing

`$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0$$`

Assume that `$\hat{\theta}$` is asymptotically Normal:

`$$\frac{\hat{\theta} - \theta_0}{\widehat{\se}} \leadsto N(0,1)$$`

The size `$\alpha$` Wald test is: reject `$H_0$` when `$|W| > z_{\alpha / 2}$` where

`$$W = \frac{\hat{\theta} - \theta_0}{\widehat{\se}}$$`

---

# Power of the Wald test

Suppose the true value of `$\theta$` is `$\theta_* \neq \theta_0$`. The power `$\beta(\theta_*)$` is given (approximately) by

`$$1 - \Phi \left( \frac{\theta_0 - \theta_*}{\widehat{\se}} + z_{\alpha/2} \right) + \Phi \left( \frac{\theta_0 - \theta_*}{\widehat{\se}} - z_{\alpha/2} \right)$$`

* The power is large if `$\theta_*$` is far from `$\theta_0$`
* The power is large if the sample size is large
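
Both bullets can be checked numerically. The sketch below assumes the Normal approximation `$W \approx N((\theta_* - \theta_0)/\se, 1)$` and uses illustrative values (`theta_0 = 0`, `se = 0.1`) that are not from the slides.

```python
from statistics import NormalDist

def wald_power(theta_star, theta_0, se, alpha=0.05):
    """Approximate power of the two-sided Wald test when the truth is theta_star."""
    phi = NormalDist().cdf
    z = NormalDist().inv_cdf(1 - alpha / 2)
    shift = (theta_0 - theta_star) / se
    return 1 - phi(shift + z) + phi(shift - z)

# Illustrative numbers: theta_0 = 0, se = 0.1.
for theta_star in (0.0, 0.1, 0.2, 0.4):
    print(theta_star, round(wald_power(theta_star, 0.0, 0.1), 3))
```

When `theta_star` equals `theta_0` the "power" is just the size `alpha`. A larger sample shrinks `se`, which raises the power for any fixed `theta_star` away from `theta_0`.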

---
class: center, middle, inverse

## Details: hypothesis testing
---

# Example: comparing two means

Let `$X_1, \ldots, X_m$` and `$Y_1, \ldots, Y_n$` be two independent samples from populations with means `$\mu_1, \mu_2$` respectively. Test the null hypothesis that `$\mu_1 = \mu_2$`

`$$H_0: \delta = 0 \quad \text{versus} \quad H_1: \delta \neq 0$$`

where `$\delta = \mu_1 - \mu_2$`

The estimate of `$\delta$` is `$\hat{\delta} = \bar{X} - \bar{Y}$` with estimated standard error

`$$\widehat{\se} = \sqrt{\frac{s_1^2}{m} + \frac{s_2^2}{n}}$$`

where `$s_1^2$` and `$s_2^2$` are the sample variances

The size `$\alpha$` Wald test rejects `$H_0$` when `$|W| > z_{\alpha / 2}$` where

`$$W = \frac{\hat{\delta} - 0}{\widehat{\se}} = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_1^2}{m} + \frac{s_2^2}{n}}}$$`

---

# Illustration: `$t$`-test

---

# Statistical vs. scientific significance

---

# Understanding `$p$` -values

Generally, if the test rejects at level `$\alpha$` it will also reject at level `$\alpha' > \alpha$`

Smallest `$\alpha$` at which the test rejects - the `$p$`-value

The smaller the `$p$`-value, the stronger the evidence against `$H_0$`

---

# Interpreting `$p$`-values

`$p$`-value  | evidence
-----------|-----------------------------------------
`$< 0.01$`    | very strong evidence against `$H_0$`
`$0.01 - 0.05$`| strong evidence against `$H_0$`
`$0.05 - 0.10$`| weak evidence against `$H_0$`
`$> 0.1$`     | little or no evidence against `$H_0$`

--
<img src="https://imgs.xkcd.com/comics/p_values_2x.png" width="220px" style="display: block; margin: auto;" />

---

# Calculating `$p$`-values

Suppose that the size `$\alpha$` test is of the form

`$$\text{reject } H_0 \text{ if and only if } T(X^n) \geq c_\alpha$$`
--

Then,

`$$\text{p-value} = \text{sup}_{\theta \in \Theta_0} \Pr_\theta (T(X^n) \geq T(x^n))$$`

where `$x^n$` is the observed value of `$X^n$`.

--
If `$\Theta_0 = \{ \theta_0 \}$` then

`$$\text{p-value} = \Pr_{\theta_0} (T(X^n) \geq T(x^n))$$`

---

# `$p$` -value for Wald test

Let

`$$w = \frac{\hat{\theta} - \theta_0}{\widehat{\se}}$$`

denote the observed value of the Wald statistic `$W$`. The `$p$`-value is given by

`$$\text{p-value} = \Pr_{\theta_0} (|W| > |w|) \approx \Pr (|Z| > |w|) = 2 \Phi(-|w|)$$`

where `$Z \sim N(0,1)$`
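
As a quick sketch of this formula in code, the example below uses hypothetical inputs (an estimate of 0.65 with standard error 0.03, tested against `theta_0 = 0.6`); `wald_p_value` is an illustrative helper name.

```python
from statistics import NormalDist

def wald_p_value(theta_hat, theta_0, se_hat):
    """Two-sided p-value 2 * Phi(-|w|) for the observed Wald statistic w."""
    w = (theta_hat - theta_0) / se_hat
    return w, 2 * NormalDist().cdf(-abs(w))

# Hypothetical estimate 0.65 with standard error 0.03, testing theta_0 = 0.6.
w, p = wald_p_value(0.65, 0.60, 0.03)
print(f"w = {w:.2f}, p-value = {p:.4f}")
```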

---

# `$p$` -value for Wald test

---
# Working some examples (time permitting!)

# Two-Sample Difference of Means

**Sociology Example: Weekly household chores**

- Men (`$n_1 = 85$`): `$\bar{x}_1 = 14.2$`, `$s_1 = 4.1$`  
- Women (`$n_2 = 90$`): `$\bar{x}_2 = 15.0$`, `$s_2 = 4.3$`

## Step 1: Point Estimate

`$\bar{x}_1 - \bar{x}_2 = 14.2 - 15.0 = -0.8$`

## Step 2: Standard Error

`$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$`
`$= \sqrt{\frac{4.1^2}{85} + \frac{4.3^2}{90}}$`
`$= \sqrt{0.1977 + 0.2054} \approx 0.635$`

---

## Example: Interpretation

## Step 3: Test Statistic

`$z = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{SE}$`
`$= \frac{-0.8}{0.635} \approx -1.26$`

- Observed difference is 1.26 standard errors below 0  
- Two-tailed test p-value ≈ 0.2077
- Insufficient evidence to reject the null hypothesis that women and men do equal chores; not statistically significant at 5%
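
The three steps can be wrapped into one small function. This is a sketch under the same Normal approximation; `two_sample_wald` is an illustrative helper name, and the inputs are the summary statistics from the chores example.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_wald(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Difference of means, its standard error, z statistic, two-sided p-value."""
    diff = mean_1 - mean_2
    se = sqrt(sd_1**2 / n_1 + sd_2**2 / n_2)
    z = diff / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return diff, se, z, p_value

# Chores example: men (group 1) versus women (group 2).
diff, se, z, p = two_sample_wald(14.2, 4.1, 85, 15.0, 4.3, 90)
print(f"diff = {diff:.1f}, SE = {se:.3f}, z = {z:.2f}, p = {p:.4f}")
```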

---

# Example: Polling with margins

# Difference of Means Test: Polling Example

A polling organization is comparing support for a policy between **Urban** and **Rural** residents.

- Urban sample: `$n_1 = 120$`, `$\bar{x}_1 = 0.52$`, `$s_1 = 0.06$`  
- Rural sample: `$n_2 = 130$`, `$\bar{x}_2 = 0.55$`, `$s_2 = 0.055$`  
- Goal: Test whether the support differs between groups

## Step 1: Point Estimate

`$\bar{x}_1 - \bar{x}_2 = 0.52 - 0.55 = -0.03$`

## Step 2: Standard Error

`$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{0.06^2}{120} + \frac{0.055^2}{130}} = \sqrt{0.00003 + 0.000023} \approx 0.0073$`

---
## Ex, cont'd

## Step 3: Test Statistic

`$z = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{SE} = \frac{-0.03}{0.0073} \approx -4.11$`

## Interpretation

- Observed difference is 4.11 standard errors below 0  
- Two-tailed p-value < 0.0001
- Significant difference: Urban support is lower than Rural

---
## Power: ex calculation

# Difference of Means and Power: Polling Example

A polling organization is comparing support for a new policy between **Urban** and **Suburban** residents.

- Urban sample: `$n_1 = 100$`, `$\bar{x}_1 = 0.48$`, `$s_1 = 0.06$`  
- Suburban sample: `$n_2 = 120$`, `$\bar{x}_2 = 0.52$`, `$s_2 = 0.055$`  
- Goal: Test whether the support differs between groups and illustrate power

## Step 1: Point Estimate

`$\bar{x}_1 - \bar{x}_2 = 0.48 - 0.52 = -0.04$`


---
## Ex, power, cont'd:

## Step 2: Standard Error

`$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{0.06^2}{100} + \frac{0.055^2}{120}} = \sqrt{0.000036 + 0.000025} \approx 0.0078$`

--
## Step 3: Test Statistic

`$z = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{SE} = \frac{-0.04}{0.0078} \approx -5.1$`

## Step 4: Power Illustration

- Suppose the true difference is actually `$\delta = -0.03$` (Urban slightly lower)  
- Power is the probability that `$|z|$` exceeds the critical value, so that we correctly reject `$H_0$`

- Critical `$z$` for `$\alpha = 0.05$` (two-tailed) is `$z_{0.975} = 1.96$`  
- With these sample sizes, even a small difference like `$|\delta| = 0.03$` is several standard errors away from 0 → **high power**
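
One way to put a number on "high power" is the Normal approximation for the two-sided test. The sketch below assumes the standard error implied by the two samples and a true difference of 0.03 in magnitude (Urban lower); both the approach and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf
z_crit = NormalDist().inv_cdf(0.975)          # about 1.96

se = sqrt(0.06**2 / 100 + 0.055**2 / 120)     # standard error from Step 2
delta = -0.03                                 # assumed true difference (Urban lower)

# Under the alternative, z is approximately Normal with mean delta / se and sd 1.
shift = delta / se
power = phi(-z_crit - shift) + (1 - phi(z_crit - shift))
print(f"SE = {se:.4f}, power = {power:.3f}")
```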

---
# Recap

* Estimators: generate our point estimate
* Point estimates: best guess 
* Testing: HOW WEIRD IS IT
* Evaluation:
  * Find sample mean (AKA POINT ESTIMATE)
  * Find expected value
  * Find SE
  * Calculate in formula above
  * Evaluate / make decision
* PREVIEW: Bayes: different approach to statistics vs frequentist

---
class: center, middle, inverse
## Let’s get to know each other a bit more – Name, pronouns, subfield/research area, where you are currently, something fun/interesting about you and/or your hobbies

---
class: center, middle, inverse

# Bonus!

---

# Topics covered here (optional!)

* Define maximum likelihood estimation (MLE)
* Review the properties of the maximum likelihood estimator
* Demonstrate MLE for basic estimators

---
class: center, middle, inverse

# Maximum likelihood

---

# Maximum likelihood

Let `$X_1, \ldots, X_n$` be IID with PDF `$f(x; \theta)$`. The **likelihood function** is defined by

`$$\Lagr_n(\theta) = \prod_{i=1}^n f(X_i; \theta)$$`

The **log-likelihood function** is defined by `$\lagr_n (\theta) = \log \Lagr_n(\theta)$`

* Likelihood function is the joint density of the data
* Not the same as a PDF - it is a likelihood function

The **maximum likelihood estimator** `$\hat{\theta}_n$` is the value of `$\theta$` that maximizes `$\Lagr_n(\theta)$`

`$$\max (\Lagr_n(\theta)) \equiv \max (\lagr_n(\theta))$$`

---

# MLE of the mean of the Normal variable

`$$f(x_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right]$$`

---

# MLE of the mean of the Normal variable

$$
`\begin{align}
\lagr_n(\mu, \sigma^2 | X) &= \log \prod_{i = 1}^{n}{\frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[-\frac{(X_i - \mu)^2}{2\sigma^2}\right]} \\
&= \sum_{i=1}^{n}{\log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[-\frac{(X_i - \mu)^2}{2\sigma^2}\right]\right)} \\
&= -\frac{n}{2} \log(2\pi) - \sum_{i = 1}^{n} \left[ \frac{1}{2} \log \sigma^2 + \frac{1}{2\sigma^2} (X_i - \mu)^2 \right]
\end{align}`
$$

---

# MLE of the mean of the Normal variable

<table>
<caption>Salaries of assistant professors</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> id </th>
   <th style="text-align:right;"> salary </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 60 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 55 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 65 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 50 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 70 </td>
  </tr>
</tbody>
</table>
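
With these five salaries, the Normal log-likelihood can be evaluated directly over a grid of candidate means. This is a rough sketch: the variance is held fixed at an assumed value (`sigma2 = 50`), and the grid range is chosen only to cover the data.

```python
import math

salaries = [60, 55, 65, 50, 70]

def normal_loglik(mu, sigma2, data):
    """Log-likelihood of IID Normal(mu, sigma2) data."""
    n = len(data)
    sq_dev = sum((x - mu) ** 2 for x in data)
    return -n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sigma2) - sq_dev / (2 * sigma2)

# Evaluate on a grid of candidate means, holding sigma2 fixed (assumed value).
sigma2 = 50.0
grid = [m / 10 for m in range(500, 701)]          # 50.0, 50.1, ..., 70.0
best_mu = max(grid, key=lambda m: normal_loglik(m, sigma2, salaries))
print(best_mu, sum(salaries) / len(salaries))
```

The grid value that maximizes the log-likelihood coincides with the sample mean, whatever fixed `sigma2` is used.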

---

# MLE of the mean of the Normal variable

<img src="13-frequentist-inference_files/figure-html/loglik-normal-1.png" width="864" style="display: block; margin: auto;" />
---

# Bernoulli random variable

Suppose that `$X_1, \ldots, X_n \sim \text{Bernoulli} (\pi)$`. The probability function is

`$$f(x; \pi) = \pi^x (1 - \pi)^{1 - x}$$`

for `$x = 0,1$`. The unknown parameter is `$\pi$`.

$$
`\begin{align}
\Lagr_n(\pi) &= \prod_{i=1}^n f(X_i; \pi) \\
&= \prod_{i=1}^n \pi^{X_i} (1 - \pi)^{1 - X_i} \\
&= \pi^S (1 - \pi)^{n - S}
\end{align}`
$$

where `$S = \sum_{i} X_i$`.

The log-likelihood function is therefore

`$$\lagr_n (\pi) = S \log(\pi) + (n - S) \log(1 - \pi)$$`

1. Take the derivative of `$\lagr_n (\pi)$`
1. Set it equal to 0
1. Solve to find `$\hat{\pi}_n = \frac{S}{n}$`

---

# Bernoulli random variable
We're going to solve for the 'best' estimator, `$\hat{\pi}$`
`$$\lagr_n (\pi) = S \log(\pi) + (n - S) \log(1 - \pi)$$`
--
`$$\frac{d\lagr_n (\hat{\pi})}{d \hat{\pi}} = \frac{S}{\hat{\pi}} -  \frac{(n - S)}{1 - \hat{\pi}} = 0$$`
--

`$$\frac{S}{\hat{\pi}} =  \frac{(n - S)}{1 - \hat{\pi}}$$`

--
`$S(1-\hat{\pi}) =  \hat{\pi}(n - S)$` and `$S-S\hat{\pi} =  \hat{\pi}(n - S)$`

We can continue simplifying to get:

--
`$\hat{\pi} n - S \hat{\pi} + S\hat{\pi} = S$` so

`$$\hat{\pi} = \frac{S}{n}$$`
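
The closed-form answer can be sanity-checked by brute force. This sketch uses hypothetical data (`S = 12` successes in `n = 20` trials) and simply evaluates the log-likelihood on a fine grid of candidate values.

```python
import math

def bernoulli_loglik(pi, s, n):
    """Log-likelihood S*log(pi) + (n - S)*log(1 - pi)."""
    return s * math.log(pi) + (n - s) * math.log(1 - pi)

# Hypothetical data: S = 12 successes in n = 20 trials.
s, n = 12, 20
grid = [k / 1000 for k in range(1, 1000)]         # avoid log(0) at the endpoints
pi_hat_grid = max(grid, key=lambda p: bernoulli_loglik(p, s, n))
print(pi_hat_grid, s / n)
```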

---

# Bernoulli random variable: plot

---
# Log-likelihood close-up

<img src="13-frequentist-inference_files/figure-html/unnamed-chunk-5-1.png" width="864" style="display: block; margin: auto;" />
---

# Properties of maximum likelihood estimators

1. Consistent
1. Equivariant
1. Asymptotically Normal
1. Asymptotically optimal or efficient

* Large sample sizes
* Smoothness conditions for `$f(x; \theta)$`

---

# Properties of maximum likelihood estimators

##### Consistency

`$$\hat{\theta}_n \xrightarrow{P} \theta_*$$`

Meaning: it converges in probability to the true value

##### Equivariance

If `$\hat{\theta}_n$` is the MLE of `$\theta$`, then `$g(\hat{\theta}_n)$` is the MLE of `$g(\theta)$`

##### Asymptotic normality

Let `$\se = \sqrt{\Var (\hat{\theta}_n)}$`

`$$\frac{\hat{\theta}_n - \theta_*}{\se} \leadsto N(0,1)$$`

`$$\frac{\hat{\theta}_n - \theta_*}{\widehat{\se}} \leadsto N(0,1)$$`

---

# Properties of maximum likelihood estimators

##### Optimality

Suppose that `$X_1, \ldots, X_n \sim N(\theta, \sigma^2)$`. The MLE is `$\hat{\theta}_n = \bar{X}_n$`

Another reasonable estimator of `$\theta$` is the sample median `$\tilde{\theta}_n$`. The MLE satisfies

`$$\sqrt{n} (\hat{\theta}_n - \theta) \leadsto N(0, \sigma^2)$$`

Median satisfies

`$$\sqrt{n} (\tilde{\theta}_n - \theta) \leadsto N \left(0, \sigma^2 \frac{\pi}{2} \right)$$`

* For large samples, the MLE has the smallest (asymptotic) variance among well-behaved estimators
* No other well-behaved estimator has a smaller asymptotic variance
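
The `$\frac{\pi}{2}$` factor can be seen in a simulation. The sketch below assumes standard Normal data and an arbitrary sample size of 101; the ratio it prints is itself a noisy estimate, so expect it to land near, not exactly at, the theoretical value.

```python
import random
import statistics

def variance_ratio(n=101, reps=4000, seed=7):
    """Ratio Var(median) / Var(mean) across simulated standard Normal samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(x))
        medians.append(statistics.median(x))
    return statistics.pvariance(medians) / statistics.pvariance(means)

ratio = variance_ratio()
print(round(ratio, 2))   # theory says this approaches pi / 2 as n grows
```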