Basic Probability Theory

Introduction

We will review the following topics:

  1. Basic probability theory: Set; event; probability; Bayes’ rule; random variable; density/mass/distribution functions.
  2. Distributions and moments: Joint/marginal/conditional distributions; moment and generating functions; covariance and correlation; iterated moments; common parametric distributions.
  3. Sampling and estimation: Sample mean and variance; large sample theories; properties of point estimators; method of moments; maximum likelihood; interval estimation and confidence interval.
  4. Hypothesis testing: Null and composite hypotheses; test statistic; type I/II errors; power functions; p-value; duality principle; hypothesis tests regarding means, variances and proportions.

Sets, events and probability

Sets and events

  1. A sample space is a set containing all possible outcomes.
    An event is a subset of the sample space. An empty event is the empty set.
  2. For two or more sets, the intersection operator $\cap$ extracts elements common to both sets.
    The intersection of sets cannot have more elements than the individual sets.
  3. For two or more sets, the union operator $\cup$ combines elements from
    both sets.
    The union of sets cannot have fewer elements than the individual sets.
  4. Two sets $E_1$ and $E_2$ are disjoint if $E_1 \cap E_2 = \emptyset$ (nothing in common).
  5. The sets $E_1,E_2,\dots,E_n$:
    • are mutually exclusive if $E_i \cap E_j = \emptyset$ for all $i \ne j$
    • are exhaustive if $E_1 \cup \dots \cup E_n = \Omega$, (make up the sample space);
    • form a partition if the above two properties are true.
  6. The complement of a set $E$ is a set that contains all elements not in $E$, denoted as $E^c$.

Probability of events

  1. The probability operator $\mathbb{P}$ assigns a number to each event to denote its “likelihood” of happening.
    • $\mathbb{P}(\emptyset) = 0, \mathbb{P}(\Omega) = 1$;
    • $\mathbb{P}(E) \ge 0$ for any event $E$;
    • $\mathbb{P}(E) + \mathbb{P}(E^c) = 1$;
    • For mutually exclusive events $E_1,\dots,E_n$, $\mathbb{P}$ is additive, i.e., $\mathbb{P}(E_1 \cup \dots \cup E_n) = \mathbb{P}(E_1) + \dots + \mathbb{P}(E_n)$.
  2. The inclusion-exclusion formula is extremely useful to convert the probabilities of $\cup$ to $\cap$, and vice versa. For two events, $\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)-\mathbb{P}(A\cap B)$; analogous formulas hold for three or more events.
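As a quick sanity check, the two-event inclusion-exclusion identity can be verified with exact fractions (a minimal sketch; the die-roll events are illustrative):

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die.
omega = set(range(1, 7))
A = {2, 4, 6}          # "even number"
B = {4, 5, 6}          # "at least 4"

def prob(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event), len(omega))

lhs = prob(A | B)                          # P(A ∪ B)
rhs = prob(A) + prob(B) - prob(A & B)      # P(A) + P(B) - P(A ∩ B)
assert lhs == rhs == Fraction(2, 3)
```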

Conditional probability

Probabilities conditional on given information

Sometimes, we deal with probabilities on the condition that we know something in advance.

Calculating conditional probabilities

  1. Definition: Conditional probability
    For two events $A$ and $B$, the conditional probability of event $A$ given the occurrence of event $B$ is written as $\mathbb{P}(A|B)$ and calculated as $\mathbb{P}(A|B)=\dfrac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}$, if $\mathbb{P}(B)>0$.
    This definition simply restates that $\mathbb{P}(A\cap B)= \mathbb{P}(B)\mathbb{P}(A|B)$.
  2. Theorem: Law of total probability
    For events $B_1,\dots, B_n$ that form a partition (i.e., mutually exclusive &
    exhaustive) and event $A$, $\mathbb{P}(A)=\sum_{i=1}^n \mathbb{P}(A|B_i)\mathbb{P}(B_i)$.
  3. Theorem: Multiplication rule
    For events $B_1,\dots,B_n$, $\mathbb{P}(B_1\cap B_2\cap\dots\cap B_n)=\mathbb{P}(B_1)\,\mathbb{P}(B_2|B_1)\,\mathbb{P}(B_3|B_1\cap B_2)\cdots\mathbb{P}(B_n|B_1\cap\dots\cap B_{n-1})$.

Independent events

  1. Two events $A$ and $B$ are said to be independent (written as $A \perp\kern-5pt\perp B$) if $\mathbb{P}(A) = \mathbb{P}(A|B)$, i.e., occurrence of $B$ does not affect the chances of $A$ happening. This implies $\mathbb{P}(A \cap B) = \mathbb{P}(A) \mathbb{P}(B)$.
    This can be extended to more than two events: Events $A_1,\dots,A_n$ are mutually independent if and only if $\mathbb{P}(A_{k_1}\cap A_{k_2}\cap\dots\cap A_{k_m})=\mathbb{P}(A_{k_1})\mathbb{P}(A_{k_2})\cdots\mathbb{P}(A_{k_m})$ for every combination of distinct indices $k_1,k_2,\dots,k_m$ and $m\le n$.

Bayes’ rule

  1. Theorem: Bayes’ rule
    For two events $A$ and $B$ with $\mathbb{P}(A)\gt 0$ and $\mathbb{P}(B)\gt 0$, $\mathbb{P}(B|A)=\dfrac{\mathbb{P}(A|B)\mathbb{P}(B)}{\mathbb{P}(A)}$. $\mathbb{P}(B)$ is known as the prior probability and $\mathbb{P}(B|A)$ the posterior probability.
    $\mathbb{P}(B)=\mathbb{P}(B|A)\iff \mathbb{P}(A)=\mathbb{P}(A|B)$, i.e., $A$ and $B$ are independent and thus $A$ adds no information on $B$.
  2. In general, if $B_1,\dots,B_n$ constitute a partition, then $\mathbb{P}(B_j|A)=\dfrac{\mathbb{P}(A|B_j)\mathbb{P}(B_j)}{\sum_{i=1}^n \mathbb{P}(A|B_i)\mathbb{P}(B_i)}$.
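The interplay of the prior, the law of total probability, and Bayes’ rule can be shown in a small numeric sketch (the diagnostic-test numbers below are purely illustrative assumptions):

```python
from fractions import Fraction

# Hypothetical diagnostic test; all numbers are illustrative assumptions.
p_disease = Fraction(1, 100)             # prior P(B)
p_pos_given_disease = Fraction(95, 100)  # P(A|B)
p_pos_given_healthy = Fraction(5, 100)   # P(A|B^c)

# Law of total probability gives the denominator P(A):
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule gives the posterior P(B|A):
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
assert p_disease_given_pos == Fraction(19, 118)  # about 0.161
```

Despite the high test accuracy, the posterior is small because the prior is small.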

Random variables and distributions

Random variables

  1. A random variable is a function that maps each element of the sample space to a real number.

    A random variable is realized when we observe its value. We typically use capital letters (e.g., $X$, $Y$) to denote random variables and small letters (e.g., $x$, $y$) to denote their realizations.

    There are three main types of random variables — discrete, continuous, and mixed.

  2. Discrete random variables — pmf
    A random variable $X$ is discrete if it can only take on a countable (possibly countably infinite) number of values.

    Definition: Probability mass function
    For a discrete random variable $X$, the probability mass function or pmf is defined as

    $p_X(x)=\mathbb{P}(X=x)$

    for $x\in X(\Omega)$, where $X(\Omega)$ is the set of all possible values of $X$.
    A valid pmf has the following properties:

    • $p_X(x)\ge 0$ for all $x$, and $\sum_{x\in X(\Omega)}p_X(x)=1$
    • For any subset $A\subset X(\Omega)$, $\mathbb{P}(X\in A)=\sum_{x\in A}p_X(x)$.
  3. Discrete random variables — cdf
    Definition: Cumulative distribution function
    For a discrete random variable $X$, the cumulative distribution function or cdf is defined as $F_X(x)=\mathbb{P}(X\le x)=\sum_{t\le x}p_X(t)$ for $x \in \mathbb{R}$. It is often shortened as the distribution function of $X$.
    A valid cdf has the following properties:
    • $F_X(a)\le F_X(b)$ if $a\le b$ (non-decreasing)
    • $F_X(x)$ is right-continuous
    • $\lim_{x\to-\infty}F_X(x)=0$ and $\lim_{x\to\infty}F_X(x)=1$
  4. Continuous random variables — pdf
    Definition: Continuous random variable
    A random variable $X$ is (absolutely) continuous if there exists a non-negative function $f$ defined on the real line such that

    $\mathbb{P}(a\le X\le b)=\int_a^b f(x)\,\mathrm{d}x$

    for every $a\le b$. The function $f(x)$ is known as the probability density function (pdf) of $X$.

    A valid pdf has the following properties:

    • $f(x)\ge 0$ for all $x$.
    • $\int_A f(x)\mathrm{d}x=\mathbb{P}(X\in A)$ where $A$ is any subset of $\mathbb{R}$
  5. Continuous random variables — cdf
    Definition: Cumulative distribution function
    For a continuous random variable $X$, the (cumulative) distribution function or cdf is defined as $F_X(x)=\mathbb{P}(X\le x)=\int_{-\infty}^x f(t)\,\mathrm{d}t$ for $x\in\mathbb{R}$.
  6. Mixed random variables
    A random variable can also have both a discrete and a continuous part. This is known as a mixed random variable, which has probability masses at some locations and densities at other locations.

More on Distributions and Moments

Bivariate distributions

  1. Joint distributions
    • Definition: Joint cumulative distribution function for 2 variables
      For two random variables $X$ and $Y$, the joint (cumulative) distribution function or joint cdf is defined as $F_{X,Y}(x,y)=\mathbb{P}(X\le x, Y\le y)$ for $(x,y)\in \mathbb{R}^2$.
    • Definition: Joint probability mass function $X$ and $Y$ are jointly discrete if there exists a joint probability mass function (joint pmf) such that $p_{X,Y}(x,y)=\mathbb{P}(X=x, Y=y)$.
    • Definition: Joint probability density function $X$ and $Y$ are jointly continuous if there exists a joint probability density function (joint pdf) such that $\mathbb{P}((X,Y)\in A)=\iint_A f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y$ for any subset $A\subset\mathbb{R}^2$.
  2. Marginal distributions
    • Definition: Marginal pmf/pdf
      For $X$ and $Y$ jointly discrete, the marginal pmf of $X$ and $Y$ are respectively given by $p_X(x)=\sum_y p_{X,Y}(x,y)$ and $p_Y(y)=\sum_x p_{X,Y}(x,y)$. For $X$ and $Y$ jointly continuous, the marginal pdf of $X$ and $Y$ are respectively given by $f_X(x)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}y$ and $f_Y(y)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}x$.
  3. Conditional distributions
    • Definition: Conditional pmf/pdf
      For $X$ and $Y$ jointly discrete, the conditional pmf of $Y$ given $X$ is $p_{Y|X}(y|x)=\dfrac{p_{X,Y}(x,y)}{p_X(x)}$ if $p_X(x)\gt 0$.
      For $X$ and $Y$ jointly continuous, the conditional pdf of $Y$ given $X$ is $f_{Y|X}(y|x)=\dfrac{f_{X,Y}(x,y)}{f_X(x)}$ if $f_X(x)\gt 0$.
    • The conditional cdf can be obtained from the conditional pmf/pdf: $F_{Y|X}(y|x)=\sum_{t\le y}p_{Y|X}(t|x)$ in the discrete case, or $\int_{-\infty}^{y} f_{Y|X}(t|x)\,\mathrm{d}t$ in the continuous case.
  4. Independence of random variables
    • Definition: Independent random variables
      Two random variables $X$ and $Y$ are independent if and only if the joint pmf/pdf is equal to the product of the marginal pmf’s/pdf’s, i.e., $p_{X,Y}(x,y)=p_X(x)p_Y(y)$ (or $f_{X,Y}(x,y)=f_X(x)f_Y(y)$) for all possible values of $x$ and $y$.
    • The above definition also works for cdf’s, i.e.,
      $F_{X,Y}(x,y) = F_X(x) F_Y(y)$ for all $x$ and $y$.
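To make the joint/marginal/conditional definitions concrete, here is a small sketch that computes marginals, a conditional pmf, and checks independence for a hypothetical joint pmf on $\{0,1\}^2$ (the table values are illustrative):

```python
from fractions import Fraction
from itertools import product

F = Fraction
# A hypothetical joint pmf on {0,1} x {0,1}; the values are illustrative.
joint = {(0, 0): F(1, 8), (0, 1): F(3, 8), (1, 0): F(1, 8), (1, 1): F(3, 8)}
assert sum(joint.values()) == 1    # a valid joint pmf sums to 1

# Marginals: sum the joint pmf over the other variable.
p_X = {x: sum(joint[(x, y)] for y in (0, 1)) for x in (0, 1)}
p_Y = {y: sum(joint[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Conditional pmf of Y given X = 0: p_{Y|X}(y|0) = p_{X,Y}(0,y) / p_X(0).
p_Y_given_X0 = {y: joint[(0, y)] / p_X[0] for y in (0, 1)}

# Independence: the joint pmf factorizes at every point.
independent = all(joint[(x, y)] == p_X[x] * p_Y[y]
                  for x, y in product((0, 1), repeat=2))
assert independent
```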

Expectations and moments

Mathematical expectations

  1. Definition: Expectation
    • The expectation of a random variable $X$ (written as $\mathbb{E}(X)$) is defined as $\mathbb{E}(X)=\sum_x x\,p_X(x)$ if $X$ is discrete, and $\mathbb{E}(X)=\int_{-\infty}^{\infty} x f(x)\,\mathrm{d}x$ if $X$ is continuous.
    • The expectation of $g(X)$, a (known) function of $X$, can be defined similarly: $\mathbb{E}[g(X)]=\sum_x g(x)\,p_X(x)$ or $\int_{-\infty}^{\infty} g(x) f(x)\,\mathrm{d}x$.
    • We say that the expectation (of $X$ or $g(X)$) does not exist if the sum or integral diverges.
    • Properties of the expectation operator:
      1. $\mathbb{E}(aX+b)=a\mathbb{E}(X)+b$ for any constants $a$, $b$ and random variable $X$. We call $\mathbb{E}$ a linear operator.
      2. If $X\le Y$ for all possible outcomes in the sample space, then $\mathbb{E}(X) \le \mathbb{E}(Y)$.
      3. If $\mathbb{E}|X^a|$ exists for some $a \gt 0$, then $\mathbb{E}|X^b|$ exists for all $0 \lt b \lt a$. This also implies the existence of $\mathbb{E}(X^b)$.
  2. Moments
    • Definition: Moments
      The $n$th raw moment of a random variable $X$ is defined as $\mathbb{E}(X^n)$, if it exists. The first raw moment is also known as the mean of $X$, often denoted by $\mu$.
      The $n$th central moment of a random variable $X$ is defined as $\mathbb{E}[(X-\mu)^n]$, if it exists.
  3. Means, variances and standard deviations
    • Raw moments are moments about the origin; central moments are moments about the mean.
    • Definition: Summary measures of a distribution
      The mean of $X$, $\mu$ (or $\mathbb{E}(X)$), measures the central tendency of $X$.
      The second central moment, $\mathbb{E}[(X-\mu)^2]$, is denoted by $\sigma^2$ or $\mathrm{Var}(X)$ and is known as the variance of $X$.
      The square root of $\sigma^2$, $\sigma$, is known as the standard deviation of $X$ and has the same unit as $X$. Both $\sigma$ and $\sigma^2$ measure the dispersion (spread) of $X$ about the mean.
    • The variance is equal to the second raw moment minus the square of the mean: $\mathrm{Var}(X)=\mathbb{E}(X^2)-\mu^2$.
    • $\mathrm{Var}(a X + b)=a^2 \mathrm{Var}(X)$ for any constants $a$, $b$ and random variable $X$.
  4. Higher moments
    • Definition: Skewness (third moment)
      The third central moment provides a measure of the skewness (asymmetry) of the distribution. The coefficient of skewness is defined as $\dfrac{\mathbb{E}[(X-\mu)^3]}{\sigma^3}$. It is positive if the distribution is right-skewed, and negative if it is left-skewed.
    • Definition: Kurtosis (fourth moment)
      The fourth central moment provides a measure of the kurtosis (tailedness) of the distribution. The coefficient of kurtosis is defined as $\dfrac{\mathbb{E}[(X-\mu)^4]}{\sigma^4}$. A leptokurtic distribution has fat tails (kurtosis > 3), and a platykurtic distribution has thin tails (kurtosis < 3).
  5. Moment-generating functions
    Definition: Moment-generating function
    The moment-generating function (mgf) of a random variable $X$ is defined as $M_X(t)=\mathbb{E}(e^{tX})$, if it exists, with $t$ the argument of the mgf. It is possible that $M_X(t)$ is finite only on a subset of $\mathbb{R}$.
    This function is “moment-generating” in the sense that we can obtain moments from it: $\mathbb{E}(X^n)=M_X^{(n)}(0)$, the $n$th derivative of $M_X$ evaluated at $t=0$. From the definition, we obtain that
    $M_{aX+b}(t) = \mathbb{E}[e^{t(aX+b)}] = e^{bt}\mathbb{E}(e^{atX}) = e^{bt}M_X(at)$ for constants $a,b$ , if $M_X(at)$ exists.
  6. Moments of functions of random variables
    For a function $g(X,Y)$ of two variables, the expectation is equal to $\mathbb{E}[g(X,Y)]=\sum_x\sum_y g(x,y)\,p_{X,Y}(x,y)$ in the discrete case, or $\iint g(x,y)\,f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y$ in the continuous case. Note the following useful properties, where $a,b$ are constants and $g,h$ are (known) functions.
    1. $\mathbb{E}[a\cdot g(X) + b \cdot h(Y)] = a\mathbb{E}[g(X)]+b\mathbb{E}[h(Y)]$ (linearity)
    2. If $X$ and $Y$ are independent, then $\mathbb{E}[g(X)\cdot h(Y)]=\mathbb{E}[g(X)]\cdot \mathbb{E}[h(Y)]$
    3. If $X$ and $Y$ are independent, then $M_{aX+bY}(t)=M_X(at)M_Y(bt)$
  7. Covariance and correlation
    Definition: Covariance and correlation
    For random variables $X,Y$ with means $\mu_X,\mu_Y$ and standard deviations $\sigma_X,\sigma_Y$, the covariance is defined by

    $\mathrm{Cov}(X,Y)=\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]=\mathbb{E}(XY)-\mu_X\mu_Y$

    The correlation coefficient is defined by

    $\rho_{XY}=\dfrac{\mathrm{Cov}(X,Y)}{\sigma_X\sigma_Y}$

    It can be shown that $-1\le\rho_{XY}\le 1$ for any $X,Y$ such that $\rho_{XY}$ exists.

    Random variables having positive (negative) covariance/correlation coefficient are known as positively (negatively) correlated.

    Some properties of the covariance/correlation of two random variables:

    1. For independent variables, $Cov(X,Y)=0$. The reverse is not true!
    2. $Var(X)=Cov(X,X)$.
    3. $Cov(X,Y)=Cov(Y,X)$.
    4. $Cov(aX+b,cY+d)=Cov(aX,cY)=acCov(X,Y)$.
    5. $Var(aX+bY)=a^2Var(X)+b^2Var(Y)+2abCov(X,Y)$.
    6. More generally, $Var\left(\sum_i a_iX_i\right)=\sum_i a_i^2 Var(X_i)+2\sum_{i\lt j}a_ia_j Cov(X_i,X_j)$.
    7. $Cor(aX+b,cY+d)=sign(ac)Cor(X,Y)$,where $sign(ac)=-1$ if $ac\lt 0$ and 1 if $ac\gt 0$.
  8. Conditional expectations
    The conditional mean of $Y$ given $X=x$ is $\mathbb{E}(Y|X=x)=\sum_y y\,p_{Y|X}(y|x)$ (or $\int y\,f_{Y|X}(y|x)\,\mathrm{d}y$). The conditional variance can be computed as $\mathbb{E}(Y^2|X=x)-[\mathbb{E}(Y|X=x)]^2$.
  9. Iterated moments
    Theorem: Law of total expectation/variance
    For random variables $X,Y$, we have
    $\mathbb{E}(Y)=\mathbb{E}[\mathbb{E}(Y|X)]$;
    $\mathrm{Var}(Y)=\mathbb{E}[\mathrm{Var}(Y|X)]+\mathrm{Var}[\mathbb{E}(Y|X)]$.
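Both laws can be verified exactly on a toy joint pmf (an illustrative sketch; the joint table is made up):

```python
from fractions import Fraction

F = Fraction
# Small illustrative joint pmf of (X, Y).
joint = {(0, 1): F(1, 4), (0, 2): F(1, 4), (1, 2): F(1, 4), (1, 4): F(1, 4)}

def E(g):
    """Expectation of g(X, Y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

p_X = {x: sum(p for (x2, _), p in joint.items() if x2 == x) for x in (0, 1)}

def E_Y_given(x):      # conditional mean E(Y | X = x)
    return sum(p * y for (x2, y), p in joint.items() if x2 == x) / p_X[x]

def Var_Y_given(x):    # conditional variance Var(Y | X = x)
    m = E_Y_given(x)
    return sum(p * (y - m) ** 2 for (x2, y), p in joint.items() if x2 == x) / p_X[x]

# Law of total expectation: E(Y) = E[E(Y|X)]
lhs = E(lambda x, y: y)
rhs = sum(p_X[x] * E_Y_given(x) for x in (0, 1))
assert lhs == rhs

# Law of total variance: Var(Y) = E[Var(Y|X)] + Var[E(Y|X)]
var_Y = E(lambda x, y: y ** 2) - lhs ** 2
e_var = sum(p_X[x] * Var_Y_given(x) for x in (0, 1))
var_e = sum(p_X[x] * E_Y_given(x) ** 2 for x in (0, 1)) - rhs ** 2
assert var_Y == e_var + var_e
```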

Some common distributions

Discrete distribution — Uniform(a, b)

For integers $a$ and $b$ with $a \le b$, the discrete uniform distribution puts equal point masses at $a, a + 1, a + 2, \dots , b$.

pmf: $p_X(x)=\dfrac{1}{b-a+1}$ for $x=a,a+1,\dots,b$

cdf: $F_X(x)=\dfrac{\lfloor x\rfloor-a+1}{b-a+1}$ for $a\le x\le b$

Mean: $\dfrac{a+b}{2}$

Variance: $\dfrac{(b-a+1)^2-1}{12}$

mgf: (omitted)

Discrete distribution — Bernoulli(p)

The Bernoulli distribution models the number of successes of a single trial with success probability $p\in [0,1]$

pmf: $p_X(x)=p^x(1-p)^{1-x}$ for $x\in\{0,1\}$

cdf: $F_X(x)=0$ for $x\lt 0$; $1-p$ for $0\le x\lt 1$; $1$ for $x\ge 1$

Mean: $p$

Variance: $p(1-p)$

mgf: $M_X(t)=1-p+pe^t$

Discrete distribution — Binomial(n,p)

The Binomial distribution models the number of successes of n independent trials, each with success probability $p \in [0, 1]$.

pmf: $p_X(x)=\binom{n}{x}p^x(1-p)^{n-x}$ for $x=0,1,\dots,n$

cdf: (no simple expression)

Mean: $np$

Variance: $np(1-p)$

mgf: $M_X(t)=(1-p+pe^t)^n$

$\binom{n}{x}=n!/[x!(n-x)!]$ is the binomial coefficient.

If $X_1,\dots,X_n$ are independent and identically distributed as Bernoulli($p$), then the sum $\sum_{i=1}^n X_i\sim\text{Binomial}(n,p)$.

Discrete distribution — Poisson(λ)

The Poisson distribution arises in two contexts: (1) Number of arrivals (occurrences) in a specific time period; (2) Approximation to
the binomial distribution. It has a single parameter $\lambda \gt 0$.

pmf: $p_X(x)=\dfrac{e^{-\lambda}\lambda^x}{x!}$ for $x=0,1,2,\dots$

cdf: (no simple expression)

Mean: $\lambda$

Variance: $\lambda$

mgf: $M_X(t)=e^{\lambda(e^t-1)}$

$\lambda$ is known as the rate parameter. It can be shown that, if the time between any two consecutive arrivals (interarrival time) is independently exponentially distributed with mean $1/\lambda$, then the number of arrivals by time 1 follows $\text{Poisson}(\lambda)$.

If $X_1,X_2,\dots,X_n$ are independent and each has a $\text{Poisson}(\lambda_i)$ distribution (i.e., the rate can be different for each $X_i$), then the sum $\sum_i X_i\sim\text{Poisson}(\sum_i\lambda_i)$

If the binomial parameters $n\rightarrow \infty,p\rightarrow 0$ but $ np\rightarrow \lambda $, then $\text{Binomial}(n,p)\rightarrow \text{Poisson}(\lambda)$. The approximation works well if $n\gt 100$ and $np\lt 10$.
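A quick numerical check of the approximation (a sketch; the choice $n=1000$, $p=0.005$ is illustrative):

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

n, p = 1000, 0.005          # large n, small p, np = 5
lam = n * p

# The two pmf's should be close at every point, e.g. at x = 3:
diff = abs(binom_pmf(3, n, p) - poisson_pmf(3, lam))
assert diff < 1e-3
```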

Discrete distribution — NegBin(r, p) & Geom(p)

The negative binomial distribution models the number of failures before r successes are achieved, with trials independent of each other and having success probability $p\in(0,1]$

pmf: $p_X(x)=\dbinom{x+r-1}{x}p^r(1-p)^x$ for $x=0,1,2,\dots$

cdf: (no simple expression)

Mean: $\dfrac{r(1-p)}{p}$

Variance: $\dfrac{r(1-p)}{p^2}$

mgf: $M_X(t)=\left(\dfrac{p}{1-(1-p)e^t}\right)^r$ for $(1-p)e^t\lt 1$

The special case of $r=1$ is known as the geometric distribution.

If $X_i\sim\text{NegBin}(r_i,p)$ are independent random variables, then $\sum_i X_i\sim\text{NegBin}(\sum_i r_i,p)$.

The geometric distribution is the only discrete distribution that is memoryless.

Continuous distribution — Uniform(a, b)

For real numbers $a$ and $b$ with $a\lt b$, the continuous uniform distribution has a constant density over $[a,b]$.

pdf: $f(x)=\dfrac{1}{b-a}$ for $a\le x\le b$

cdf: $F(x)=\dfrac{x-a}{b-a}$ for $a\le x\le b$

Mean: $\dfrac{a+b}{2}$

Variance: $\dfrac{(b-a)^2}{12}$

mgf: $M_X(t)=\dfrac{e^{bt}-e^{at}}{(b-a)t}$ for $t\ne 0$, and $M_X(0)=1$

Continuous distribution — Beta(α, β)

The beta distribution generalizes the uniform distribution on the [0, 1] interval. For parameters $\alpha,\beta\gt 0$, it has the following quantities:

pdf: $f(x)=\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$ for $0\le x\le 1$

cdf: (no simple expression)

Mean: $\dfrac{\alpha}{\alpha+\beta}$

Variance: $\dfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$

mgf: (no simple expression)

$\Gamma(t)=\int_0^{\infty}x^{t-1}e^{-x}\mathrm{d}x$ is the gamma function.
Note that $\Gamma(t)=(t-1)\Gamma(t-1)$ and $\Gamma(n)=(n-1)!$ for integer $n$.

The $\text{Beta}(1, 1)$ and $\text{Uniform}(0, 1)$ distributions are identical.

A beta distribution is left-skewed if $\alpha\gt\beta$ (large mean) and right-skewed if $\alpha\lt\beta$ (small mean).

If $X_1, \dots , X_n$ are independent $\text{Uniform}(0, 1)$ random variables, then the $k$th order statistic (i.e., $k$th smallest number among the $X_i$’s) has a $\text{Beta}(k, n + 1 − k)$ distribution.

Continuous distribution — Gamma(α, β) & Exponential(β)

The gamma distribution has connections with the sum of interarrival times mentioned above for the Poisson distribution.

For parameters $\alpha,\beta\gt 0$,

pdf: $f(x)=\dfrac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$ for $x\gt 0$

cdf: (no simple expression)

Mean: $\dfrac{\alpha}{\beta}$

Variance: $\dfrac{\alpha}{\beta^2}$

mgf: $M_X(t)=\left(\dfrac{\beta}{\beta-t}\right)^\alpha$ for $t\lt\beta$

$\alpha$ is known as the shape parameter and $\beta$ the rate parameter.

The special case of $\alpha=1$ is known as the exponential distribution. The corresponding quantities are:

pdf: $f(x)=\beta e^{-\beta x}$ for $x\gt 0$

cdf: $F(x)=1-e^{-\beta x}$ for $x\gt 0$

Mean: $\dfrac{1}{\beta}$

Variance: $\dfrac{1}{\beta^2}$

mgf: $M_X(t)=\dfrac{\beta}{\beta-t}$ for $t\lt\beta$

If $X_1,\dots,X_n$ are independent $\text{Exponential}(\beta)$ random variables, then $\sum_{i=1}^n X_i \sim\mathrm{Gamma}(n,\beta)$

If $X_i\sim\mathrm{Gamma}(\alpha_i,\beta)$ are independent random variables, then $\sum_i X_i \sim\mathrm{Gamma}(\sum_i \alpha_i,\beta)$

If $X\sim\mathrm{Gamma}(\alpha,\beta)$, then $cX\sim\mathrm{Gamma}(\alpha,\beta/c)$ for any constant $c\gt 0$

The exponential distribution is the only continuous distribution that is memoryless, i.e., the distribution of $X-m$ given $X\ge m$ is the same exponential.

Continuous distribution — Chi-squared($\nu$)

The chi-squared ($\chi^2$) distribution has a single parameter $\nu\gt 0$, known as the degrees-of-freedom parameter.

pdf: $f(x)=\dfrac{1}{2^{\nu/2}\Gamma(\nu/2)}x^{\nu/2-1}e^{-x/2}$ for $x\gt 0$

cdf: (no simple expression)

Mean: $\nu$

Variance: $2\nu$

mgf: $M_X(t)=(1-2t)^{-\nu/2}$ for $t\lt 1/2$

The $\chi^2(\nu)$ distribution is equivalent to the $\mathrm{Gamma}(\nu/2,1/2)$ distribution.

If $X_1,\dots,X_n$ are independent $N(0,1)$ random variables, then $\sum_{i=1}^n X_i^2 \sim \chi^2(n)$.

If $X_i\sim \chi^2(\nu_i)$ are independent random variables, then $\sum_i X_i \sim \chi^2(\sum_i\nu_i)$. This follows from the same property of the gamma distribution.

Continuous distribution — Normal($\mu$,$\sigma^2$)

The normal distribution (or Gaussian distribution) is the cornerstone of statistics. For parameters $\mu$ and $\sigma^2\gt 0$, it has the following quantities:

pdf: $f(x)=\dfrac{1}{\sigma\sqrt{2\pi}}\exp\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$ for $x\in\mathbb{R}$

cdf: (no simple expression)

Mean: $\mu$

Variance: $\sigma^2$

mgf: $M_X(t)=\exp\left(\mu t+\dfrac{\sigma^2t^2}{2}\right)$

The Normal(0,1) or N(0,1) distribution is known as the standard normal distribution. Its cdf is often denoted by the Greek letter $\Phi$.

If $X\sim N(\mu,\sigma^2)$, then $\dfrac{X-\mu}{\sigma}\sim N(0,1)$. This is known as standardization.

If $X\sim N(\mu,\sigma^2)$, then $aX+b\sim N(a\mu+b,\,a^2\sigma^2)$ for any constants $a,b$.

If $X_1,\dots,X_n$ are independent $N(\mu_i,\sigma_i^2)$ random variables, then $\sum_i X_i\sim N(\sum_i\mu_i,\sum_i\sigma_i^2)$.

Normal approximation to the binomial: Binomial(n,p) can be approximated by N(np,np(1-p)) when n is large. In fact, as $n\rightarrow \infty$, we have $\dfrac{X_n-np}{\sqrt{np(1-p)}}\overset{d}{\rightarrow}N(0,1)$ for $X_n\sim\text{Binomial}(n,p)$.

Several other distributions also approach the normal in the limit.

  • Poisson($\lambda$) can be approximated by N($\lambda$,$\lambda$) if $\lambda$ is large.
  • Gamma($\alpha$,$\beta$) can be approximated by $N(\alpha/\beta,\alpha/\beta^2)$ if $\alpha$ is large.
  • NegBin(r,p) can be approximated by $N(r(1-p)/p,r(1-p)/p^2)$ if r is large.
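The binomial case can be checked numerically; the sketch below compares the exact binomial cdf with the normal approximation, adding a continuity correction (a standard refinement not discussed above; the choice $n=400$, $p=0.5$ is illustrative):

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 400, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))   # approximating N(np, np(1-p))

# Exact P(X <= 210) for X ~ Binomial(400, 0.5):
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(211))

# Normal approximation with a continuity correction (evaluate at 210.5):
approx = NormalDist(mu, sigma).cdf(210.5)
assert abs(exact - approx) < 1e-3
```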

Continuous distribution — honourable mention

In statistics you will often hear of the t and F distributions. They result from combining the independent random variables mentioned above, for example:

$T=\dfrac{Z}{\sqrt{W/\nu}}$ gives a $t(\nu)$-distributed random variable, where $Z\sim N(0,1),W\sim \chi^2(\nu)$ and $Z\perp\kern-5pt\perp W$, and

$F=\dfrac{X_1/\nu_1}{X_2/\nu_2}$ is $F(\nu_1,\nu_2)$-distributed, where $X_1\sim\chi^2(\nu_1)$, $X_2\sim\chi^2(\nu_2)$ and $X_1\perp\kern-5pt\perp X_2$

Sampling and Estimation

Sampling distributions

  • A population is the complete set of items or events of interest. A sample is a subset of outcomes collected.
  • A random sample is a sequence of i.i.d. random variables from a
    population distribution. Let $X_1,X_2,\dots,X_n$ denote a random
    sample of size $n$.
  • For a random sample $X_1,\dots,X_n$, the sample mean $\bar{X}$ and sample variance $S^2$ are respectively defined by $\bar{X}=\dfrac{1}{n}\sum_{i=1}^n X_i$ and $S^2=\dfrac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2$.
  • For a random sample from a population with mean $\mu$ and variance $\sigma^2$, $\mathbb{E}(\bar{X})=\mu$ and $\mathrm{Var}(\bar{X})=\sigma^2/n$.
  • The standard deviation (SD) of $\bar{X}$ , $\sigma/\sqrt{n}$, is known as the standard error (SE).
  • The SE is smaller with larger $n$ — this is intuitive as a larger sample will allow us to estimate $\mu$ more precisely.
  • For a random sample from a population with mean $\mu$ and variance $\sigma^2$, $\mathbb{E}(S^2)=\sigma^2$, i.e., the sample variance is unbiased for $\sigma^2$.

    Sampling distributions for the normal distribution

  • For a random sample from the $N(\mu,\sigma^2)$ distribution,

    1. $\bar{X}\sim N(\mu,\sigma^2/n)$
    2. $(n-1)S^2/\sigma^2\sim \chi^2(n-1)$
    3. $\bar{X}$ is independent of $S^2$
  • If $\bar{X}$ and $S^2$ are the size-$n$ sample mean and variance of the $N(\mu,\sigma^2)$ distribution, then $T=\dfrac{\bar{X}-\mu}{S/\sqrt{n}}$ has the t-distribution with $n-1$ degrees of freedom, denoted as $t(n-1)$ (or $t_{n-1}$)
  • Properties of the $t(\nu)$ distribution:
    • pdf: $f(x)=\dfrac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1+\dfrac{x^2}{\nu}\right)^{-(\nu+1)/2}$
    • cdf: (no simple expression)
    • Mean: $\mathbb{E}(X)=0$ if $\nu\gt 1$
    • Variance: $Var(X)=\frac{\nu}{\nu-2}$ if $\nu\gt 2$
    • mgf: (undefined)
  • The $100(1 − \alpha)\%$ confidence interval (CI) for $\mu$ based on a random
    sample from a normal distribution is

    $\left[\bar{x}-t_{n-1,\alpha/2}\dfrac{s}{\sqrt{n}},\ \bar{x}+t_{n-1,\alpha/2}\dfrac{s}{\sqrt{n}}\right]$

    where $\bar{x}$ is the (observed) sample mean, $s$ is the (observed) sample SD, $n$ is the sample size and $t_{n-1,\alpha/2}$ is defined as the value such that $\mathbb{P}(T\gt t_{n-1,\alpha/2})=\alpha/2$ for $T\sim t(n-1)$ (i.e., the $(1-\alpha/2)$ quantile of $T$)

    Large sample theory

  • Weak law of large numbers (WLLN)
    For a random sample with finite (population) mean $\mu$, let $\bar{X}_n=\sum_{i=1}^n X_i/n$ be the sample mean. The weak law of large numbers states that

    $\lim_{n\rightarrow\infty}\mathbb{P}(|\bar{X}_n-\mu|\gt\epsilon)=0$

    for any positive $\epsilon$. In other words, $\bar{X}_n$ converges in probability to $\mu$, written as $\bar{X}_n\overset{P}{\rightarrow}\mu$

  • Strong law of large numbers (SLLN)
    For a random sample with finite (population) mean $\mu$, let $\bar{X}_n=\sum_{i=1}^n X_i/n$ be the sample mean. The strong law of large numbers states that $\mathbb{P}\left(\lim_{n\rightarrow\infty}\bar{X}_n=\mu\right)=1$. In other words, $\bar{X}_n$ converges almost surely to $\mu$, written as $\bar{X}_n\overset{a.s.}{\rightarrow}\mu$
  • Central limit theorem
    For a random sample with finite (population) mean $\mu$ and variance $\sigma^2$, let $\bar{X}_n=\sum_{i=1}^n X_i/n$ be the sample mean. The central limit theorem states that

    $\lim_{n\rightarrow\infty}\mathbb{P}\left(\dfrac{\bar{X}_n-\mu}{\sigma/\sqrt{n}}\le x\right)=\Phi(x)$

    pointwise. In other words, $(\bar{X}_n-\mu)/(\sigma/\sqrt{n})$ converges in distribution to a standard normal random variable, written as $(\bar{X}_n-\mu)/(\sigma/\sqrt{n})\overset{d}{\rightarrow}N(0,1)$
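A small simulation illustrates the CLT: standardized sample means of Uniform(0,1) draws behave like $N(0,1)$ observations (a sketch with an arbitrary seed and illustrative sample sizes):

```python
import random
from statistics import NormalDist, mean

random.seed(42)
n, reps = 200, 2000
# Population: Uniform(0,1), so mu = 1/2 and sigma^2 = 1/12.
mu, sigma = 0.5, (1 / 12) ** 0.5

# Standardize each of `reps` sample means; results should look like N(0,1).
z = [(mean(random.random() for _ in range(n)) - mu) / (sigma / n**0.5)
     for _ in range(reps)]

# Empirical P(Z <= 1) should be close to Phi(1) ~ 0.8413.
emp = sum(zi <= 1 for zi in z) / reps
assert abs(emp - NormalDist().cdf(1)) < 0.03
```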

    Point estimation - properties of estimators

  • Let the parameter be denoted as $\theta$. The point estimator of $\theta$, usually denoted as $\hat{\theta_n}$ where $n$ is the sample size, is a sample statistic used to estimate $\theta$. The realized value of $\hat{\theta_n}$ is called the point estimate.

  • Bias
    The bias of an estimator $\hat{\theta_n}$ is given by $\mathrm{Bias}(\hat{\theta_n})=\mathbb{E}(\hat{\theta_n})-\theta$. If the bias is zero for every possible value of $\theta$, the estimator is unbiased.
    If the bias is not zero but tends to zero as $n\rightarrow \infty$, the estimator is asymptotically unbiased.
    An estimator tends to underestimate the true value if $\mathbb{E}(\hat{\theta_n})\lt\theta$ and overestimate the true value if $\mathbb{E}(\hat{\theta_n})\gt\theta$
  • Mean squared error
    The mean squared error (MSE) of an estimator $\hat{\theta_n}$ is given by $\mathrm{MSE}(\hat{\theta_n})=\mathbb{E}[(\hat{\theta_n}-\theta)^2]=\mathrm{Var}(\hat{\theta_n})+[\mathrm{Bias}(\hat{\theta_n})]^2$. To achieve a low MSE, an estimator needs to be accurate (close to the true value) and precise (with little variability).
    For an unbiased estimator, the MSE is equal to $Var(\hat{\theta_n})$
  • Efficiency
    For two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of $\theta$, $\hat{\theta}_1$ is said to be more efficient than $\hat{\theta}_2$ if $\mathrm{Var}(\hat{\theta}_1)\le\mathrm{Var}(\hat{\theta}_2)$ for all possible values of the true parameter $\theta$.
    If either estimator is biased, it is better to make comparisons via the MSE since it also takes into account the magnitude of the bias.
    An unbiased estimator that has the smallest variance among all other unbiased estimators for all $\theta$ is known as the uniformly minimum variance unbiased estimator, or UMVUE.
  • Consistency
    An estimator $\hat{\theta_n}$ is consistent for $\theta$, if for every $\epsilon\gt0$ we have $\lim_{n\rightarrow\infty}\mathbb{P}(|\hat{\theta_n}-\theta|\gt\epsilon)=0$. In other words, $\hat{\theta_n}\overset{p}{\rightarrow}\theta$ as $n\rightarrow\infty$ if it is consistent.
    This definition is often hard to check. A useful workaround (sufficient but not necessary condition) is that if $\mathrm{MSE}(\hat{\theta_n})\rightarrow 0$ as $n\rightarrow\infty$, then $\hat{\theta_n}$ is consistent for $\theta$.

    Point estimation -methods

  • The method of moments estimates parameters by equating the sample raw moments to the raw moments of the target distribution. They are defined as:
    Sample $r$th raw moment: $\dfrac{1}{n}\sum_{i=1}^n X_i^r$
    $r$th raw moment of the distribution: $\mathbb{E}(X^r)$
  • Method of maximum likelihood
    The method of maximum likelihood considers the pdf/pmf as a likelihood function that is maximized. Some definitions are in order:
    Suppose random variables $X_1,\dots,X_n$ have joint pdf or pmf $f_X(x_1,\dots,x_n;\theta)$, where $\theta$ is a collection of parameters. The likelihood function $L$ is simply $f_X(x_1,\dots,x_n;\theta)$, but viewing it as a function of $\theta$ with $x_1,\dots,x_n$ fixed at their observed values. That is, $L(\theta)=f_X(x_1,\dots,x_n;\theta)$.

    The log-likelihood function is the (natural) logarithm of $L$, i.e., $\ell(\theta)=\ln L(\theta)$.
    Note that $X_1,\dots,X_n$ need not be independent.
    The maximum likelihood estimator (MLE) $\hat{\theta}_{ML}$ of a parameter $\theta$ is the value of $\theta$ that maximizes the likelihood (or log-likelihood) function, that is, $\hat{\theta}_{ML}=\arg\max_{\theta} L(\theta)$.

    If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$, a function of $\theta$ (the invariance property).
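For a Bernoulli($p$) sample the MLE has the closed form $\hat{p}_{ML}=\bar{x}$; the sketch below recovers it by maximizing the log-likelihood over a grid (the data values are illustrative):

```python
from math import log

# Observed Bernoulli(p) sample (illustrative data).
x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def loglik(p):
    # log L(p) = sum_i [x_i log p + (1 - x_i) log(1 - p)]
    return sum(xi * log(p) + (1 - xi) * log(1 - p) for xi in x)

# Maximize over a grid of candidate p values (avoiding 0 and 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=loglik)

# The grid maximizer matches the closed-form MLE, the sample mean.
assert abs(p_hat - sum(x) / len(x)) < 1e-3
```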

    Interval estimation

  • For a random sample used to estimate an unknown parameter $\theta$, let $L(X)$ and $U(X)$ be some functions of the random sample with

    $\mathbb{P}(L(X)\le\theta\le U(X))=1-\alpha$

    where $1-\alpha$ is typically a high probability. The interval $[L(X),U(X)]$ is known as a $100(1-\alpha)\%$ confidence interval (CI) for the parameter $\theta$.

  • The following is a general recipe for finding CI’s:
    1. Establish a pivotal quantity. A pivotal quantity is a function of the random sample and model parameters that has a distribution not involving $\theta$, written as $V(X,\theta)$.
    2. Find some constants $a,b$ such that $\mathbb{P}(a\le V(X,\theta)\le b)=1-\alpha$. Because the distribution of $V(X,\theta)$ does not depend on $\theta$, the constants $a$ and $b$ will also be free of $\theta$.
    3. Solve $a\le V(X,\theta)$ and $V(X,\theta)\le b$ for $\theta$. This will give a lower limit $L(X)$ and an upper limit $U(X)$ such that $\mathbb{P}(L(X)\le\theta\le U(X))=1-\alpha$. The required CI is given by $[L(X),U(X)]$.
  • Interval estimation for means - $N(\mu,\sigma^2)$
    The pivotal quantity $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim N(0,1)$
    The CI for the population mean is given by $[\bar{X}\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}]$
  • Interval estimation for means - $N(\mu,?)$
    The pivotal quantity $\frac{\bar{X}-\mu}{S/\sqrt{n}}\sim t(n-1)$
    The CI for the population mean is given by $[\bar{X}\pm t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}]$
  • Interval estimation for means - $N(\mu_X,\sigma_X^2)$ vs $N(\mu_Y,\sigma_Y^2)$, variances known
    The pivotal quantity is $\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\sigma_X^2/n_X+\sigma_Y^2/n_Y}}\sim N(0,1)$
    The CI for the difference in means is given by $\left[\bar{X}-\bar{Y}\pm z_{\alpha/2}\sqrt{\dfrac{\sigma_X^2}{n_X}+\dfrac{\sigma_Y^2}{n_Y}}\right]$
  • Interval estimation for means - $N(\mu_X,\sigma^2)$ vs $N(\mu_Y,\sigma^2)$, common variance unknown
    The pivotal quantity is $\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{S_p\sqrt{1/n_X+1/n_Y}}\sim t(n_X+n_Y-2)$, where $S_p^2=\dfrac{(n_X-1)S_X^2+(n_Y-1)S_Y^2}{n_X+n_Y-2}$ is the pooled sample variance
    The CI for the difference in means is given by $\left[\bar{X}-\bar{Y}\pm t_{n_X+n_Y-2,\alpha/2}\,S_p\sqrt{\dfrac{1}{n_X}+\dfrac{1}{n_Y}}\right]$
  • Interval estimation for means - $N(\mu_X,\sigma_X^2)$ vs $N(\mu_Y,\sigma_Y^2)$, variances unknown and unequal
    The pivotal quantity is $\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{S_X^2/n_X+S_Y^2/n_Y}}$, which is approximately $t(\nu)$-distributed. The number of degrees of freedom is $\nu=\dfrac{(S_X^2/n_X+S_Y^2/n_Y)^2}{\dfrac{(S_X^2/n_X)^2}{n_X-1}+\dfrac{(S_Y^2/n_Y)^2}{n_Y-1}}$ (the Welch–Satterthwaite approximation). The CI for the difference in means is given by $\left[\bar{X}-\bar{Y}\pm t_{\nu,\alpha/2}\sqrt{\dfrac{S_X^2}{n_X}+\dfrac{S_Y^2}{n_Y}}\right]$
  • Interval estimation for variances - N(?,?)
    The pivotal quantity is $\dfrac{(n-1)S^2}{\sigma^2}\sim\chi^2(n-1)$
    The CI for the population variance is given by $\left[\dfrac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}},\ \dfrac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}\right]$
  • Interval estimation for variances - $N(?,\sigma_X^2)$ vs $N(?,\sigma_Y^2)$
    The pivotal quantity is $\dfrac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2}\sim F(n_X-1,n_Y-1)$. The CI for the ratio of variances $\sigma_X^2/\sigma_Y^2$ is given by $\left[\dfrac{S_X^2/S_Y^2}{F_{n_X-1,n_Y-1,\alpha/2}},\ \dfrac{S_X^2/S_Y^2}{F_{n_X-1,n_Y-1,1-\alpha/2}}\right]$
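The first recipe (normal mean, known variance) can be sketched with the standard library; the data and the assumed known $\sigma$ below are illustrative:

```python
from statistics import NormalDist, mean

# Illustrative data, assumed drawn from a normal population with known sigma = 2.
x = [4.1, 5.2, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7]
sigma, n = 2.0, len(x)

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)       # z_{alpha/2}, about 1.96
half_width = z * sigma / n**0.5
ci = (mean(x) - half_width, mean(x) + half_width)

# The observed sample mean must lie inside its own CI.
assert ci[0] < mean(x) < ci[1]
```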

    Hypothesis Testing

Introduction to hypothesis testing

  • In statistics, a hypothesis test makes a decision between two mutually exclusive statements about the population, known as hypotheses.
  • The null hypothesis, denoted as $H_0$, is a statement that has an established standing or the “standard” that is being put to the test.
  • The alternative hypothesis, denoted as $H_1$ or $H_a$, is a statement that challenges the null hypothesis.
  • A simple hypothesis is one in which the hypothesis statement completely determines a single distribution. Otherwise, it is known as a composite hypothesis.
  • A test statistic is a function of the observed data that is used to construct the condition of a hypothesis test, based on which a decision is made.
  • The rejection (or critical) region is the range of values of the test statistic that, if observed, will lead to the rejection of $H_0$ (and acceptance of $H_1$).
  • The critical value(s) demarcates the rejection region.
  • The probability of making a type I error (rejecting $H_0$ when it is true) is the type I error rate ($\alpha$); it is also known as the significance level.
  • The probability of making a type II error (failing to reject $H_0$ when $H_1$ is true) is the type II error rate ($\beta$). One minus this probability gives the power of the test.
  • The power function of a statistical test gives the probability of rejecting $H_0$ as a function of the true parameter value: $\pi(\theta)=\mathbb{P}_\theta(\text{reject } H_0)$
  • Note that if $\theta\in\Theta_0$, the parameter space of $H_0$, then $\pi(\theta)$ gives the type I error rate.
    If $H_0$ is a composite hypothesis, then we define the size of the test as the maximum possible value of $\pi(\theta)$ for all $\theta\in\Theta_0$.
    A test has significance level $\alpha$ if its size is at most $\alpha$. The significance level and size are equal in many cases.
    The power of a test is the probability of not making a type II error. When $H_1$ is composite, the power at $\theta\in\Theta_1$ is simply $\pi(\theta)$.
  • The p-value of a statistical test is the probability of observing a value of the test statistic at least as inconsistent with $H_0$ as the observed value, if $H_0$ is true. The test rejects $H_0$ if the p-value is less than the significance level.

General steps in hypothesis testing

  1. Formulate a statistical model (distribution if parametric).
  2. Specify the null and alternative hypotheses.
  3. Determine a test statistic $T$. It is typically one with a nice distribution under $H_0$, so that the significance level can be easily obtained.
  4. Determine the significance level $\alpha$.
  5. Collect data and calculate the test statistic. (Note: You must specify the significance level prior to data analysis — no cheating!)
  6. [Rejection region approach] Find the rejection region of $T$ that corresponds to the selected $\alpha$.
    [p-value approach] Calculate the p-value corresponding to the observed test statistic.
  7. If the observed test statistic is in the rejection region (or the p-value is less than $\alpha$), you reject $H_0$ and accept $H_1$. Otherwise you do not reject $H_0$.
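The steps above can be sketched for a one-sample z-test of $H_0:\mu=\mu_0$ with known $\sigma$ (data and parameters are illustrative):

```python
from statistics import NormalDist, mean

# Test H0: mu = 5 against H1: mu != 5, with sigma = 2 known (illustrative data).
x = [4.1, 3.9, 4.3, 4.0, 4.6, 3.8, 4.2, 4.4, 4.1, 3.9]
mu0, sigma, alpha = 5.0, 2.0, 0.05

n = len(x)
z = (mean(x) - mu0) / (sigma / n**0.5)        # test statistic, N(0,1) under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

reject = p_value < alpha                      # reject H0 iff p-value < alpha
```

Here the p-value exceeds 0.05, so $H_0$ is not rejected even though the sample mean is below 5.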

Duality between hypothesis tests and confidence intervals

The $100(1-\alpha)\%$ CI is the set of $\theta_0$ under which $H_0:\theta=\theta_0$ is not rejected at significance level $\alpha$.
Equivalently, if the hypothesized value $\theta_0$ is not sufficiently far away from the point estimate, in the sense that it lies in the CI, the test will not reject $H_0$.
This is known as the duality between hypothesis tests and CI’s.