Basic Probability Theory

Introduction

We will review the following topics:

  1. Basic probability theory: Set; event; probability; Bayes’ rule; random variable; density/mass/distribution functions.
  2. Distributions and moments: Joint/marginal/conditional distributions; moment and generating functions; covariance and correlation; iterated moments; common parametric distributions.
  3. Sampling and estimation: Sample mean and variance; large sample theories; properties of point estimators; method of moments; maximum likelihood; interval estimation and confidence interval.
  4. Hypothesis testing: Null and composite hypotheses; test statistic; type I/II errors; power functions; p-value; duality principle; hypothesis tests regarding means, variances and proportions.

Sets, events and probability

Sets and events

  1. A sample space is a set containing all possible outcomes.
    An event is a subset of the sample space. An empty event is the empty set.
  2. For two or more sets, the intersection operator $\cap$ extracts elements common to both sets.
    The intersection of sets cannot have more elements than the individual sets.
  3. For two or more sets, the union operator $\cup$ combines elements from
    both sets.
    The union of sets cannot have fewer elements than the individual sets.
  4. Two sets $E_1$ and $E_2$ are disjoint if $E_1 \cap E_2 = \emptyset$ (nothing in common).
  5. The sets $E_1,E_2,\dots,E_n$:
    • are mutually exclusive if $E_i \cap E_j = \emptyset$ for all $i \ne j$
    • are exhaustive if $E_1 \cup \dots \cup E_n = \Omega$, (make up the sample space);
    • form a partition if the above two properties are true.
  6. The complement of a set $E$ is a set that contains all elements not in $E$, denoted as $E^c$.

Probability of events

  1. The probability operator $\mathbb{P}$ assigns a number to each event to denote its “likelihood” of happening.
    • $\mathbb{P}(\emptyset) = 0, \mathbb{P}(\Omega) = 1$;
    • $\mathbb{P}(E) \ge 0$ for any event $E$;
    • $\mathbb{P}(E) + \mathbb{P}(E^c) = 1$;
    • For mutually exclusive events $E_1,\dots,E_n$, $\mathbb{P}$ is additive, i.e., $\mathbb{P}(E_1 \cup \dots \cup E_n) = \mathbb{P}(E_1) + \dots + \mathbb{P}(E_n)$.
  2. The inclusion-exclusion formula is extremely useful to convert the probabilities of $\cup$ to $\cap$, and vice versa. For two events, $\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)-\mathbb{P}(A\cap B)$; analogous formulas hold for three or more events.
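As a quick sanity check, the two-event inclusion-exclusion identity can be verified with exact fractions (a minimal sketch; the die-roll events are illustrative):

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die.
omega = set(range(1, 7))
A = {2, 4, 6}          # "even number"
B = {4, 5, 6}          # "at least 4"

def prob(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event), len(omega))

lhs = prob(A | B)                          # P(A ∪ B)
rhs = prob(A) + prob(B) - prob(A & B)      # P(A) + P(B) - P(A ∩ B)
assert lhs == rhs == Fraction(2, 3)
```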

Conditional probability

Probabilities conditional on given information

Sometimes, we deal with probabilities on the condition that we know something in advance.

Calculating conditional probabilities

  1. Definition: Conditional probability
    For two events $A$ and $B$, the conditional probability of event $A$ given the occurrence of event $B$ is written as $\mathbb{P}(A|B)$ and calculated as $\mathbb{P}(A|B)=\dfrac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}$, if $\mathbb{P}(B)>0$.
    This definition simply restates that $\mathbb{P}(A\cap B)= \mathbb{P}(B)\mathbb{P}(A|B)$.
  2. Theorem: Law of total probability
    For events $B_1,\dots, B_n$ that form a partition (i.e., mutually exclusive &
    exhaustive) and event $A$, $\mathbb{P}(A)=\sum_{i=1}^n \mathbb{P}(A|B_i)\mathbb{P}(B_i)$.
  3. Theorem: Multiplication rule
    For events $B_1,\dots,B_n$, $\mathbb{P}(B_1\cap B_2\cap\dots\cap B_n)=\mathbb{P}(B_1)\,\mathbb{P}(B_2|B_1)\,\mathbb{P}(B_3|B_1\cap B_2)\cdots\mathbb{P}(B_n|B_1\cap\dots\cap B_{n-1})$.

Independent events

  1. Two events $A$ and $B$ are said to be independent (written as $A \perp\kern-5pt\perp B$) if $\mathbb{P}(A) = \mathbb{P}(A|B)$, i.e., occurrence of $B$ does not affect the chances of $A$ happening. This implies $\mathbb{P}(A \cap B) = \mathbb{P}(A) \mathbb{P}(B)$.
    This can be extended to more than two events: Events $A_1,\dots,A_n$ are mutually independent if and only if $\mathbb{P}(A_{k_1}\cap A_{k_2}\cap\dots\cap A_{k_m})=\mathbb{P}(A_{k_1})\mathbb{P}(A_{k_2})\cdots\mathbb{P}(A_{k_m})$ for every combination of distinct indices $k_1,k_2,\dots,k_m$ and $m\le n$.

Bayes’ rule

  1. Theorem: Bayes’ rule
    For two events $A$ and $B$ with $\mathbb{P}(A)\gt 0$ and $\mathbb{P}(B)\gt 0$, $\mathbb{P}(B|A)=\dfrac{\mathbb{P}(A|B)\mathbb{P}(B)}{\mathbb{P}(A)}$. $\mathbb{P}(B)$ is known as the prior probability and $\mathbb{P}(B|A)$ the posterior probability.
    $\mathbb{P}(B)=\mathbb{P}(B|A)\iff \mathbb{P}(A)=\mathbb{P}(A|B)$, i.e., $A$ and $B$ are independent and thus $A$ adds no information on $B$.
  2. In general, if $B_1,\dots,B_n$ constitute a partition, then $\mathbb{P}(B_j|A)=\dfrac{\mathbb{P}(A|B_j)\mathbb{P}(B_j)}{\sum_{i=1}^n \mathbb{P}(A|B_i)\mathbb{P}(B_i)}$.
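The interplay of the prior, the law of total probability, and Bayes’ rule can be shown in a small numeric sketch (the diagnostic-test numbers below are purely illustrative assumptions):

```python
from fractions import Fraction

# Hypothetical diagnostic test; all numbers are illustrative assumptions.
p_disease = Fraction(1, 100)             # prior P(B)
p_pos_given_disease = Fraction(95, 100)  # P(A|B)
p_pos_given_healthy = Fraction(5, 100)   # P(A|B^c)

# Law of total probability gives the denominator P(A):
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule gives the posterior P(B|A):
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
assert p_disease_given_pos == Fraction(19, 118)  # about 0.161
```

Despite the high test accuracy, the posterior is small because the prior is small.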

Random variables and distributions

Random variables

  1. A random variable is a function that maps each element of the sample space to a real number.

    A random variable is realized when we observe its value. We typically use capital letters (e.g., $X$, $Y$) to denote random variables and small letters (e.g., $x$, $y$) to denote their realizations.

    There are three main types of random variables — discrete, continuous, and mixed.

  2. Discrete random variables — pmf
    A random variable $X$ is discrete if it can only take on a countable (possibly countably infinite) number of values.

    Definition: Probability mass function
    For a discrete random variable $X$, the probability mass function or pmf is defined as

    $p_X(x)=\mathbb{P}(X=x)$

    for $x\in X(\Omega)$, where $X(\Omega)$ is the set of all possible values of $X$.
    A valid pmf has the following properties:

    • $p_X(x)\ge 0$ for all $x$, and $\sum_{x\in X(\Omega)}p_X(x)=1$
    • For any subset $A\subset X(\Omega)$, $\mathbb{P}(X\in A)=\sum_{x\in A}p_X(x)$.
  3. Discrete random variables — cdf
    Definition: Cumulative distribution function
    For a discrete random variable $X$, the cumulative distribution function or cdf is defined as $F_X(x)=\mathbb{P}(X\le x)=\sum_{t\le x}p_X(t)$ for $x \in \mathbb{R}$. It is often shortened as the distribution function of $X$.
    A valid cdf has the following properties:
    • $F_X(a)\le F_X(b)$ if $a\le b$ (non-decreasing)
    • $F_X(x)$ is right-continuous
    • $\lim_{x\to-\infty}F_X(x)=0$ and $\lim_{x\to\infty}F_X(x)=1$
  4. Continuous random variables — pdf
    Definition: Continuous random variable
    A random variable $X$ is (absolutely) continuous if there exists a non-negative function $f$ defined on the real line such that

    $\mathbb{P}(a\le X\le b)=\int_a^b f(x)\,\mathrm{d}x$

    for every $a\le b$. The function $f(x)$ is known as the probability density function (pdf) of $X$.

    A valid pdf has the following properties:

    • $f(x)\ge 0$ for all $x$.
    • $\int_A f(x)\mathrm{d}x=\mathbb{P}(X\in A)$ where $A$ is any subset of $\mathbb{R}$
  5. Continuous random variables — cdf
    Definition: Cumulative distribution function
    For a continuous random variable $X$, the (cumulative) distribution function or cdf is defined as $F_X(x)=\mathbb{P}(X\le x)=\int_{-\infty}^x f(t)\,\mathrm{d}t$ for $x\in\mathbb{R}$.
  6. Mixed random variables
    A random variable can also have both a discrete and a continuous part. This is known as a mixed random variable, which has probability masses at some locations and densities at other locations.

More on Distributions and Moments

Bivariate distributions

  1. Joint distributions
    • Definition: Joint cumulative distribution function for 2 variables
      For two random variables $X$ and $Y$, the joint (cumulative) distribution function or joint cdf is defined as $F_{X,Y}(x,y)=\mathbb{P}(X\le x, Y\le y)$ for $(x,y)\in \mathbb{R}^2$.
    • Definition: Joint probability mass function $X$ and $Y$ are jointly discrete if there exists a joint probability mass function (joint pmf) such that $p_{X,Y}(x,y)=\mathbb{P}(X=x, Y=y)$.
    • Definition: Joint probability density function $X$ and $Y$ are jointly continuous if there exists a joint probability density function (joint pdf) such that $\mathbb{P}((X,Y)\in A)=\iint_A f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y$ for any subset $A\subset\mathbb{R}^2$.
  2. Marginal distributions
    • Definition: Marginal pmf/pdf
      For $X$ and $Y$ jointly discrete, the marginal pmf of $X$ and $Y$ are respectively given by $p_X(x)=\sum_y p_{X,Y}(x,y)$ and $p_Y(y)=\sum_x p_{X,Y}(x,y)$. For $X$ and $Y$ jointly continuous, the marginal pdf of $X$ and $Y$ are respectively given by $f_X(x)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}y$ and $f_Y(y)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}x$.
  3. Conditional distributions
    • Definition: Conditional pmf/pdf
      For $X$ and $Y$ jointly discrete, the conditional pmf of $Y$ given $X$ is $p_{Y|X}(y|x)=\dfrac{p_{X,Y}(x,y)}{p_X(x)}$ if $p_X(x)\gt 0$.
      For $X$ and $Y$ jointly continuous, the conditional pdf of $Y$ given $X$ is $f_{Y|X}(y|x)=\dfrac{f_{X,Y}(x,y)}{f_X(x)}$ if $f_X(x)\gt 0$.
    • The conditional cdf can be obtained from the conditional pmf/pdf: $F_{Y|X}(y|x)=\sum_{t\le y}p_{Y|X}(t|x)$ in the discrete case, or $\int_{-\infty}^{y} f_{Y|X}(t|x)\,\mathrm{d}t$ in the continuous case.
  4. Independence of random variables
    • Definition: Independent random variables
      Two random variables $X$ and $Y$ are independent if and only if the joint pmf/pdf is equal to the product of the marginal pmf’s/pdf’s, i.e., $p_{X,Y}(x,y)=p_X(x)p_Y(y)$ (or $f_{X,Y}(x,y)=f_X(x)f_Y(y)$) for all possible values of $x$ and $y$.
    • The above definition also works for cdf’s, i.e.,
      $F_{X,Y}(x,y) = F_X(x) F_Y(y)$ for all $x$ and $y$.
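To make the joint/marginal/conditional definitions concrete, here is a small sketch that computes marginals, a conditional pmf, and checks independence for a hypothetical joint pmf on $\{0,1\}^2$ (the table values are illustrative):

```python
from fractions import Fraction
from itertools import product

F = Fraction
# A hypothetical joint pmf on {0,1} x {0,1}; the values are illustrative.
joint = {(0, 0): F(1, 8), (0, 1): F(3, 8), (1, 0): F(1, 8), (1, 1): F(3, 8)}
assert sum(joint.values()) == 1    # a valid joint pmf sums to 1

# Marginals: sum the joint pmf over the other variable.
p_X = {x: sum(joint[(x, y)] for y in (0, 1)) for x in (0, 1)}
p_Y = {y: sum(joint[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Conditional pmf of Y given X = 0: p_{Y|X}(y|0) = p_{X,Y}(0,y) / p_X(0).
p_Y_given_X0 = {y: joint[(0, y)] / p_X[0] for y in (0, 1)}

# Independence: the joint pmf factorizes at every point.
independent = all(joint[(x, y)] == p_X[x] * p_Y[y]
                  for x, y in product((0, 1), repeat=2))
assert independent
```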

Expectations and moments

Mathematical expectations

  1. Definition: Expectation
    • The expectation of a random variable $X$ (written as $\mathbb{E}(X)$) is defined as $\mathbb{E}(X)=\sum_x x\,p_X(x)$ if $X$ is discrete, and $\mathbb{E}(X)=\int_{-\infty}^{\infty} x f(x)\,\mathrm{d}x$ if $X$ is continuous.
    • The expectation of $g(X)$, a (known) function of $X$, can be defined similarly: $\mathbb{E}[g(X)]=\sum_x g(x)\,p_X(x)$ or $\int_{-\infty}^{\infty} g(x) f(x)\,\mathrm{d}x$.
    • We say that the expectation (of $X$ or $g(X)$) does not exist if the sum or integral diverges.
    • Properties of the expectation operator:
      1. $\mathbb{E}(aX+b)=a\mathbb{E}(X)+b$ for any constants $a$, $b$ and random variable $X$. We call $\mathbb{E}$ a linear operator.
      2. If $X\le Y$ for all possible outcomes in the sample space, then $\mathbb{E}(X) \le \mathbb{E}(Y)$.
      3. If $\mathbb{E}|X^a|$ exists for some $a \gt 0$, then $\mathbb{E}|X^b|$ exists for all $0 \lt b \lt a$. This also implies the existence of $\mathbb{E}(X^b)$.
  2. Moments
    • Definition: Moments
      The $n$th raw moment of a random variable $X$ is defined as $\mathbb{E}(X^n)$, if it exists. The first raw moment is also known as the mean of $X$, often denoted by $\mu$.
      The $n$th central moment of a random variable $X$ is defined as $\mathbb{E}[(X-\mu)^n]$, if it exists.
  3. Means, variances and standard deviations
    • Raw moments are moments about the origin; central moments are moments about the mean.
    • Definition: Summary measures of a distribution
      The mean of $X$, $\mu$ (or $\mathbb{E}(X)$), measures the central tendency of $X$.
      The second central moment, $\mathbb{E}[(X-\mu)^2]$, is denoted by $\sigma^2$ or $\mathrm{Var}(X)$ and is known as the variance of $X$.
      The square root of $\sigma^2$, $\sigma$, is known as the standard deviation of $X$ and has the same unit as $X$. Both $\sigma$ and $\sigma^2$ measure the dispersion (spread) of $X$ about the mean.
    • The variance is equal to the second raw moment minus the square of the mean: $\mathrm{Var}(X)=\mathbb{E}(X^2)-\mu^2$.
    • $\mathrm{Var}(a X + b)=a^2 \mathrm{Var}(X)$ for any constants $a$, $b$ and random variable $X$.
  4. Higher moments
    • Definition: Skewness (third moment)
      The third central moment provides a measure of the skewness (asymmetry) of the distribution. The coefficient of skewness is defined as $\dfrac{\mathbb{E}[(X-\mu)^3]}{\sigma^3}$. It is positive if the distribution is right-skewed, and negative if it is left-skewed.
    • Definition: Kurtosis (fourth moment)
      The fourth central moment provides a measure of the kurtosis (tailedness) of the distribution. The coefficient of kurtosis is defined as $\dfrac{\mathbb{E}[(X-\mu)^4]}{\sigma^4}$. A leptokurtic distribution has fat tails (kurtosis > 3), and a platykurtic distribution has thin tails (kurtosis < 3).
  5. Moment-generating functions
    Definition: Moment-generating function
    The moment-generating function (mgf) of a random variable $X$ is defined as $M_X(t)=\mathbb{E}(e^{tX})$, if it exists, with $t$ the argument of the mgf. It is possible that $M_X(t)$ is finite only on a subset of $\mathbb{R}$.
    This function is “moment-generating” in the sense that we can obtain moments from it: $\mathbb{E}(X^n)=M_X^{(n)}(0)$, the $n$th derivative of $M_X$ evaluated at $t=0$. From the definition, we obtain that
    $M_{aX+b}(t) = \mathbb{E}[e^{t(aX+b)}] = e^{bt}\mathbb{E}(e^{atX}) = e^{bt}M_X(at)$ for constants $a,b$ , if $M_X(at)$ exists.
  6. Moments of functions of random variables
    For a function $g(X,Y)$ of two variables, the expectation is equal to $\mathbb{E}[g(X,Y)]=\sum_x\sum_y g(x,y)\,p_{X,Y}(x,y)$ in the discrete case, or $\iint g(x,y)\,f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y$ in the continuous case. Note the following useful properties, where $a,b$ are constants and $g,h$ are (known) functions.
    1. $\mathbb{E}[a\cdot g(X) + b \cdot h(Y)] = a\mathbb{E}[g(X)]+b\mathbb{E}[h(Y)]$ (linearity)
    2. If $X$ and $Y$ are independent, then $\mathbb{E}[g(X)\cdot h(Y)]=\mathbb{E}[g(X)]\cdot \mathbb{E}[h(Y)]$
    3. If $X$ and $Y$ are independent, then $M_{aX+bY}(t)=M_X(at)M_Y(bt)$
  7. Covariance and correlation
    Definition: Covariance and correlation
    For random variables $X,Y$ with means $\mu_X,\mu_Y$ and standard deviations $\sigma_X,\sigma_Y$, the covariance is defined by

    $\mathrm{Cov}(X,Y)=\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]=\mathbb{E}(XY)-\mu_X\mu_Y$

    The correlation coefficient is defined by

    $\rho_{XY}=\dfrac{\mathrm{Cov}(X,Y)}{\sigma_X\sigma_Y}$

    It can be shown that $-1\le\rho_{XY}\le 1$ for any $X,Y$ such that $\rho_{XY}$ exists.

    Random variables having positive (negative) covariance/correlation coefficient are known as positively (negatively) correlated.

    Some properties of the covariance/correlation of two random variables:

    1. For independent variables, $Cov(X,Y)=0$. The reverse is not true!
    2. $Var(X)=Cov(X,X)$.
    3. $Cov(X,Y)=Cov(Y,X)$.
    4. $Cov(aX+b,cY+d)=Cov(aX,cY)=acCov(X,Y)$.
    5. $Var(aX+bY)=a^2Var(X)+b^2Var(Y)+2abCov(X,Y)$.
    6. More generally, $Var\left(\sum_i a_iX_i\right)=\sum_i a_i^2 Var(X_i)+2\sum_{i\lt j}a_ia_j Cov(X_i,X_j)$.
    7. $Cor(aX+b,cY+d)=sign(ac)Cor(X,Y)$,where $sign(ac)=-1$ if $ac\lt 0$ and 1 if $ac\gt 0$.
  8. Conditional expectations
    The conditional mean of $Y$ given $X=x$ is $\mathbb{E}(Y|X=x)=\sum_y y\,p_{Y|X}(y|x)$ (or $\int y\,f_{Y|X}(y|x)\,\mathrm{d}y$). The conditional variance can be computed as $\mathbb{E}(Y^2|X=x)-[\mathbb{E}(Y|X=x)]^2$.
  9. Iterated moments
    Theorem: Law of total expectation/variance
    For random variables $X,Y$, we have
    $\mathbb{E}(Y)=\mathbb{E}[\mathbb{E}(Y|X)]$;
    $\mathrm{Var}(Y)=\mathbb{E}[\mathrm{Var}(Y|X)]+\mathrm{Var}[\mathbb{E}(Y|X)]$.
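Both laws can be verified exactly on a toy joint pmf (an illustrative sketch; the joint table is made up):

```python
from fractions import Fraction

F = Fraction
# Small illustrative joint pmf of (X, Y).
joint = {(0, 1): F(1, 4), (0, 2): F(1, 4), (1, 2): F(1, 4), (1, 4): F(1, 4)}

def E(g):
    """Expectation of g(X, Y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

p_X = {x: sum(p for (x2, _), p in joint.items() if x2 == x) for x in (0, 1)}

def E_Y_given(x):      # conditional mean E(Y | X = x)
    return sum(p * y for (x2, y), p in joint.items() if x2 == x) / p_X[x]

def Var_Y_given(x):    # conditional variance Var(Y | X = x)
    m = E_Y_given(x)
    return sum(p * (y - m) ** 2 for (x2, y), p in joint.items() if x2 == x) / p_X[x]

# Law of total expectation: E(Y) = E[E(Y|X)]
lhs = E(lambda x, y: y)
rhs = sum(p_X[x] * E_Y_given(x) for x in (0, 1))
assert lhs == rhs

# Law of total variance: Var(Y) = E[Var(Y|X)] + Var[E(Y|X)]
var_Y = E(lambda x, y: y ** 2) - lhs ** 2
e_var = sum(p_X[x] * Var_Y_given(x) for x in (0, 1))
var_e = sum(p_X[x] * E_Y_given(x) ** 2 for x in (0, 1)) - rhs ** 2
assert var_Y == e_var + var_e
```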

Some common distributions

Discrete distribution — Uniform(a, b)

For integers $a$ and $b$ with $a \le b$, the discrete uniform distribution puts equal point masses at $a, a + 1, a + 2, \dots , b$.

pmf: $p_X(x)=\dfrac{1}{b-a+1}$ for $x=a,a+1,\dots,b$

cdf: $F_X(x)=\dfrac{\lfloor x\rfloor-a+1}{b-a+1}$ for $a\le x\le b$

Mean: $\dfrac{a+b}{2}$

Variance: $\dfrac{(b-a+1)^2-1}{12}$

mgf: (omitted)

Discrete distribution — Bernoulli(p)

The Bernoulli distribution models the number of successes of a single trial with success probability $p\in [0,1]$

pmf: $p_X(x)=p^x(1-p)^{1-x}$ for $x\in\{0,1\}$

cdf: $F_X(x)=0$ for $x\lt 0$; $1-p$ for $0\le x\lt 1$; $1$ for $x\ge 1$

Mean: $p$

Variance: $p(1-p)$

mgf: $M_X(t)=1-p+pe^t$

Discrete distribution — Binomial(n,p)

The Binomial distribution models the number of successes of n independent trials, each with success probability $p \in [0, 1]$.

pmf: $p_X(x)=\binom{n}{x}p^x(1-p)^{n-x}$ for $x=0,1,\dots,n$

cdf: (no simple expression)

Mean: $np$

Variance: $np(1-p)$

mgf: $M_X(t)=(1-p+pe^t)^n$

$\binom{n}{x}=n!/[x!(n-x)!]$ is the binomial coefficient.

If $X_1,\dots,X_n$ are independent and identically distributed as Bernoulli($p$), then the sum $\sum_{i=1}^n X_i\sim\text{Binomial}(n,p)$.

Discrete distribution — Poisson(λ)

The Poisson distribution arises in two contexts: (1) Number of arrivals (occurrences) in a specific time period; (2) Approximation to
the binomial distribution. It has a single parameter $\lambda \gt 0$.

pmf: $p_X(x)=\dfrac{e^{-\lambda}\lambda^x}{x!}$ for $x=0,1,2,\dots$

cdf: (no simple expression)

Mean: $\lambda$

Variance: $\lambda$

mgf: $M_X(t)=e^{\lambda(e^t-1)}$

$\lambda$ is known as the rate parameter. It can be shown that, if the time between any two consecutive arrivals (interarrival time) is independently exponentially distributed with mean $1/\lambda$, then the number of arrivals by time 1 follows $\text{Poisson}(\lambda)$.

If $X_1,X_2,\dots,X_n$ are independent and each has a $\text{Poisson}(\lambda_i)$ distribution (i.e., the rate can be different for each $X_i$), then the sum $\sum_i X_i\sim\text{Poisson}(\sum_i\lambda_i)$

If the binomial parameters $n\rightarrow \infty,p\rightarrow 0$ but $ np\rightarrow \lambda $, then $\text{Binomial}(n,p)\rightarrow \text{Poisson}(\lambda)$. The approximation works well if $n\gt 100$ and $np\lt 10$.
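A quick numerical check of the approximation (a sketch; the choice $n=1000$, $p=0.005$ is illustrative):

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

n, p = 1000, 0.005          # large n, small p, np = 5
lam = n * p

# The two pmf's should be close at every point, e.g. at x = 3:
diff = abs(binom_pmf(3, n, p) - poisson_pmf(3, lam))
assert diff < 1e-3
```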

Discrete distribution — NegBin(r, p) & Geom(p)

The negative binomial distribution models the number of failures before r successes are achieved, with trials independent of each other and having success probability $p\in(0,1]$

pmf: $p_X(x)=\dbinom{x+r-1}{x}p^r(1-p)^x$ for $x=0,1,2,\dots$

cdf: (no simple expression)

Mean: $\dfrac{r(1-p)}{p}$

Variance: $\dfrac{r(1-p)}{p^2}$

mgf: $M_X(t)=\left(\dfrac{p}{1-(1-p)e^t}\right)^r$ for $(1-p)e^t\lt 1$

The special case of $r=1$ is known as the geometric distribution.

If $X_i\sim\text{NegBin}(r_i,p)$ are independent random variables, then $\sum_i X_i\sim\text{NegBin}(\sum_i r_i,p)$.

The geometric distribution is the only discrete distribution that is memoryless.

Continuous distribution — Uniform(a, b)

For real numbers $a$ and $b$ with $a\lt b$, the continuous uniform distribution has a constant density over $[a,b]$.

pdf: $f(x)=\dfrac{1}{b-a}$ for $a\le x\le b$

cdf: $F(x)=\dfrac{x-a}{b-a}$ for $a\le x\le b$

Mean: $\dfrac{a+b}{2}$

Variance: $\dfrac{(b-a)^2}{12}$

mgf: $M_X(t)=\dfrac{e^{bt}-e^{at}}{(b-a)t}$ for $t\ne 0$, and $M_X(0)=1$

Continuous distribution — Beta(α, β)

The beta distribution generalizes the uniform distribution on the [0, 1] interval. For parameters $\alpha,\beta\gt 0$, it has the following quantities:

pdf: $f(x)=\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$ for $0\le x\le 1$

cdf: (no simple expression)

Mean: $\dfrac{\alpha}{\alpha+\beta}$

Variance: $\dfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$

mgf: (no simple expression)

$\Gamma(t)=\int_0^{\infty}x^{t-1}e^{-x}\mathrm{d}x$ is the gamma function.
Note that $\Gamma(t)=(t-1)\Gamma(t-1)$ and $\Gamma(n)=(n-1)!$ for integer $n$.

The $\text{Beta}(1, 1)$ and $\text{Uniform}(0, 1)$ distributions are identical.

A beta distribution is left-skewed if $\alpha\gt\beta$ (large mean) and right-skewed if $\alpha\lt\beta$ (small mean).

If $X_1, \dots , X_n$ are independent $\text{Uniform}(0, 1)$ random variables, then the $k$th order statistic (i.e., $k$th smallest number among the $X_i$’s) has a $\text{Beta}(k, n + 1 − k)$ distribution.

Continuous distribution — Gamma(α, β) & Exponential(β)

The gamma distribution has connections with the sum of interarrival times mentioned above for the Poisson distribution.

For parameters $\alpha,\beta\gt 0$,

pdf: $f(x)=\dfrac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$ for $x\gt 0$

cdf: (no simple expression)

Mean: $\dfrac{\alpha}{\beta}$

Variance: $\dfrac{\alpha}{\beta^2}$

mgf: $M_X(t)=\left(\dfrac{\beta}{\beta-t}\right)^\alpha$ for $t\lt\beta$

$\alpha$ is known as the shape parameter and $\beta$ the rate parameter.

The special case of $\alpha=1$ is known as the exponential distribution. The corresponding quantities are:

pdf: $f(x)=\beta e^{-\beta x}$ for $x\gt 0$

cdf: $F(x)=1-e^{-\beta x}$ for $x\gt 0$

Mean: $\dfrac{1}{\beta}$

Variance: $\dfrac{1}{\beta^2}$

mgf: $M_X(t)=\dfrac{\beta}{\beta-t}$ for $t\lt\beta$

If $X_1,\dots,X_n$ are independent $\text{Exponential}(\beta)$ random variables, then $\sum_{i=1}^n X_i \sim\mathrm{Gamma}(n,\beta)$

If $X_i\sim\mathrm{Gamma}(\alpha_i,\beta)$ are independent random variables, then $\sum_i X_i \sim\mathrm{Gamma}(\sum_i \alpha_i,\beta)$

If $X\sim\mathrm{Gamma}(\alpha,\beta)$, then $cX\sim\mathrm{Gamma}(\alpha,\beta/c)$ for any constant $c\gt 0$

The exponential distribution is the only continuous distribution that is memoryless, i.e., the distribution of $X-m$ given $X\ge m$ is the same exponential.

Continuous distribution — Chi-squared($\nu$)

The chi-squared ($\chi^2$) distribution has a single parameter $\nu\gt 0$, known as the degrees-of-freedom parameter.

pdf: $f(x)=\dfrac{1}{2^{\nu/2}\Gamma(\nu/2)}x^{\nu/2-1}e^{-x/2}$ for $x\gt 0$

cdf: (no simple expression)

Mean: $\nu$

Variance: $2\nu$

mgf: $M_X(t)=(1-2t)^{-\nu/2}$ for $t\lt 1/2$

The $\chi^2(\nu)$ distribution is equivalent to the $\mathrm{Gamma}(\nu/2,1/2)$ distribution.

If $X_1,\dots,X_n$ are independent $N(0,1)$ random variables, then $\sum_{i=1}^n X_i^2 \sim \chi^2(n)$.

If $X_i\sim \chi^2(\nu_i)$ are independent random variables, then $\sum_i X_i \sim \chi^2(\sum_i\nu_i)$. This follows from the same property of the gamma distribution.

Continuous distribution — Normal($\mu$,$\sigma^2$)

The normal distribution (or Gaussian distribution) is the cornerstone of statistics. For parameters $\mu$ and $\sigma^2\gt 0$, it has the following quantities:

pdf: $f(x)=\dfrac{1}{\sigma\sqrt{2\pi}}\exp\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$ for $x\in\mathbb{R}$

cdf: (no simple expression)

Mean: $\mu$

Variance: $\sigma^2$

mgf: $M_X(t)=\exp\left(\mu t+\dfrac{\sigma^2t^2}{2}\right)$

The Normal(0,1) or N(0,1) distribution is known as the standard normal distribution. Its cdf is often denoted by the Greek letter $\Phi$.

If $X\sim N(\mu,\sigma^2)$, then $\dfrac{X-\mu}{\sigma}\sim N(0,1)$. This is known as standardization.

If $X\sim N(\mu,\sigma^2)$, then $aX+b\sim N(a\mu+b,\,a^2\sigma^2)$ for any constants $a,b$.

If $X_1,\dots,X_n$ are independent $N(\mu_i,\sigma_i^2)$ random variables, then $\sum_i X_i\sim N(\sum_i\mu_i,\sum_i\sigma_i^2)$.

Normal approximation to the binomial: Binomial(n,p) can be approximated by N(np,np(1-p)) when n is large. In fact, as $n\rightarrow \infty$, we have $\dfrac{X_n-np}{\sqrt{np(1-p)}}\overset{d}{\rightarrow}N(0,1)$ for $X_n\sim\text{Binomial}(n,p)$.

Several other distributions also approach the normal in the limit.

  • Poisson($\lambda$) can be approximated by N($\lambda$,$\lambda$) if $\lambda$ is large.
  • Gamma($\alpha$,$\beta$) can be approximated by $N(\alpha/\beta,\alpha/\beta^2)$ if $\alpha$ is large.
  • NegBin(r,p) can be approximated by $N(r(1-p)/p,r(1-p)/p^2)$ if r is large.
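The binomial case can be checked numerically; the sketch below compares the exact binomial cdf with the normal approximation, adding a continuity correction (a standard refinement not discussed above; the choice $n=400$, $p=0.5$ is illustrative):

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 400, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))   # approximating N(np, np(1-p))

# Exact P(X <= 210) for X ~ Binomial(400, 0.5):
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(211))

# Normal approximation with a continuity correction (evaluate at 210.5):
approx = NormalDist(mu, sigma).cdf(210.5)
assert abs(exact - approx) < 1e-3
```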

Continuous distribution — honourable mention

In statistics you will often hear of the t and F distributions. They result from combining the independent random variables mentioned above, for example:

$T=\dfrac{Z}{\sqrt{W/\nu}}$ gives a $t(\nu)$-distributed random variable, where $Z\sim N(0,1),W\sim \chi^2(\nu)$ and $Z\perp\kern-5pt\perp W$, and

$F=\dfrac{X_1/\nu_1}{X_2/\nu_2}$ is $F(\nu_1,\nu_2)$-distributed, where $X_1\sim\chi^2(\nu_1)$, $X_2\sim\chi^2(\nu_2)$ and $X_1\perp\kern-5pt\perp X_2$

Sampling and Estimation

Sampling distributions

  • A population is the complete set of items or events of interest. A sample is a subset of outcomes collected.
  • A random sample is a sequence of i.i.d. random variables from a
    population distribution. Let $X_1,X_2,\dots,X_n$ denote a random
    sample of size $n$.
  • For a random sample $X_1,\dots,X_n$, the sample mean $\bar{X}$ and sample variance $S^2$ are respectively defined by $\bar{X}=\dfrac{1}{n}\sum_{i=1}^n X_i$ and $S^2=\dfrac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2$.
  • For a random sample from a population with mean $\mu$ and variance $\sigma^2$, $\mathbb{E}(\bar{X})=\mu$ and $\mathrm{Var}(\bar{X})=\sigma^2/n$.
  • The standard deviation (SD) of $\bar{X}$ , $\sigma/\sqrt{n}$, is known as the standard error (SE).
  • The SE is smaller with larger $n$ — this is intuitive as a larger sample will allow us to estimate $\mu$ more precisely.
  • For a random sample from a population with mean $\mu$ and variance $\sigma^2$, $\mathbb{E}(S^2)=\sigma^2$, i.e., the sample variance is unbiased for $\sigma^2$.

    Sampling distributions for the normal distribution

  • For a random sample from the $N(\mu,\sigma^2)$ distribution,

    1. $\bar{X}\sim N(\mu,\sigma^2/n)$
    2. $(n-1)S^2/\sigma^2\sim \chi^2(n-1)$
    3. $\bar{X}$ is independent of $S^2$
  • If $\bar{X}$ and $S^2$ are the size-$n$ sample mean and variance of the $N(\mu,\sigma^2)$ distribution, then $T=\dfrac{\bar{X}-\mu}{S/\sqrt{n}}$ has the t-distribution with $n-1$ degrees of freedom, denoted as $t(n-1)$ (or $t_{n-1}$)
  • Properties of the $t(\nu)$ distribution:
    • pdf: $f(x)=\dfrac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1+\dfrac{x^2}{\nu}\right)^{-(\nu+1)/2}$
    • cdf: (no simple expression)
    • Mean: $\mathbb{E}(X)=0$ if $\nu\gt 1$
    • Variance: $Var(X)=\frac{\nu}{\nu-2}$ if $\nu\gt 2$
    • mgf: (undefined)
  • The $100(1 − \alpha)\%$ confidence interval (CI) for $\mu$ based on a random
    sample from a normal distribution is

    $\left[\bar{x}-t_{n-1,\alpha/2}\dfrac{s}{\sqrt{n}},\ \bar{x}+t_{n-1,\alpha/2}\dfrac{s}{\sqrt{n}}\right]$

    where $\bar{x}$ is the (observed) sample mean, $s$ is the (observed) sample SD, $n$ is the sample size and $t_{n-1,\alpha/2}$ is defined as the value such that $\mathbb{P}(T\gt t_{n-1,\alpha/2})=\alpha/2$ for $T\sim t(n-1)$ (i.e., the $(1-\alpha/2)$ quantile of $T$)

    Large sample theory

  • Weak law of large numbers (WLLN)
    For a random sample with finite (population) mean $\mu$, let $\bar{X}_n=\sum_{i=1}^n X_i/n$ be the sample mean. The weak law of large numbers states that

    $\lim_{n\rightarrow\infty}\mathbb{P}(|\bar{X}_n-\mu|\gt\epsilon)=0$

    for any positive $\epsilon$. In other words, $\bar{X}_n$ converges in probability to $\mu$, written as $\bar{X}_n\overset{P}{\rightarrow}\mu$

  • Strong law of large numbers (SLLN)
    For a random sample with finite (population) mean $\mu$, let $\bar{X}_n=\sum_{i=1}^n X_i/n$ be the sample mean. The strong law of large numbers states that $\mathbb{P}\left(\lim_{n\rightarrow\infty}\bar{X}_n=\mu\right)=1$. In other words, $\bar{X}_n$ converges almost surely to $\mu$, written as $\bar{X}_n\overset{a.s.}{\rightarrow}\mu$
  • Central limit theorem
    For a random sample with finite (population) mean $\mu$ and variance $\sigma^2$, let $\bar{X}_n=\sum_{i=1}^n X_i/n$ be the sample mean. The central limit theorem states that

    $\lim_{n\rightarrow\infty}\mathbb{P}\left(\dfrac{\bar{X}_n-\mu}{\sigma/\sqrt{n}}\le x\right)=\Phi(x)$

    pointwise. In other words, $(\bar{X}_n-\mu)/(\sigma/\sqrt{n})$ converges in distribution to a standard normal random variable, written as $(\bar{X}_n-\mu)/(\sigma/\sqrt{n})\overset{d}{\rightarrow}N(0,1)$
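A small simulation illustrates the CLT: standardized sample means of Uniform(0,1) draws behave like $N(0,1)$ observations (a sketch with an arbitrary seed and illustrative sample sizes):

```python
import random
from statistics import NormalDist, mean

random.seed(42)
n, reps = 200, 2000
# Population: Uniform(0,1), so mu = 1/2 and sigma^2 = 1/12.
mu, sigma = 0.5, (1 / 12) ** 0.5

# Standardize each of `reps` sample means; results should look like N(0,1).
z = [(mean(random.random() for _ in range(n)) - mu) / (sigma / n**0.5)
     for _ in range(reps)]

# Empirical P(Z <= 1) should be close to Phi(1) ~ 0.8413.
emp = sum(zi <= 1 for zi in z) / reps
assert abs(emp - NormalDist().cdf(1)) < 0.03
```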

    Point estimation - properties of estimators

  • Let the parameter be denoted as $\theta$. The point estimator of $\theta$, usually denoted as $\hat{\theta_n}$ where $n$ is the sample size, is a sample statistic used to estimate $\theta$. The realized value of $\hat{\theta_n}$ is called the point estimate.

  • Bias
    The bias of an estimator $\hat{\theta_n}$ is given by $\mathrm{Bias}(\hat{\theta_n})=\mathbb{E}(\hat{\theta_n})-\theta$. If the bias is zero for every possible value of $\theta$, the estimator is unbiased.
    If the bias is not zero but tends to zero as $n\rightarrow \infty$, the estimator is asymptotically unbiased.
    An estimator tends to underestimate the true value if $\mathbb{E}(\hat{\theta_n})\lt\theta$ and overestimate the true value if $\mathbb{E}(\hat{\theta_n})\gt\theta$
  • Mean squared error
    The mean squared error (MSE) of an estimator $\hat{\theta_n}$ is given by $\mathrm{MSE}(\hat{\theta_n})=\mathbb{E}[(\hat{\theta_n}-\theta)^2]=\mathrm{Var}(\hat{\theta_n})+[\mathrm{Bias}(\hat{\theta_n})]^2$. To achieve a low MSE, an estimator needs to be accurate (close to the true value) and precise (with little variability).
    For an unbiased estimator, the MSE is equal to $Var(\hat{\theta_n})$
  • Efficiency
    For two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of $\theta$, $\hat{\theta}_1$ is said to be more efficient than $\hat{\theta}_2$ if $\mathrm{Var}(\hat{\theta}_1)\le\mathrm{Var}(\hat{\theta}_2)$ for all possible values of the true parameter $\theta$.
    If either estimator is biased, it is better to make comparisons via the MSE since it also takes into account the magnitude of the bias.
    An unbiased estimator that has the smallest variance among all other unbiased estimators for all $\theta$ is known as the uniformly minimum variance unbiased estimator, or UMVUE.
  • Consistency
    An estimator $\hat{\theta_n}$ is consistent for $\theta$, if for every $\epsilon\gt0$ we have $\lim_{n\rightarrow\infty}\mathbb{P}(|\hat{\theta_n}-\theta|\gt\epsilon)=0$. In other words, $\hat{\theta_n}\overset{p}{\rightarrow}\theta$ as $n\rightarrow\infty$ if it is consistent.
    This definition is often hard to check. A useful workaround (sufficient but not necessary condition) is that if $\mathrm{MSE}(\hat{\theta_n})\rightarrow 0$ as $n\rightarrow\infty$, then $\hat{\theta_n}$ is consistent for $\theta$.

    Point estimation -methods

  • The method of moments estimates parameters by equating the sample raw moments to the raw moments of the target distribution. They are defined as:
    Sample $r$th raw moment: $\dfrac{1}{n}\sum_{i=1}^n X_i^r$
    $r$th raw moment of the distribution: $\mathbb{E}(X^r)$
  • Method of maximum likelihood
    The method of maximum likelihood considers the pdf/pmf as a likelihood function that is maximized. Some definitions are in order:
    Suppose random variables $X_1,\dots,X_n$ have joint pdf or pmf $f_X(x_1,\dots,x_n;\theta)$, where $\theta$ is a collection of parameters. The likelihood function $L$ is simply $f_X(x_1,\dots,x_n;\theta)$, but viewing it as a function of $\theta$ with $x_1,\dots,x_n$ fixed at their observed values. That is, $L(\theta)=f_X(x_1,\dots,x_n;\theta)$.

    The log-likelihood function is the (natural) logarithm of $L$, i.e., $\ell(\theta)=\ln L(\theta)$.
    Note that $X_1,\dots,X_n$ need not be independent.
    The maximum likelihood estimator (MLE) $\hat{\theta}_{ML}$ of a parameter $\theta$ is the value of $\theta$ that maximizes the likelihood (or log-likelihood) function, that is, $\hat{\theta}_{ML}=\arg\max_{\theta} L(\theta)$.

    If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$, a function of $\theta$ (the invariance property).
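For a Bernoulli($p$) sample the MLE has the closed form $\hat{p}_{ML}=\bar{x}$; the sketch below recovers it by maximizing the log-likelihood over a grid (the data values are illustrative):

```python
from math import log

# Observed Bernoulli(p) sample (illustrative data).
x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def loglik(p):
    # log L(p) = sum_i [x_i log p + (1 - x_i) log(1 - p)]
    return sum(xi * log(p) + (1 - xi) * log(1 - p) for xi in x)

# Maximize over a grid of candidate p values (avoiding 0 and 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=loglik)

# The grid maximizer matches the closed-form MLE, the sample mean.
assert abs(p_hat - sum(x) / len(x)) < 1e-3
```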

    Interval estimation

  • For a random sample used to estimate an unknown parameter $\theta$, let $L(X)$ and $U(X)$ be some functions of the random sample with

    $\mathbb{P}(L(X)\le\theta\le U(X))=1-\alpha$

    where $1-\alpha$ is typically a high probability. The interval $[L(X),U(X)]$ is known as a $100(1-\alpha)\%$ confidence interval (CI) for the parameter $\theta$.

  • The following is a general recipe for finding CI’s:
    1. Establish a pivotal quantity. A pivotal quantity is a function of the random sample and model parameters that has a distribution not involving $\theta$, written as $V(X,\theta)$.
    2. Find some constants $a,b$ such that $\mathbb{P}(a\le V(X,\theta)\le b)=1-\alpha$. Because the distribution of $V(X,\theta)$ does not depend on $\theta$, the constants $a$ and $b$ will also be free of $\theta$.
    3. Solve $a\le V(X,\theta)$ and $V(X,\theta)\le b$ for $\theta$. This will give a lower limit $L(X)$ and an upper limit $U(X)$ such that $\mathbb{P}(L(X)\le\theta\le U(X))=1-\alpha$. The required CI is given by $[L(X),U(X)]$.
  • Interval estimation for means - $N(\mu,\sigma^2)$
    The pivotal quantity $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim N(0,1)$
    The CI for the population mean is given by $[\bar{X}\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}]$
  • Interval estimation for means - $N(\mu,?)$
    The pivotal quantity $\frac{\bar{X}-\mu}{S/\sqrt{n}}\sim t(n-1)$
    The CI for the population mean is given by $[\bar{X}\pm t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}]$
  • Interval estimation for means - $N(\mu_X,\sigma_X^2)$ vs $N(\mu_Y,\sigma_Y^2)$, variances known
    The pivotal quantity is $\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\sigma_X^2/n_X+\sigma_Y^2/n_Y}}\sim N(0,1)$
    The CI for the difference in means is given by $\left[\bar{X}-\bar{Y}\pm z_{\alpha/2}\sqrt{\dfrac{\sigma_X^2}{n_X}+\dfrac{\sigma_Y^2}{n_Y}}\right]$
  • Interval estimation for means - $N(\mu_X,\sigma^2)$ vs $N(\mu_Y,\sigma^2)$, common variance unknown
    The pivotal quantity is $\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{S_p\sqrt{1/n_X+1/n_Y}}\sim t(n_X+n_Y-2)$, where $S_p^2=\dfrac{(n_X-1)S_X^2+(n_Y-1)S_Y^2}{n_X+n_Y-2}$ is the pooled sample variance
    The CI for the difference in means is given by $\left[\bar{X}-\bar{Y}\pm t_{n_X+n_Y-2,\alpha/2}\,S_p\sqrt{\dfrac{1}{n_X}+\dfrac{1}{n_Y}}\right]$
  • Interval estimation for means - $N(\mu_X,\sigma_X^2)$ vs $N(\mu_Y,\sigma_Y^2)$, variances unknown and unequal
    The pivotal quantity is $\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{S_X^2/n_X+S_Y^2/n_Y}}$, which is approximately $t(\nu)$-distributed. The number of degrees of freedom is $\nu=\dfrac{(S_X^2/n_X+S_Y^2/n_Y)^2}{\dfrac{(S_X^2/n_X)^2}{n_X-1}+\dfrac{(S_Y^2/n_Y)^2}{n_Y-1}}$ (the Welch–Satterthwaite approximation). The CI for the difference in means is given by $\left[\bar{X}-\bar{Y}\pm t_{\nu,\alpha/2}\sqrt{\dfrac{S_X^2}{n_X}+\dfrac{S_Y^2}{n_Y}}\right]$
  • Interval estimation for variances - N(?,?)
    The pivotal quantity is $\dfrac{(n-1)S^2}{\sigma^2}\sim\chi^2(n-1)$
    The CI for the population variance is given by $\left[\dfrac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}},\ \dfrac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}\right]$
  • Interval estimation for variances - $N(?,\sigma_X^2)$ vs $N(?,\sigma_Y^2)$
    The pivotal quantity is $\dfrac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2}\sim F(n_X-1,n_Y-1)$. The CI for the ratio of variances $\sigma_X^2/\sigma_Y^2$ is given by $\left[\dfrac{S_X^2/S_Y^2}{F_{n_X-1,n_Y-1,\alpha/2}},\ \dfrac{S_X^2/S_Y^2}{F_{n_X-1,n_Y-1,1-\alpha/2}}\right]$
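The first recipe (normal mean, known variance) can be sketched with the standard library; the data and the assumed known $\sigma$ below are illustrative:

```python
from statistics import NormalDist, mean

# Illustrative data, assumed drawn from a normal population with known sigma = 2.
x = [4.1, 5.2, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7]
sigma, n = 2.0, len(x)

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)       # z_{alpha/2}, about 1.96
half_width = z * sigma / n**0.5
ci = (mean(x) - half_width, mean(x) + half_width)

# The observed sample mean must lie inside its own CI.
assert ci[0] < mean(x) < ci[1]
```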

    Hypothesis Testing

Introduction to hypothesis testing

  • In statistics, a hypothesis test makes a decision between two mutually exclusive statements about the population, known as hypotheses.
  • The null hypothesis, denoted as $H_0$, is a statement that has an established standing or the “standard” that is being put to the test.
  • The alternative hypothesis, denoted as $H_1$ or $H_a$, is a statement that challenges the null hypothesis.
  • A simple hypothesis is one in which the hypothesis statement completely determines a single distribution. Otherwise, it is known as a composite hypothesis.
  • A test statistic is a function of the observed data that is used to construct the condition of a hypothesis test, based on which a decision is made.
  • The rejection (or critical) region is the range of values of the test statistic that, if observed, will lead to the rejection of $H_0$ (and acceptance of $H_1$).
  • The critical value(s) demarcates the rejection region.
  • The probability of making a type I error (rejecting $H_0$ when it is true) is the type I error rate ($\alpha$); it is also known as the significance level.
  • The probability of making a type II error (failing to reject $H_0$ when $H_1$ is true) is the type II error rate ($\beta$). One minus this probability gives the power of the test.
  • The power function of a statistical test gives the probability of rejecting $H_0$ as a function of the true parameter value: $\pi(\theta)=\mathbb{P}_\theta(\text{reject } H_0)$
  • Note that if $\theta\in\Theta_0$, the parameter space of $H_0$, then $\pi(\theta)$ gives the type I error rate.
    If $H_0$ is a composite hypothesis, then we define the size of the test as the maximum possible value of $\pi(\theta)$ for all $\theta\in\Theta_0$.
    A test has significance level $\alpha$ if its size is at most $\alpha$. The significance level and size are equal in many cases.
    The power of a test is the probability of not making a type II error. When $H_1$ is composite, the power at $\theta\in\Theta_1$ is simply $\pi(\theta)$.
  • The p-value of a statistical test is the probability of observing a value of the test statistic at least as inconsistent with $H_0$ as the observed value, if $H_0$ is true. The test rejects $H_0$ if the p-value is less than the significance level.

General steps in hypothesis testing

  1. Formulate a statistical model (distribution if parametric).
  2. Specify the null and alternative hypotheses.
  3. Determine a test statistic $T$. It is typically one with a nice distribution under $H_0$, so that the significance level can be easily obtained.
  4. Determine the significance level $\alpha$.
  5. Collect data and calculate the test statistic. (Note: You must specify the significance level prior to data analysis — no cheating!)
  6. [Rejection region approach] Find the rejection region of $T$ that corresponds to the selected $\alpha$.
    [p-value approach] Calculate the p-value corresponding to the observed test statistic.
  7. If the observed test statistic is in the rejection region (or the p-value is less than $\alpha$), you reject $H_0$ and accept $H_1$. Otherwise you do not reject $H_0$.
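The steps above can be sketched for a one-sample z-test of $H_0:\mu=\mu_0$ with known $\sigma$ (data and parameters are illustrative):

```python
from statistics import NormalDist, mean

# Test H0: mu = 5 against H1: mu != 5, with sigma = 2 known (illustrative data).
x = [4.1, 3.9, 4.3, 4.0, 4.6, 3.8, 4.2, 4.4, 4.1, 3.9]
mu0, sigma, alpha = 5.0, 2.0, 0.05

n = len(x)
z = (mean(x) - mu0) / (sigma / n**0.5)        # test statistic, N(0,1) under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

reject = p_value < alpha                      # reject H0 iff p-value < alpha
```

Here the p-value exceeds 0.05, so $H_0$ is not rejected even though the sample mean is below 5.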

Duality between hypothesis tests and confidence intervals

The $100(1-\alpha)\%$ CI is the set of $\theta_0$ under which $H_0:\theta=\theta_0$ is not rejected at significance level $\alpha$.
Equivalently, if the hypothesized value $\theta_0$ is not sufficiently far away from the point estimate, in the sense that it lies in the CI, the test will not reject $H_0$.
This is known as the duality between hypothesis tests and CI’s.