Chapter 2 Probability Distributions

Chapter Structure

Exponential family distribution:

Exponential family distributions have conjugate priors of fixed functional form.

  • Bernoulli distribution

    Conjugate Prior: Beta distribution

    A single binary experiment: one coin toss

  • Binomial distribution

    Conjugate Prior: Beta distribution

    N binary experiments: toss a coin many times

  • Multinomial distribution

    Conjugate Prior: Dirichlet distribution

    Experiments with multiple outcomes: e.g., rolling a die

  • Gaussian Distribution

    Conjugate Prior of Mean: Gaussian distribution

    Conjugate Prior of Variance: Inverse Gamma distribution

    1 dimension / n dimensions

    marginal distribution / conditional distribution

    Posterior distribution

    Convolution

Parameter estimation:

  • Maximum Likelihood Estimation
  • Sequential estimation
  • Approximate evaluation

Non-informative prior distribution:

  • Student t-distribution
  • Periodic variable
  • GMM (not a member of the exponential family)

Density Estimation

$\text{i.i.d. samples: } X = \{x_1, x_2, \dots, x_n\},\quad \text{probability density: } p(x)=\left\{ \begin{aligned} & \text{parametric} \\ & \text{non-parametric} \end{aligned} \right.$

i.i.d.: independent and identically distributed

Parametric models: the parameters capture all of the model's information, such as the mean and variance of a Gaussian distribution. Training time is needed to estimate the parameters, but the training samples themselves are not needed at inference time.

Non-parametric models: the parameters control only the complexity of the model, such as k in the k-means model.
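
A minimal sketch of the contrast, assuming NumPy; the Gaussian fit and histogram below are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=1000)  # i.i.d. samples

# Parametric: two numbers (mean, variance) summarize everything the
# model knows; the samples can be discarded after fitting.
mu_hat, var_hat = samples.mean(), samples.var()

def gaussian_pdf(x, mu, var):
    # Density of N(mu, var) evaluated at x.
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Non-parametric: a histogram density keeps the data around; the bin
# count only sets model complexity (like k in the note's example).
density, edges = np.histogram(samples, bins=30, density=True)

print(mu_hat, var_hat)                     # roughly 1.0 and 4.0
print(gaussian_pdf(1.0, mu_hat, var_hat))  # parametric density near the mean
```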

Bernoulli distribution

$x\in \{0, 1\},\quad p(x=1|\mu)=\mu,\quad p(x=0|\mu)=1-\mu$

$\text{Bern}(x|\mu) = \mu^x(1-\mu)^{1-x}$

Performing a single binary-variable experiment yields the Bernoulli distribution (also called the binary distribution).

The mean and variance are given by

$E[x]=\sum_{x\in\{0,1\}}x\,p(x)=0\cdot p(x=0)+1\cdot p(x=1)=\mu$

$var[x] = \sum_{x\in\{0,1\}}(x-E[x])^2p(x) = (0-\mu)^2(1-\mu)+(1-\mu)^2\mu=\mu(1-\mu)$
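
A quick numerical check of both formulas, assuming NumPy; the value $\mu=0.3$ is an arbitrary choice:

```python
import numpy as np

mu = 0.3
rng = np.random.default_rng(42)
x = rng.binomial(n=1, p=mu, size=100_000)  # Bernoulli = Binomial with n=1

print(x.mean())  # close to mu = 0.3
print(x.var())   # close to mu * (1 - mu) = 0.21
```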

Given a data set $D=\{x_1, \dots, x_N\}$, we estimate the parameter $\mu$ under the assumption that the observations are drawn independently from $p(x|\mu)$, so that

$p(D|\mu) = \prod_{n=1}^N p(x_n|\mu) = \prod_{n=1}^N \mu^{x_n}(1-\mu)^{1-x_n}$

$\ln p(D|\mu) = \sum_{n=1}^N \ln p(x_n|\mu) = \sum_{n=1}^N \left\{ x_n\ln \mu + (1-x_n)\ln (1-\mu) \right\}$

Setting $\frac{\partial \ln p(D|\mu)}{\partial \mu}=0$,

$\sum_{n=1}^N\left\{\frac{x_n}{\mu} - \frac{1-x_n}{1-\mu}\right\} = 0$

Then,

$\frac{1}{\mu}\sum_{n=1}^N x_n = \frac{1}{1-\mu}\sum_{n=1}^N(1-x_n)$

$(1-\mu)\sum_{n=1}^N x_n = \mu\sum_{n=1}^N(1-x_n) = \mu\left(N-\sum_{n=1}^N x_n\right)$

$\mu_{ML} = \frac{1}{N}\sum_{n=1}^N x_n$
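
A short sanity check of this result, assuming NumPy: the sample mean should maximize the log-likelihood written above (the true $\mu=0.7$ and the grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.7, size=50)  # tosses of a coin with true mu = 0.7

def log_likelihood(mu, x):
    # ln p(D|mu) = sum_n [ x_n ln(mu) + (1 - x_n) ln(1 - mu) ]
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

mu_ml = x.mean()  # closed-form maximizer derived above

# Sanity check: no mu on a fine grid beats the sample mean.
grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmax([log_likelihood(m, x) for m in grid])]
print(mu_ml, best)  # agree up to the grid spacing
```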

Define a function $S = T(x_1, \dots, x_N)$; if $T$ contains no unknown parameters, then $S$ is a statistic of the data.

Suppose the density function depends on a parameter $\theta$. If $p(x|\theta)$ can be written in the form

$p(x|\theta) = h(x)g_{\theta}(\Phi(x))$

where $h(x)$ does not depend on $\theta$, then $\Phi(x)$ is a sufficient statistic for $\theta$.

For the Bernoulli distribution, $p(D|\mu) = \mu^{\sum_n x_n}(1-\mu)^{N-\sum_n x_n}$ depends on the data only through $\sum_n x_n$, so the sum of the samples (equivalently, their mean) is a sufficient statistic, as the sketch below demonstrates.
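
A small demonstration, assuming NumPy: two datasets with the same size and sum yield identical Bernoulli likelihoods for every $\mu$ (the datasets are made up for illustration):

```python
import numpy as np

def likelihood(mu, x):
    # p(D|mu) depends on the data only through sum(x) and len(x).
    s, n = x.sum(), len(x)
    return mu ** s * (1 - mu) ** (n - s)

# Different datasets with the same size and the same sum...
a = np.array([1, 1, 0, 0, 1])
b = np.array([0, 1, 1, 1, 0])

# ...give identical likelihoods for every mu.
for mu in (0.2, 0.5, 0.8):
    print(likelihood(mu, a) == likelihood(mu, b))  # True, True, True
```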

Binomial distribution

$\text{Bin}(m|N, \mu) = \binom{N}{m} \mu^m(1-\mu)^{N-m}$

where

$\binom{N}{m} = \frac{N!}{(N-m)!\,m!}$

Since $m$ is the sum of $N$ independent Bernoulli variables, the mean and variance are given by

$E[m] = \sum_{m=0}^N m\,\text{Bin}(m|N, \mu) = N\mu$

$var[m] = \sum_{m=0}^N(m-E[m])^2\,\text{Bin}(m|N, \mu) = N\mu(1-\mu)$
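
A simulation check, assuming NumPy ($N = 20$ and $\mu = 0.4$ are arbitrary choices):

```python
import numpy as np

N, mu = 20, 0.4
rng = np.random.default_rng(1)
m = rng.binomial(n=N, p=mu, size=200_000)  # number of heads in N tosses each

print(m.mean(), N * mu)            # sample mean vs. N*mu = 8.0
print(m.var(), N * mu * (1 - mu))  # sample variance vs. N*mu*(1-mu) = 4.8
```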

Beta distribution