Chapter 2 Probability Distributions
Chapter Structure
Exponential family distributions:
Every exponential family distribution has a conjugate prior of fixed functional form (a Beta-Bernoulli example of this is worked out just after the outline).
- Bernoulli distribution
  - Conjugate prior: Beta distribution
  - A single binary experiment: toss a coin once
- Binomial distribution
  - Conjugate prior: Beta distribution
  - N binary experiments: toss a coin many times
- Multinomial distribution
  - Conjugate prior: Dirichlet distribution
  - Experiments with multiple outcomes: e.g. throwing a die
- Gaussian distribution
  - Conjugate prior of the mean: Gaussian distribution
  - Conjugate prior of the variance: Inverse-Gamma distribution
  - 1 dimension / n dimensions
  - Marginal distribution / conditional distribution
  - Posterior distribution
  - Convolution
Parameter estimation:
- Maximum Likelihood Estimation
- Sequential estimation
- Approximate evaluation
Non-informative prior distributions:
- Student's t-distribution
- Periodic variables
- GMM (not a member of the exponential family)
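As a concrete illustration of the fixed-form conjugate prior mentioned above (the standard Beta-Bernoulli pairing, not worked out elsewhere in these notes): multiplying the Bernoulli likelihood by a Beta prior gives back a Beta distribution.
$\text{Beta}(\mu|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}$
$p(\mu|D) \propto p(D|\mu)\,\text{Beta}(\mu|a, b) \propto \mu^{m+a-1}(1-\mu)^{N-m+b-1},\quad m=\sum_{n=1}^N x_n$
so the posterior is $\text{Beta}(\mu|a+m,\ b+N-m)$: the same functional form as the prior, with updated parameters.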
Density Estimation
$\text{i.i.d. samples: } X = \{x_1, x_2, \dots, x_N\},\quad \text{probability density } p(x): \begin{cases} \text{parametric} \\ \text{non-parametric} \end{cases}$
i.i.d.: independent and identically distributed
Parametric models: the parameters carry all the information about the model, e.g. the mean and variance of a Gaussian distribution. Training time is needed to estimate the parameters, but the training samples themselves are not needed at inference time.
Non-parametric models: the parameters only control the complexity of the model, e.g. k in the k-means model.
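A minimal sketch of the contrast (assuming NumPy; the Gaussian fit stands in for a parametric model and a plain histogram for a non-parametric one):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=1000)   # i.i.d. training data

# Parametric: two numbers (mean, variance) summarise the whole density,
# so the samples can be discarded after fitting.
mu_hat, var_hat = samples.mean(), samples.var()

def gaussian_pdf(x):
    return np.exp(-(x - mu_hat) ** 2 / (2 * var_hat)) / np.sqrt(2 * np.pi * var_hat)

# Non-parametric: the histogram keeps the samples' shape; the bin count
# only controls the complexity of the estimate.
counts, edges = np.histogram(samples, bins=30, density=True)

def hist_pdf(x):
    idx = np.clip(np.searchsorted(edges, x) - 1, 0, len(counts) - 1)
    return counts[idx]

print(gaussian_pdf(2.0), hist_pdf(2.0))   # both roughly 0.27, the true density at x = 2
```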
Bernoulli distribution
$x\in \{0, 1\},\quad p(x=1|\mu)=\mu,\quad p(x=0|\mu)=1-\mu$
$\text{Bern}(x|\mu) = \mu^x(1-\mu)^{1-x}$
Performing a single binary-variable experiment gives the Bernoulli distribution (also called the binary distribution).
The mean and variance are given by
$E[x]=\sum_{x\in\{0,1\}}x\,p(x)=0\cdot p(x=0)+1\cdot p(x=1)=\mu$
$\mathrm{var}[x] = \sum_{x\in\{0,1\}}(x-E[x])^2 p(x)=E[x]^2 p(x=0)+(1-E[x])^2 p(x=1)=\mu(1-\mu)$
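Both identities are easy to check by simulation; a short sketch assuming NumPy (the value $\mu = 0.3$ is an arbitrary choice):

```python
import numpy as np

mu = 0.3
rng = np.random.default_rng(1)
x = rng.binomial(n=1, p=mu, size=100_000)   # Bernoulli draws with parameter mu

print(x.mean())   # close to E[x] = mu = 0.3
print(x.var())    # close to var[x] = mu * (1 - mu) = 0.21
```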
Given a data set $D = \{x_1, \dots, x_N\}$, estimate the parameter $\mu$ on the assumption that the observations are drawn independently from $p(x|\mu)$, so that
$p(D|\mu) = \prod_{n=1}^N p(x_n|\mu) = \prod_{n=1}^N \mu^{x_n}(1-\mu)^{1-x_n}$
$\ln p(D|\mu) = \sum_{n=1}^N \ln p(x_n|\mu) = \sum_{n=1}^N \left\{ x_n\ln \mu + (1-x_n)\ln (1-\mu) \right\}$
Let $\frac{\partial \ln p(D|\mu)}{\partial \mu}=0$,
$\sum_{n=1}^N\left\{\frac{x_n}{\mu} - \frac{1-x_n}{1-\mu}\right\} = 0$
Then,
$\sum_{n=1}^N\frac{x_n}{\mu} = \sum_{n=1}^N\frac{1-x_n}{1-\mu}$
$(1-\mu)\sum_{n=1}^N x_n = \mu\sum_{n=1}^N(1-x_n) = \mu N - \mu\sum_{n=1}^N x_n$
$\mu_{ML} = \frac{1}{N}\sum_{n=1}^N x_n$
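The closed-form result can be checked against a brute-force maximisation of the log-likelihood; a sketch assuming NumPy (the grid search and the true parameter 0.7 are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(n=1, p=0.7, size=500)        # observations drawn from Bern(x | 0.7)

def log_likelihood(mu):
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

grid = np.linspace(0.001, 0.999, 999)
mu_grid = grid[np.argmax([log_likelihood(m) for m in grid])]   # numerical maximiser
mu_ml = x.mean()                                               # closed-form mu_ML

print(mu_ml, mu_grid)   # agree up to the grid resolution
```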
Define a function $S = T(x_1, \dots, x_N)$; if $T$ involves no unknown parameters, $S$ is a statistic of the data under the distribution.
Suppose the density function depends on a parameter $\theta$. If $p(x|\theta)$ can be written in the form
$p(x|\theta) = h(x)\,g_{\theta}(\Phi(x))$
where $h(x)$ does not depend on $\theta$, then $\Phi(x)$ is a sufficient statistic for $\theta$.
Since $p(D|\mu) = \mu^{\sum_n x_n}(1-\mu)^{N-\sum_n x_n}$ depends on the data only through $\sum_n x_n$, the sum of the samples (equivalently, the sample mean) is a sufficient statistic for the data under the Bernoulli distribution.
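A quick check of that claim (a sketch assuming NumPy): two datasets with the same sum, hence the same mean, give the same Bernoulli likelihood for every $\mu$, so nothing beyond the sum matters for estimating $\mu$.

```python
import numpy as np

def likelihood(data, mu):
    data = np.asarray(data)
    return np.prod(mu ** data * (1 - mu) ** (1 - data))

d1 = [1, 1, 0, 0, 0, 1]   # sum = 3
d2 = [0, 1, 0, 1, 1, 0]   # sum = 3, different ordering

for mu in (0.2, 0.5, 0.8):
    # the two likelihoods match (up to floating-point rounding)
    print(mu, likelihood(d1, mu), likelihood(d2, mu))
```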
Binomial distribution
$\text{Bin}(m|N, \mu) = \binom{N}{m} \mu^m(1-\mu)^{N-m}$
where
$\binom{N}{m} = \frac{N!}{(N-m)!\,m!}$
the mean and variance are given by
$E[m] = \sum_{m=0}^N m\,\text{Bin}(m|N, \mu) = N\mu$
$\mathrm{var}[m] = \sum_{m=0}^N (m-E[m])^2\,\text{Bin}(m|N, \mu) = N\mu(1-\mu)$
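As in the Bernoulli case, both formulas can be confirmed by simulation; a sketch assuming NumPy (N = 10 and $\mu = 0.25$ are arbitrary choices):

```python
import numpy as np

N, mu = 10, 0.25
rng = np.random.default_rng(3)
m = rng.binomial(n=N, p=mu, size=100_000)   # number of successes in N trials

print(m.mean())   # close to N * mu = 2.5
print(m.var())    # close to N * mu * (1 - mu) = 1.875
```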