Probability Distribution

Some Classic Probability Distributions

Bernoulli Distribution

A distribution over a single binary random variable. It is controlled by a single parameter \(\phi \in [0, 1]\), which gives the probability of the random variable being equal to 1.

\begin{align*} P(\mathbf{x} = 1) &= \phi \\ P(\mathbf{x} = 0) &= 1 - \phi \\ P(\mathbf{x} = x) &= \phi^x(1-\phi)^{1-x}\\ E_\mathbf{x}[\mathbf{x}] &= \phi \\ \text{Var}_\mathbf{x}(\mathbf{x}) &= \phi(1- \phi) \end{align*}
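To make this concrete, here is a minimal NumPy sketch (the function name `bernoulli_pmf` and the chosen \(\phi\) are my own) that evaluates the PMF and checks the mean and variance by sampling:

```python
import numpy as np

def bernoulli_pmf(x, phi):
    """P(x) = phi^x * (1 - phi)^(1 - x) for x in {0, 1}."""
    return phi**x * (1 - phi)**(1 - x)

phi = 0.3
rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=phi, size=100_000)  # Bernoulli = Binomial with n=1

print(bernoulli_pmf(1, phi))  # 0.3
print(samples.mean())         # close to phi
print(samples.var())          # close to phi * (1 - phi) = 0.21
```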

Multinoulli Distribution

The multinoulli (or categorical) distribution is a distribution over a single discrete variable with k different states. It is parameterized by a vector \(\mathbf{p} \in [0, 1]^{k-1}\), where \(p_i\) gives the probability of the i-th state. The final, k-th state's probability is given by \(1 - \mathbf{1}^T\mathbf{p}\). We do not usually need to compute the expectation or variance of multinoulli-distributed random variables.
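A small illustrative sketch, assuming NumPy (the variable names are mine): it recovers the k-th state's probability from the first k − 1 entries and draws state indices:

```python
import numpy as np

p = np.array([0.2, 0.5])            # probabilities of the first k-1 = 2 states
p_full = np.append(p, 1 - p.sum())  # k-th state's probability: 1 - 1^T p
print(p_full)                       # [0.2 0.5 0.3]

rng = np.random.default_rng(0)
samples = rng.choice(len(p_full), size=10, p=p_full)  # draw state indices 0..k-1
print(samples)
```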

Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

\[ \mathcal{N}\left(x ; \mu, \sigma^{2}\right)=\sqrt{\frac{1}{2 \pi \sigma^{2}}} \exp \left(-\frac{1}{2 \sigma^{2}}(x-\mu)^{2}\right) \]

When we need to evaluate the PDF frequently with different parameter values, a more efficient parameterization uses \(\beta \in (0, \infty)\) to control the precision, or inverse variance, of the distribution:

\[ \mathcal{N}\left(x ; \mu, \beta^{-1}\right)=\sqrt{\frac{\beta}{2 \pi }} \exp \left(-\frac{\beta}{2 }(x-\mu)^{2}\right) \]
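As a quick check that the two parameterizations agree when \(\beta = 1/\sigma^2\), here is a minimal sketch (the helper names are my own):

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """N(x; mu, sigma^2), parameterized by the variance."""
    return np.sqrt(1 / (2 * np.pi * sigma2)) * np.exp(-(x - mu)**2 / (2 * sigma2))

def normal_pdf_precision(x, mu, beta):
    """N(x; mu, beta^{-1}), parameterized by the precision."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-beta * (x - mu)**2 / 2)

x, mu, sigma2 = 0.5, 0.0, 2.0
print(normal_pdf(x, mu, sigma2))                # ~0.2650
print(normal_pdf_precision(x, mu, 1 / sigma2))  # identical value
```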

Multivariate Normal Distribution

\[ \mathcal{N}\left(x ; \mu, \Sigma\right) =\sqrt{\frac{1}{(2 \pi) ^n \text{det}(\Sigma)}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right) \]

The parameter \(\mu\) still gives the mean of the distribution, though now it is vector valued. The parameter \(\Sigma\) gives the covariance matrix of the distribution. As in the univariate case, when we wish to evaluate the PDF several times for many different values of the parameters, the covariance is not a computationally efficient way to parameterize the distribution, since we need to invert \(\Sigma\) to evaluate the PDF. We can instead use a precision matrix \(\beta\):

\[ \mathcal{N}\left(x ; \mu, \beta^{-1}\right)=\sqrt{\frac{\text{det}(\beta)}{(2 \pi)^n }} \exp \left(-\frac{1}{2}(x-\mu)^{T} \beta (x-\mu)\right) \]

We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.
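Here is an illustrative sketch (the helper name is mine) that evaluates the multivariate normal PDF directly from a precision matrix, so nothing needs to be inverted at evaluation time; the single inversion below only builds \(\beta\) for the demo:

```python
import numpy as np

def mvn_pdf_precision(x, mu, beta):
    """N(x; mu, beta^{-1}) evaluated without inverting any matrix."""
    n = len(mu)
    d = x - mu
    norm = np.sqrt(np.linalg.det(beta) / (2 * np.pi)**n)
    return norm * np.exp(-0.5 * d @ beta @ d)

mu = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
beta = np.linalg.inv(cov)  # precision matrix, computed once up front

x = np.array([0.3, -0.2])
print(mvn_pdf_precision(x, mu, beta))
```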

Exponential and Laplace Distributions

In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:

\[ p(x; \lambda) = \lambda \mathbf{1}_{x \geq 0} \exp(-\lambda x) \]

The exponential distribution assigns probability zero to all negative values of x.

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point \(\mu\) is the Laplace distribution:

\[ \text{Laplace}(x; \mu, \gamma) = \frac{1}{2\gamma} \exp\left(-\frac{|x-\mu|}{\gamma}\right) \]
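A short sketch of both densities (the function names are mine); note how the indicator term zeroes out the exponential density for negative x:

```python
import numpy as np

def exponential_pdf(x, lam):
    return lam * (x >= 0) * np.exp(-lam * x)  # (x >= 0) acts as the indicator 1_{x >= 0}

def laplace_pdf(x, mu, gamma):
    return np.exp(-np.abs(x - mu) / gamma) / (2 * gamma)

x = np.linspace(-2, 2, 5)
print(exponential_pdf(x, lam=1.0))        # zero for x < 0, sharp point at x = 0
print(laplace_pdf(x, mu=0.0, gamma=1.0))  # sharp peak at mu
```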

Dirac Distribution and Empirical Distribution

In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function:

\[ p(x) = \delta(x-\mu) \]

The Dirac delta function is defined such that it is zero valued everywhere except at 0, yet integrates to 1. By shifting its argument by \(-\mu\), as above, we obtain a distribution that puts all of its probability mass at the point \(\mu\).

The Dirac delta function is not an ordinary function that associates each value x with a real-valued output; instead, it is a different kind of mathematical object, called a generalized function, that is defined in terms of its properties when integrated.

A common use of the Dirac delta distribution is as a component of an empirical distribution,

\[ \hat{p}(\mathbf{x}) = \frac{1}{m} \sum_{i=1}^{m}\delta(\mathbf{x} - \mathbf{x}^{(i)}) \]

which puts probability mass \(1/m\) on each of the m points forming a given dataset or collection of samples.

The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated to each possible input value that is simply equal to the empirical frequency of that value in the training set.

We can view the empirical distribution formed from a dataset of training examples as specifying the distribution that we sample from when we train a model on this dataset. Another important perspective on the empirical distribution is that it is the probability density that maximizes the likelihood of the training data.
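To illustrate the sampling perspective: drawing from \(\hat{p}\) amounts to picking one of the m training points uniformly at random. A minimal sketch, with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = rng.normal(size=(5, 2))  # m = 5 training points in R^2

# Sampling from p_hat puts mass 1/m on each x^{(i)}:
# pick indices uniformly at random, then return the corresponding points.
idx = rng.integers(len(dataset), size=3)
samples = dataset[idx]
print(samples)
```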