
Looking at probability distributions

I’ve been taking two classes in probability this semester, and in them I saw the proofs of a lot of interesting theorems which I knew about previously but had never seen proved, such as the Central Limit Theorem, the Laws of Large Numbers and so on… Also, some theory which looks somewhat ugly in undergrad courses becomes very clear with the proper formal treatment. Today I was thinking about the main take-home message that a computer scientist could take from those classes and, at least for me, this message is the various ways of looking at probability distributions. I had heard about moments, the Laplace transform, the Fourier transform and other tools like that, but I never realized their true power before. Probably most of their true power is still hidden from me, but I am starting to look at them in a different way. Let me try to go over a few examples of different ways we can look at probability distributions and show cases where they are interesting.

Most ways of looking at probability distributions are associated with a multiplicative system: a multiplicative system {Q} is a set of real-valued functions with the property that if {f_1, f_2 \in Q} then {f_1 \cdot f_2 \in Q}. Those kinds of sets are powerful because of the Multiplicative Systems Theorem:

Theorem 1 (Multiplicative Systems Theorem) If {Q} is a multiplicative system, {H} is a linear space containing {1} (the constant function {1}) and is closed under bounded convergence, then {Q \subseteq H} implies that {H} contains all bounded {\sigma(Q)}-measurable functions.

The theorem might look a bit cryptic if you are not familiar with the definitions, but it boils down to the following translation:

Theorem 2 (Translation of the Multiplicative Systems Theorem) If {Q} is a “general” multiplicative system and {X} and {Y} are random variables such that {\mathop{\mathbb E} f(X) = \mathop{\mathbb E} f(Y)} for all {f \in Q}, then {X} and {Y} have the same distribution.

where “general” excludes some troublesome cases like {Q = \{1\}} or the set of all constant functions, for example. In technical terms, we want {\sigma(Q) = \sigma\{f^{-1}((-\infty,a]); a\in {\mathbb R}, f \in Q \}} to be the Borel {\sigma}-algebra. But let’s not worry about those technical details and just look at the translated version. We now discuss several kinds of multiplicative systems:

  1. The most common description of a random variable is by the cumulative distribution function {F(u) = P\{X \leq u\}}. This is associated with {Q = \{ 1_{(-\infty,u]}(x); u \in {\mathbb R} \}}; notice that simply {F(u) = \mathop{\mathbb E}[ 1_{(-\infty,u]}(X) ]}.
  2. We can characterize a random variable by its moments: the variable {X} is described by the sequence {M_n = \int_{\mathbb R} x^n \mu_X(dx)}. Given the moments {M_1, M_2, \hdots}, the variable is totally characterized (at least when the moments do not grow too fast), i.e., if two variables have the same moments, then they have the same distribution by the Multiplicative Systems Theorem. This description is associated with the system {Q = \{x^n; n = 1, 2, \hdots\}}.
  3. Moment Generating Function: If {X} is a variable that assumes only non-negative integer values, we can describe it by {p_0, p_1, p_2, \hdots}, where {p_n = P\{ X = n \}}. An interesting way of representing those probabilities is as the moment generating function (also called the probability generating function) {\psi_X(z) = \mathop{\mathbb E}[z^X] = \sum_{n=0}^\infty p_n z^n}. This is associated with the multiplicative system {Q = \{z^x; 0 \leq z < 1\}}. Now suppose we are given two discrete independent variables {X} and {Y}. What do we know about {X+Y}? It is easy to know its expectation, its variance, … but what about more complicated things? What is the distribution of {X+Y}? Moment generating functions answer this question very easily, since:

    \displaystyle \psi_{X+Y} (z) = \mathop{\mathbb E} [z^{X+Y}] = \mathop{\mathbb E} [z^{X}] \cdot \mathop{\mathbb E}[z^{Y}] = \psi_X(z) \cdot \psi_Y(z)

    If we know moment generating functions, we can calculate expectations very easily, since {\mathop{\mathbb E}[X] = \psi'_X(1)}. For example, suppose we have a process like this: there is one bacterium at time {t = 0}. In each time step, this bacterium either dies (with probability {p_0}), stays alive without reproducing (with probability {p_1}) or has {k} offspring (with probability {p_{1+k}}). In that case {\sum_0^\infty p_n = 1}. At each time step, the same happens, independently, with each of the bacteria alive at that moment. The question is: what is the expected number of bacteria at time {t}?

    It looks like a complicated problem with just elementary tools, but it is a simple problem if we have moment generating functions. Just let {X_{ti}} be the variable associated with the {i^{th}} bacterium alive at time {t-1}: it is zero if it dies, {1} if it stays the same and {k+1} if it has {k} offspring. Let also {N_t} be the number of bacteria at time {t}. We want to know {\mathop{\mathbb E} N_t}. First, see that:

    \displaystyle N_t = \sum_{i=1}^{N_{t-1}} X_{ti}

    Now, let’s write that in terms of moment generating functions:

    \displaystyle \psi_{N_t}(z) = \mathop{\mathbb E}[z^{X_{t1} + \hdots + X_{tN_{t-1}}}] = \sum_{k=0}^\infty P\{N_{t-1} = k\} \cdot \mathop{\mathbb E}[z^{X_{t1} + \hdots + X_{tN_{t-1}}} \vert N_{t-1} = k]

    which is just:

    \displaystyle \psi_{N_t}(z) = \sum_{k=0}^\infty P\{N_{t-1} = k\} \cdot \mathop{\mathbb E}[z^{X_{t1} + \hdots + X_{tk}}] = \sum_{k=0}^\infty P\{N_{t-1} = k \} \cdot \psi_X (z)^k

    since the variables are all independent and identically distributed. Now, notice that:

    \displaystyle \psi_{N_{t-1}}(z) = \sum_{k=0}^\infty P\{N_{t-1} = k\} \cdot z^k

    by the definition of moment generating function, so we effectively proved that:

    \displaystyle \psi_{N_{t}}(z) = \psi_{N_{t-1}}(\psi_X(z)) = \psi_X \circ \psi_X \circ \hdots \circ \psi_X(z)

    We proved that {\psi_{N_{t}}(z)} is just {\psi_X(z)} iterated {t} times. Now, calculating the expectation is easy, using the facts that {\psi_X(1) = 1} and {\psi'_X(1) = \mathop{\mathbb E} X}. Just see that: {\mathop{\mathbb E} N_t = \psi_{N_{t}}'(1) = \psi_{N_{t-1}}'(\psi_X(1)) \cdot \psi'_X(1) = \psi_{N_{t-1}}'(1) \cdot \mathop{\mathbb E} X }. Then, clearly, {\mathop{\mathbb E} N_t = (\mathop{\mathbb E} X)^t} (a small simulation after this list checks this formula numerically). Using similar techniques we can prove a lot more things about this process, just by analyzing the behavior of the moment generating function.

  4. Laplace Transform: Now, moving to continuous variables, if {X} is a continuous non-negative variable we can define its Laplace transform as {\phi_X(u) = \mathop{\mathbb E} [e^{-u X}] = \int_0^\infty e^{-ux} \mu_X(dx)}, where {\mu_X(dx)} stands for the distribution of {X}, for example, {\rho_X(x) dx}. This is associated with the multiplicative system {Q = \{e^{-ux}; u \geq 0\}}. Again, by the Multiplicative Systems Theorem, if {\phi_X = \phi_Y}, then the two variables have the same distribution. The Laplace transform has the same nice properties as the Moment Generating Function, for example, {\phi_{X+Y}(u) = \mathop{\mathbb E}[e^{-(X+Y)u}] = \mathop{\mathbb E}[e^{-Xu}] \cdot \mathop{\mathbb E}[e^{-Yu}] = \phi_X(u) \cdot \phi_Y(u)} for independent {X} and {Y}. And it allows us to do tricks similar to the one I just showed for Moment Generating Functions. One common trick, used for example in the proof of Chernoff bounds, is the following: given independent non-negative random variables (with finite exponential moments),

    \displaystyle P\left\{\sum_i X_i > u\right\} = P\left\{e^{\sum_i X_i} > e^u\right\} \leq \frac{\mathop{\mathbb E}[e^{\sum_i X_i} ]}{e^u} = \frac{\prod_i \mathop{\mathbb E}[e^{X_i} ]}{e^u}

    where we also used Markov’s Inequality: {\mathop{\mathbb E}[X] \geq \beta P\{X \geq \beta\}} for a non-negative variable {X}. Passing to exponential moments like this is the main ingredient of the Chernoff bound and it allows us to sort of “decouple” the random variables in the sum (see the numerical sketch after this list). There are several other cases where the Laplace transform proves itself very useful and turns things that looked very complicated in undergrad courses into simple and clear ones. One clear example of that is the motivation for the Poisson random variable:

    If {T_i} are independent exponentially distributed random variables with rate {\lambda} (i.e., mean {1/\lambda}), then {\rho_{T_i}(t) = \lambda e^{- \lambda t}}. An elementary calculation shows that its Laplace transform is {\phi_{T_i}(u) = \frac{\lambda}{\lambda+u}}. Let {S_n = T_0 + T_1 + T_2 + \hdots + T_n}, i.e., the time of the {(n+1)^{th}} arrival. We want to know the distribution of {S_n}. How do we do that?

    \displaystyle \phi_{S_n}(u) = \mathop{\mathbb E}[e^{-u(T_0 + \hdots + T_n)}] = \phi_T(u)^{n+1} = \left( \frac{\lambda}{\lambda+u} \right)^{n+1}

    Now, we need to find {\rho_{S_n}} such that {\int_0^\infty \rho_{S_n}(t) e^{-ut} dt = \left( \frac{\lambda}{\lambda+u} \right)^{n+1}}. Now it is just a matter of solving this equation and we get {\rho_{S_n}(t) = \frac{\lambda (\lambda t)^n}{n!} e^{-\lambda t}}. Now, the Poisson variable {N_t} measures the number of arrivals in {[0,t]} (a simulation after this list checks the resulting formula numerically) and therefore:

    \displaystyle \begin{aligned} P\{N_t = n\} & = P\{S_{n-1} \leq t < S_n\} = P\{S_n > t\} - P\{S_{n-1} > t\} \\ & = \int_t^\infty \rho_{S_n}(s) ds - \int_t^\infty \rho_{S_{n-1}}(s) ds = \frac{(\lambda t)^n}{n!} e^{-\lambda t} \end{aligned}

  5. Characteristic Function or Fourier Transform: Taking {Q = \{e^{i\lambda x}; \lambda \in \mathbb{R}\}} we get the Fourier Transform {f_X(\lambda) = \mathop{\mathbb E} [e^{i \lambda X}]}, which also has some of the nice properties of the previous ones and some additional ones. The characteristic functions were the main actors in the development of all the probability techniques that led to the main result of 19th century Probability Theory: the Central Limit Theorem. We know that moment generating functions and Laplace transforms completely characterize the distributions, but it is not clear how to recover a distribution once we have a transform. For the Fourier Transform there is a clear and simple way of doing that by means of the Inversion Formula:

    \displaystyle \rho_X(x) = \frac{1}{2 \pi} \int_{\mathbb R} e^{-i \lambda x} f_X(\lambda) d\lambda

    One fact that always puzzled me was: why is the normal distribution {\rho_N (x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2} } so important? What does it have that is special enough to be the limiting distribution in the Central Limit Theorem, i.e., if {X_1, X_2, \hdots, X_n} is a sequence of independent random variables and {S_n = \sum_1^n X_i - \sum_1^n \mathop{\mathbb E} X_i}, then {S_n / \sqrt{\text{var } S_n} \rightarrow N(0,1)} in distribution, under some natural conditions on the variables? The reason the normal is so special is that it is a “fixed point” for the Fourier Transform: we can see that {f_N(\lambda) = e^{-\lambda^2/2}} (a quick numerical check appears after this list). And there we have something special about it that makes me believe the Central Limit Theorem.
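
To make item 3 a bit more concrete, here is a minimal simulation sketch (in Python) that checks the formula {\mathop{\mathbb E} N_t = (\mathop{\mathbb E} X)^t} empirically. The offspring distribution p and the other numbers are just illustrative assumptions of mine, not anything from the lectures.

    import numpy as np

    rng = np.random.default_rng(0)

    # p[k] = probability that a bacterium is replaced by k bacteria in the next
    # step (k = 0: it dies, k = 1: it stays alive, k >= 2: it has k - 1 offspring)
    p = np.array([0.2, 0.3, 0.3, 0.2])
    mean_X = np.dot(np.arange(len(p)), p)          # E[X]

    def average_population(t, trials=20000):
        """Empirical average of N_t over many independent runs."""
        total = 0
        for _ in range(trials):
            n = 1                                  # one bacterium at time 0
            for _ in range(t):
                if n == 0:
                    break
                # each bacterium is independently replaced by X_i individuals
                n = rng.choice(len(p), size=n, p=p).sum()
            total += n
        return total / trials

    t = 5
    print("simulated E[N_t]:   ", average_population(t))
    print("predicted (E[X])^t: ", mean_X ** t)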
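
The exponential-moment trick from item 4 can also be checked numerically. The sketch below is my own toy example, using independent Bernoulli(q) variables and an arbitrary threshold u; it compares a Monte Carlo estimate of {P\{\sum_i X_i > u\}} with the bound {\prod_i \mathop{\mathbb E}[e^{X_i}] / e^u}.

    import numpy as np

    rng = np.random.default_rng(1)
    n, q, u = 50, 0.1, 12.0

    # Monte Carlo estimate of P{ sum_i X_i > u } for X_i ~ Bernoulli(q), i.i.d.
    sums = rng.binomial(1, q, size=(200000, n)).sum(axis=1)
    lhs = (sums > u).mean()

    # For a Bernoulli(q) variable, E[e^X] = (1 - q) + q * e
    rhs = ((1 - q + q * np.e) ** n) / np.exp(u)

    print("P{sum > u}, simulated:   ", lhs)
    print("exponential-moment bound:", rhs)

The bound is of course loose (we did not optimize over an exponent as in the full Chernoff argument), but it does sit above the simulated probability, which is the point of the inequality.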
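
Here is a similar sanity check for the Poisson computation in item 4: drawing exponential inter-arrival times with rate {\lambda} and counting arrivals in {[0,t]} should reproduce the probabilities {\frac{(\lambda t)^n}{n!} e^{-\lambda t}}. The parameter values are, again, arbitrary choices for illustration.

    import numpy as np
    from math import exp, factorial

    rng = np.random.default_rng(2)
    lam, t, trials = 2.0, 3.0, 50000

    counts = np.empty(trials, dtype=int)
    for i in range(trials):
        # accumulate exponential inter-arrival times until we pass time t
        elapsed, n = 0.0, 0
        while True:
            elapsed += rng.exponential(1.0 / lam)   # numpy's scale is the mean 1/lambda
            if elapsed > t:
                break
            n += 1
        counts[i] = n

    for n in range(3, 10):
        empirical = (counts == n).mean()
        poisson = (lam * t) ** n / factorial(n) * exp(-lam * t)
        print(f"P(N_t = {n}): simulated {empirical:.4f}, formula {poisson:.4f}")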
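
Finally, a quick numerical check of the “fixed point” remark in item 5: the characteristic function of a standard normal, estimated from samples, should match {e^{-\lambda^2/2}} (and the imaginary part should be essentially zero).

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.standard_normal(1_000_000)

    for lam in (0.5, 1.0, 2.0):
        # Monte Carlo estimate of E[e^{i * lam * X}] for X ~ N(0, 1)
        estimate = np.mean(np.exp(1j * lam * x))
        exact = np.exp(-lam ** 2 / 2)
        print(f"lambda = {lam}: estimate {estimate.real:+.4f}{estimate.imag:+.4f}i, exact {exact:.4f}")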

————————-

This blog post was based on lectures by Professor Dynkin at Cornell.
