Notes on Coffee, Books, and Programming

Probability, Entropy, and Inference (ITILA 2)

Jun 15, 2020

2.1: Probabilities and ensembles

Last time, we uncovered a really difficult problem in trying to make a decent disk drive: that of keeping data intact between the time we first put it in and the time we look it up. We then “solved” this through the all-seeing majesty of Shannon’s Information Theory, showing that some crazy theoretical limits are achievable, and promising a thorough explanation of just how this is possible.

With this settled, we now need to step back and get some of the basics covered. To get into just how Shannon’s theory is constructed, we need to understand some basic concepts, namely random variables and entropy.

These concepts will help us formalize some of our notions and, like all good mathematical concepts, let us calculate things without having to reason about the raw probabilities directly, which can be really hard to do a lot of the time.

With that said, let’s start getting formal. The first concept we need to explain is that of an “ensemble”:


Ensemble - A triple X = (x, A_X, P_X), with three terms: an outcome, x, that is the value of a random variable; an alphabet, A_X = \{a_1, a_2, ..., a_i, ..., a_I\}, which is the set of values the outcome can have; and a set of probabilities, P_X = \{p_1, p_2, ..., p_I\}, which gives each term in the alphabet some non-negative (but possibly zero) probability such that the sum of the probabilities over all members of the alphabet is 1.


Ensembles serve as our formal description of a Random Variable, specifically a Discrete Random Variable. An ensemble encodes a set of events, out of which exactly one is certain to happen, along with the associated chances, or probabilities, of each event.

Thus, we could encode the whole space of outcomes of a die with the following ensemble:

D = (d, \{1,2,3,4,5,6\}, \{\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6}\})

If we wish to get the probability of a 1 from this ensemble, for example, we could ask for the following:

P(d=1)=\frac{1}{6}

Now, writing the exact outcome for an ensemble can be a very repetitive affair when discussing probabilities, so we’ll abbreviate P(x=a_i) as P(a_i) or P(x) from now on.

If we want more than the probability of a single event, and instead want the probability of some subset of the alphabet, we can do that by simply summing the probabilities of each event in the subset:

P(T) = P(x\in T) = \sum_{a_i \in T} P(x=a_i)
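
To make this concrete, here’s a minimal Python sketch of the die ensemble above. The dictionary name P_D and the subset T of even faces are my own choices for illustration, not notation from the book:

from fractions import Fraction

# The die ensemble D = (d, A_D, P_D), stored as a map from outcome to probability.
P_D = {face: Fraction(1, 6) for face in range(1, 7)}

assert sum(P_D.values()) == 1  # probabilities over the alphabet must sum to 1

# Probability of a single outcome: P(d = 1).
print(P_D[1])  # 1/6

# Probability of a subset of the alphabet, e.g. T = {2, 4, 6} (the even faces):
T = {2, 4, 6}
print(sum(P_D[a] for a in T))  # 1/2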


Besides simple ensembles, we can also define Joint Ensembles, XY, which are ensembles where each outcome is an ordered pair x, y, where x comes from some alphabet A_X and y comes from another alphabet A_Y. Like normal ensembles, we can then ask for the probabilities of events in the joint ensemble, but instead of using one input, P(x), we ask for multiple, P(x, y). While commas are optional here, i.e. P(xy) = P(x,y), I’m going to use commas to make the parameters more explicit.


With Joint Ensembles, we can marginalize to get the probabilities of each portion of the pair, which are called “marginal probabilities”. So, for example, we can get P(x) from the joint probability P(x, y) with the following equation:

P(x=a_i) = \sum_{y \in A_Y}P(x=a_i, y)

In effect, by simply summing over all possibilities of one term, we can get the probabilities of the other term without that dependence.

We can also define so-called conditional probabilities, which tell us how much the probability of one term in a joint ensemble depends on information gained about the other term. They’re calculated in the following way:

P(x=a_i|y=b_j) := \frac{P(x=a_i, y=b_j)}{P(y=b_j)} \text{ if } P(y=b_j) \neq 0

where P(x=a_i|y=b_j) is read as “the probability that x equals a_i, given that y is b_j”.
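
As a sketch of how marginalization and conditioning work in practice, here’s a tiny joint ensemble in Python. The distribution itself is made up purely for illustration; only the formulas come from above:

from fractions import Fraction

# A made-up joint ensemble XY over A_X = {'a','b'} and A_Y = {0, 1},
# stored as P(x, y) for every ordered pair.
P_XY = {
    ('a', 0): Fraction(1, 8), ('a', 1): Fraction(3, 8),
    ('b', 0): Fraction(3, 8), ('b', 1): Fraction(1, 8),
}
A_X = {'a', 'b'}
A_Y = {0, 1}

# Marginal probability: P(x) = sum over y of P(x, y).
def P_x(x):
    return sum(P_XY[(x, y)] for y in A_Y)

# Conditional probability: P(x | y) = P(x, y) / P(y), defined when P(y) != 0.
def P_x_given_y(x, y):
    P_y = sum(P_XY[(xp, y)] for xp in A_X)
    return P_XY[(x, y)] / P_y

print(P_x('a'))              # 1/2
print(P_x_given_y('a', 1))   # 3/4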


Rules and Regulations

With all of these definitions over with (for now), there are also some rules we need to cover, and cover quickly. So, without further ado, here is a quick breakdown of the rules covered in the text!

The Product Rule:

P(x,y|H) = P(x|y,H)P(y|H) = P(y|x,H)P(x|H)

The Sum Rule:

P(x|H) = \sum_{y}P(x,y|H) = \sum_y P(x|y,H)P(y|H)

Bayes’ Theorem:

P(y|x,H) = \frac{P(x|y,H)P(y|H)}{P(x|H)} = \frac{P(x|y,H)P(y|H)}{\sum_{y'}P(x|y',H)P(y'|H)}

If you haven’t seen this theorem yet, it’s one of the more important ones in all of probability, so I’d suggest keeping an eye out as you’ll see it again and again.
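
To see the three rules in action, here’s a small numerical check, reusing the same made-up joint distribution as before; the assertions mirror the product rule, the sum rule, and Bayes’ Theorem exactly as written above:

from fractions import Fraction

# Reusing the made-up joint distribution from before to check the three rules numerically.
P_XY = {
    ('a', 0): Fraction(1, 8), ('a', 1): Fraction(3, 8),
    ('b', 0): Fraction(3, 8), ('b', 1): Fraction(1, 8),
}
A_X, A_Y = {'a', 'b'}, {0, 1}

P_x = {x: sum(P_XY[(x, y)] for y in A_Y) for x in A_X}   # marginals P(x)
P_y = {y: sum(P_XY[(x, y)] for x in A_X) for y in A_Y}   # marginals P(y)
P_x_given_y = {(x, y): P_XY[(x, y)] / P_y[y] for x in A_X for y in A_Y}
P_y_given_x = {(y, x): P_XY[(x, y)] / P_x[x] for x in A_X for y in A_Y}

for x in A_X:
    for y in A_Y:
        # Product rule: P(x, y) = P(x|y) P(y) = P(y|x) P(x).
        assert P_XY[(x, y)] == P_x_given_y[(x, y)] * P_y[y] == P_y_given_x[(y, x)] * P_x[x]
    # Sum rule: P(x) = sum over y of P(x|y) P(y).
    assert P_x[x] == sum(P_x_given_y[(x, y)] * P_y[y] for y in A_Y)

# Bayes' theorem: P(y|x) = P(x|y) P(y) / sum over y' of P(x|y') P(y').
x, y = 'a', 1
bayes = (P_x_given_y[(x, y)] * P_y[y]) / sum(P_x_given_y[(x, yp)] * P_y[yp] for yp in A_Y)
assert bayes == P_y_given_x[(y, x)]
print(bayes)  # 3/4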

Independence:

We say that two random variables, X and Y, are independent iff:

P(x,y) = P(x)P(y), \forall x,y


2.2: The meaning of probability

Having defined some of the basic mathematical notions of probability, let’s actually try to understand what we just defined.

The question of “What does probability mean?” has, in fact, been a topic of discussion for years. In deliberating this fundamental concept, two major schools of thought have come about: Frequentism and Bayesianism.

Frequentism describes probabilities as frequencies of outcomes in random experiments. So, if we repeat some experiment lots of times, the fraction of trials in which an event occurs serves as the approximate “probability” of that event. For most simple applications, this serves as a decent guide, and also as a warning: things like “1 in a million” really do happen to people, as we have billions of people to contend with.

On the other hand, Bayesians use probabilities to describe degrees of belief in propositions that do not involve random variables. This is all based on Bayes’ rule, which allows us to take the probability of certain evidence given that an event occurs, and invert it, getting the probability, or belief, that the event occurs given that we know the evidence.

To strictly encode the idea of belief into probabilities, we need to make sure our degrees of belief are governed by some axioms. So long as they satisfy some basic assumptions, we can equate beliefs with probabilities.

In this case, the so-called axioms we need to consult are the Cox Axioms. If we denote “the degree of belief in proposition x” as B(x), and the degree of belief in x assuming some other proposition, y, is true as B(x|y), then we need to assume the following to map degrees of belief onto probabilities:

  1. Firstly, we need to assume that degrees of belief can be ordered. If B(x) > B(y), and B(y) > B(z), then B(x) > B(z). In other words, degrees of belief are transitive under the “greater than” operator.
  2. Secondly, we need to assume that there is some function relating the degree of belief in a proposition x and its negation, \bar{x}. Mathematically, we want the following:

B(x) = f[B(\bar{x})]

  3. Thirdly, we need to assume that there is some function relating the degree of belief in the conjunction of two propositions, x \land y, the degree of belief in the conditional proposition x | y, and the degree of belief in the proposition y. Thus, we want the following:

B(x,y) = g[B(x|y), B(y)]

In other words, so long as we can say some beliefs are more likely than others, that beliefs and their opposites are related, and that the combination of two beliefs is related in some way to the beliefs themselves, we know that we can map our understanding of beliefs onto probabilities, and extend probabilities from the more objective setting we’re used to, to the more subjective idea of belief.

In the minds of Bayesians, we cannot infer anything about the world without assumptions, and probabilities express confidences. This book runs with the Bayesian viewpoint on statistics, as we wish to infer things about data, so we’ll use this standpoint from now on.

2.3: Forward probabilities and inverse probabilities

When calculating probabilities from the viewpoint of Bayesians, there end up being two kinds of problems: that of forward probability and that of inverse probability.

Forward Probability Problems refer to those probability questions where we have some generative model that produces data, and our goal is to compute the probability distribution of the data or of some other data-dependent quantity. In other words, our goal is to take a model and compute some facts about the data it generates.

Inverse Probability Problems, however, refer to those probability questions where you still have a generative model, but instead of computing the probability distribution produced by the process, we compute the conditional probability of some unobserved variables given the data. In other words, instead of computing known quantities, we try to reason about the things we don’t know, and infer something about the original problem.
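
As a quick sketch of a forward probability problem (a toy example of my own, not one of the book’s exercises): given a coin with a known bias tossed N times, we can compute the full distribution of the number of heads, a quantity that depends on the data:

from math import comb

# Forward probability sketch: the generative model is a coin with known bias theta,
# tossed N = 4 times; we compute the distribution of the number of heads r.
theta, N = 0.3, 4
P_r = {r: comb(N, r) * theta**r * (1 - theta)**(N - r) for r in range(N + 1)}
print(P_r)                # the full distribution over r = 0..4
print(sum(P_r.values()))  # 1.0 (up to float rounding)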

To solve inverse probability problems, we’ll definitely need to use Bayes’ Theorem, which inverts our expectations, as we usually know the dependence of the observed variables on the unobserved ones, and not the other way around. To make things easier to discuss, we define some terms that come up repeatedly in these problems:

Firstly, we call the marginal probability of the unobserved variables, P(u), the prior probability of u. This tells us the already-known probabilities of the unobserved variables and, if we did not have any data, what we should guess in order to guess correctly as often as possible.

Secondly, we call the probability of the observed variables given different values of the unobserved variables the likelihood of the unobserved variables, which we denote P(o|u, M). This is not a probability distribution over the unobserved variables; it’s a function of them as well as of the observed variables.

Thirdly, we call the probability of the unobserved variables given the observed variables, P(u|o, M), the posterior probability of u.

Lastly, the probability of the observed variables given the model is called the evidence or the marginal likelihood, denoted P(o|M).

Thus, our Bayesian equation becomes the following, which basically allows us to quantify the use of evidence in updating beliefs, where \theta refers to the unobserved parameters, D refers to the data, and H refers to the space of hypotheses:

P(\theta|D,H) = \frac{P(D|\theta,H)P(\theta|H)}{P(D|H)}

Which becomes the following, if we look at the terminology more closely:

\text{Posterior Prob.} = \frac{\text{Likelihood} \times \text{Prior Prob.}}{\text{Evidence}}
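
Here’s a minimal sketch of this formula at work. The bent-coin setup, the candidate values of \theta, and the observed data are all my own toy choices, not an example from this section:

from fractions import Fraction

# Toy inverse-probability problem: theta is a coin's unknown bias P(heads),
# restricted to three candidate values with a uniform prior over them.
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = {t: Fraction(1, 3) for t in thetas}

# Observed data D: a particular sequence of 4 tosses, H H H T.
def likelihood(theta):
    return theta**3 * (1 - theta)

# Evidence P(D|H): the marginal likelihood, summing over the candidate parameters.
evidence = sum(likelihood(t) * prior[t] for t in thetas)

# Posterior = Likelihood * Prior / Evidence, for each candidate theta.
posterior = {t: likelihood(t) * prior[t] / evidence for t in thetas}
for t in thetas:
    print(t, posterior[t])
# The posterior concentrates on theta = 3/4, the bias most consistent with the data.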

This aspect of probabilities - that they represent likelihoods - allows us to better represent what’s going on and understand how the things we wish to know about the world can be inferred from what we do know. This leads to the so-called likelihood principle:


If we are given a generative model for some data d with parameters \theta, and the probability of the data given those parameters, P(d|\theta), then once we observe some particular data d_1, all inferences must depend only on P(d_1|\theta), the probability of the observed data given the parameters.


2.4: Definition of entropy and related functions

Given an outcome, x, we define the Shannon information content of x as the following:

h(x) = -\log_2{P(x)}

This is measured in “nats” if the logarithm is base e, while it’s measured in “bits” if the logarithm is base 2.

We’ll go over why this is an accurate understanding of the idea of “information content” for an outcome later, but all we’re doing now is assigning some value that is higher for less-likely events and is lower for more-likely events.
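
As a tiny illustration (the probabilities below are made up), here’s the information content of the two outcomes of a hypothetical biased coin, showing exactly that rarer outcomes carry more bits:

from math import log2

# Shannon information content h(x) = -log2 P(x), in bits.
def h(p):
    return -log2(p)

# A hypothetical biased coin: rare outcomes carry more information.
print(h(0.9))   # ~0.15 bits: seeing the likely outcome tells us little
print(h(0.1))   # ~3.32 bits: seeing the unlikely outcome is much more informative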

Besides defining the Shannon Information Content of an outcome, we can also define the Entropy of an Ensemble, XX, in the following way:

H(X) \equiv -\sum_{x\in A_X}P(x)\log(P(x))

where, if P(x) = 0, we take the corresponding term P(x)\log(1/P(x)) to be 0, since \lim_{\theta \rightarrow 0^{+}}\theta \log(1/\theta) = 0.

This is simply the weighted average Shannon Information Content of some ensemble, and is a profoundly deep quantity we’ll explore in more detail later.

This function also has the following nice properties:

  • H(X) \geq 0, with equality iff \exists p_i \text{ s.t. } p_i = 1
  • Entropy is maximized iff the probability distribution is uniform.
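
Here’s a minimal sketch that computes H(X) for a few made-up distributions and checks both properties numerically:

from math import log2

# Entropy of an ensemble, in bits: H(X) = sum_x P(x) log2(1/P(x)),
# with terms where P(x) = 0 contributing 0.
def H(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # uniform over a 4-symbol alphabet
skewed  = [0.7, 0.1, 0.1, 0.1]       # a made-up non-uniform distribution
certain = [1.0, 0.0, 0.0, 0.0]       # one outcome is certain

print(H(uniform))  # 2.0 bits = log2(4): the maximum for |A_X| = 4
print(H(skewed))   # ~1.36 bits: strictly less than the uniform case
print(H(certain))  # 0.0 bits: equality in H(X) >= 0 when some p_i = 1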

We also define the redundancy of an ensemble. This measures the fractional difference between the entropy H(X) and its maximum possible value, \log_2|A_X|. Thus, for a uniform distribution, where there is no redundancy, it is 0, while for a distribution where the probability of a single event is 1, where there is only redundancy, it is 1:

\text{Redundancy} = 1 - \frac{H(X)}{\log_2{\vert A_X\vert}}
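
And a short sketch of the redundancy formula on the same made-up distributions (the helper H is repeated so the snippet stands alone):

from math import log2

def H(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

def redundancy(probs):
    # Redundancy = 1 - H(X) / log2 |A_X|.
    return 1 - H(probs) / log2(len(probs))

print(redundancy([0.25, 0.25, 0.25, 0.25]))  # 0.0 for the uniform distribution
print(redundancy([1.0, 0.0, 0.0, 0.0]))      # 1.0 when a single outcome has probability 1
print(redundancy([0.7, 0.1, 0.1, 0.1]))      # ~0.32, somewhere in between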

We can also define the Entropy of a Joint Distribution like the following:

H(X,Y) = -\sum_{xy\in A_X A_Y}P(x,y)\log(P(x,y))

If the probability distributions are independent, then, as expected, the joint entropy is the sum of the component entropies:

H(X,Y) = H(X) + H(Y)
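
A quick numerical check of this additivity, using two made-up independent distributions whose joint is just the outer product of the marginals:

from math import log2

def H(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Two made-up independent distributions; under independence P(x, y) = P(x) P(y).
P_X = [0.5, 0.5]
P_Y = [0.25, 0.75]
P_XY = [px * py for px in P_X for py in P_Y]

# Joint entropy equals the sum of the component entropies (up to float rounding).
print(H(P_XY))          # ~1.811 bits
print(H(P_X) + H(P_Y))  # ~1.811 bits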

2.5: Decomposability of the entropy

As an important note, we can compute entropy recursively. If we rewrite H(X), the entropy of our ensemble, as H(\mathbf{p}), where \mathbf{p} is the probability vector associated with our ensemble, then:

H(\mathbf{p}) = H(p_1, 1-p_1) + (1-p_1)H\left(\frac{p_2}{1-p_1},\frac{p_3}{1-p_1},\ldots,\frac{p_I}{1-p_1}\right)
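
Here’s a small sketch checking this decomposition on a made-up four-outcome distribution:

from math import log2

def H(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Check the decomposition on a made-up distribution p = (p1, p2, p3, p4).
p = [0.5, 0.25, 0.125, 0.125]
p1 = p[0]
rest = [pi / (1 - p1) for pi in p[1:]]  # conditional distribution given "not the first outcome"

lhs = H(p)
rhs = H([p1, 1 - p1]) + (1 - p1) * H(rest)
print(lhs, rhs)  # both 1.75 bits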

2.6: Gibbs’ inequality

The relative entropy, or Kullback-Leibler divergence, between two probability distributions, P(x) and Q(x), defined over the same alphabet is the following:

D_{KL}(P||Q) = \sum_{x}P(x)\log\frac{P(x)}{Q(x)}

This quantity is not symmetric, but it does satisfy an important inequality, Gibbs’ inequality:

D_{KL}(P||Q) \geq 0

with equality only if P and Q refer to the same distribution.
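
A minimal sketch of the divergence and Gibbs’ inequality, on made-up distributions over a three-symbol alphabet (assuming Q(x) > 0 wherever P(x) > 0, so the sum is finite):

from math import log2

# Relative entropy D_KL(P || Q) in bits.
def D_KL(P, Q):
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.25, 0.25]       # made-up distributions over a 3-symbol alphabet
Q = [1/3, 1/3, 1/3]

print(D_KL(P, Q))  # ~0.085 >= 0, as Gibbs' inequality requires
print(D_KL(Q, P))  # ~0.082: a different value, since D_KL is not symmetric
print(D_KL(P, P))  # 0.0: equality holds when the two distributions are identical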

2.7: Jensen’s inequality for convex functions

We define a convex function f(x) as any function satisfying the following inequality:

f(\lambda x_1 + (1-\lambda)x_2) \leq \lambda f(x_1) + (1-\lambda)f(x_2)

for all \lambda \in [0, 1] and all x_1 and x_2. A function is strictly convex if, for \lambda \in (0, 1), equality holds only when x_1 = x_2.

If we have a convex function, then Jensen’s inequality is in play:

E[f(x)] \geq f(E[x])
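
As a quick numerical sanity check (with a made-up distribution and the convex function f(x) = x^2):

from math import fsum

# Jensen's inequality check with the convex function f(x) = x^2,
# on a made-up distribution over a few values of x.
xs = [1.0, 2.0, 6.0]
ps = [0.5, 0.3, 0.2]

E_x  = fsum(p * x for p, x in zip(ps, xs))       # E[x]
E_fx = fsum(p * x**2 for p, x in zip(ps, xs))    # E[f(x)]

print(E_fx, E_x**2)      # 8.9 vs 5.29
assert E_fx >= E_x**2    # E[f(x)] >= f(E[x]) for convex f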

Jensen’s inequality is surprisingly simple but stupidly useful for a variety of problems, and it serves as the end of this chapter.


Written by Arav Agarwal, a UofM CS student who has a habit of explaining things aloud.