Shannon’s seminal paper “A Mathematical Theory of Communication” introduced the concept of entropy as a measure of information. The paper is a cornerstone of information theory, and it has had a profound impact on many scientific disciplines; the concept of entropy is now widely used in physics, computer science, and statistics, among others. In this blog post, I would like to present an alternative derivation of Shannon entropy based on results from coding theory.

This derivation, among others, will be included in my book “Data Compression in Depth”. The book is a comprehensive guide to data compression, covering:

  • Theoretical foundations: probability theory, information theory, the asymptotic equipartition property, etc.
  • Practical applications: the book presents a variety of algorithms and techniques alongside example implementations in the ISO C99 programming language.
  • The latest advancements in the field, such as Asymmetric Numeral Systems (ANS).

If you are interested in the book, make sure to follow my website for updates - I also maintain an RSS feed for the newest blog posts.

Preliminaries

We introduce the Shannon entropy of a message ensemble (X, \mathcal{X}, p) as follows:

\mathrm H(p) = \mathbb E[-\log p(x)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)

where \mathcal{X} is the set of possible values of X, and p is the probability mass function of X. The logarithm is generally taken to base 2, in which case the entropy is measured in bits. An optimal code (allowing, for now, non-integer codeword lengths) assigns codeword lengths to the values of \mathcal{X} such that the average codeword length is precisely the entropy of the message ensemble. Hence, for all such optimal codes, the following holds:

\bar l = \mathbb E[l(x)] = \sum_{x \in \mathcal{X}} p(x) l(x) = \mathrm H(p)
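
As a quick numerical illustration of these definitions, here is a minimal C99 sketch (my own example, not an excerpt from the book; the name entropy_bits and the example PMF are arbitrary) that computes H(p) in bits:

```c
#include <math.h>
#include <stdio.h>

/* Shannon entropy, in bits, of a PMF given as an array of probabilities.
   Symbols with p(x) = 0 contribute nothing, since p log p -> 0 as p -> 0. */
static double entropy_bits(const double *p, size_t n)
{
    double h = 0.0;
    for (size_t i = 0; i < n; i++)
        if (p[i] > 0.0)
            h -= p[i] * log2(p[i]);
    return h;
}

int main(void)
{
    /* Example ensemble: a dyadic four-symbol source. */
    const double p[] = { 0.5, 0.25, 0.125, 0.125 };
    printf("H(p) = %f bits\n", entropy_bits(p, 4)); /* 1.750000 */
    return 0;
}
```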

The Kraft-McMillan inequality states that, given a uniquely decodable code with codeword lengths l(x), the following holds:

\sum_{x \in \mathcal{X}} 2^{-l(x)} \le 1

If the inequality is satisfied with equality, then the code is complete (i.e. adding a new codeword and associating it with a symbol would make the code no longer uniquely decodable). Satisfying the Kraft-McMillan inequality is a necessary and sufficient condition for the existence of a uniquely decodable code with the given codeword lengths.
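
To make the condition concrete, the following C99 sketch (again illustrative only; kraft_sum is a made-up helper, not an established API) evaluates the Kraft sum for a given set of codeword lengths: a sum of exactly 1 indicates a complete code, a sum below 1 leaves room for further codewords, and a sum above 1 means no uniquely decodable code with those lengths exists.

```c
#include <math.h>
#include <stdio.h>

/* Kraft sum of a set of codeword lengths: sum over x of 2^{-l(x)}. */
static double kraft_sum(const int *len, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += ldexp(1.0, -len[i]); /* 2^{-len[i]} */
    return s;
}

int main(void)
{
    const int complete[]   = { 1, 2, 3, 3 }; /* Kraft sum 1.0: complete code */
    const int incomplete[] = { 2, 2, 3 };    /* Kraft sum 0.625: slack left  */

    printf("%f\n", kraft_sum(complete, 4));   /* 1.000000 */
    printf("%f\n", kraft_sum(incomplete, 3)); /* 0.625000 */
    return 0;
}
```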

The derivation

Suppose that we do not know the formula for \mathrm H(p). We wish to minimise the expected codeword length \bar l, while satisfying the Kraft-McMillan inequality. We consider the constrained minimisation via Lagrange multipliers:

J = \sum_{x \in \mathcal{X}} p(x) l(x) + \lambda \left(\sum_{x \in \mathcal{X}} 2^{-l(x)} - 1\right)

Consequently, using \frac{\partial}{\partial l(x)} 2^{-l(x)} = -\ln 2 \cdot 2^{-l(x)}, we obtain:

\frac{\partial J}{\partial l(x)} = p(x) - \lambda \ln 2 \cdot 2^{-l(x)} = 0

Solving for 2^{-l(x)} yields:

2^{-l(x)} = \frac{p(x)}{\lambda \ln 2}

We now return to the Kraft-McMillan constraint and simplify using the fact that a PMF sums to one:

\sum_{x \in \mathcal{X}} 2^{-l(x)} = \sum_{x \in \mathcal{X}} \frac{p(x)}{\lambda \ln 2} = \frac{1}{\lambda \ln 2}

To obtain a complete code, we require equality and solve 1 = \frac{1}{\lambda \ln 2} for \lambda, which yields \lambda = \frac{1}{\ln 2}. Substituting back yields:

2^{-l(x)} = p(x)

Taking the base-2 logarithm of both sides and multiplying by -1 yields:

l(x) = -\log_2 p(x)

The quantity -\log_2 p(x) is frequently called the self-information of x, and it measures the amount of information in bits that is gained when the value x is observed. Such a choice of non-integer codeword lengths yields an optimal code, the average length of which is:

\bar l = \sum_{x \in \mathcal{X}} p(x) l(x) = \sum_{x \in \mathcal{X}} p(x) (-\log_2 p(x)) = \mathrm H(p)

Hence, we have shown that the theoretical limit for the compression of memoryless (i.i.d.) sources is the Shannon entropy.
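
The result is easy to check numerically. The C99 sketch below (my own illustration, using an arbitrary non-dyadic PMF) assigns the ideal lengths l(x) = -\log_2 p(x) and verifies that the Kraft sum equals 1 and that the average length matches the entropy:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* An arbitrary, non-dyadic example PMF; the ideal lengths below
       are real-valued, not integers. */
    const double p[] = { 0.4, 0.3, 0.2, 0.1 };
    const size_t n = sizeof p / sizeof p[0];

    double kraft = 0.0, avg_len = 0.0, entropy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double l = -log2(p[i]);   /* ideal codeword length l(x)  */
        kraft   += pow(2.0, -l);  /* equals p[i] by construction */
        avg_len += p[i] * l;
        entropy -= p[i] * log2(p[i]);
    }

    /* Expect: Kraft sum = 1, average length = H(p), up to rounding error. */
    printf("Kraft sum = %f, average length = %f, H(p) = %f\n",
           kraft, avg_len, entropy);
    return 0;
}
```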

Non-integer codeword lengths

Of course, it is not exactly feasible to build e.g. a Huffman codebook that assigns one and a half bits to a symbol. Merely rounding the codeword lengths to integers will not necessarily yield an optimal code; it is easy to see that it would be the case only if the input followed a dyadic distribution (i.e. the probabilities are of the form \frac{1}{2^k}), since then the self-information values are already integers. On the other hand, the upper bound of Shannon coding (assigning codewords of lengths equal to the self-information rounded up) of \bar l < \mathrm H(p) + 1 follows by noticing that \lceil x \rceil - x < 1, and thus \sum_{x \in \mathcal{X}} p(x) \lceil -\log_2 p(x) \rceil - \mathrm H(p) < 1. Huffman coding provides a way to assign integer code lengths to symbols such that, under this integer-length constraint, the code is optimal.
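
To make both claims tangible, the following C99 sketch (illustrative, not from the book) computes the average length of a Shannon code, l(x) = \lceil -\log_2 p(x) \rceil, for a dyadic and a non-dyadic PMF; the former matches the entropy exactly, while the latter stays within one bit of it:

```c
#include <math.h>
#include <stdio.h>

/* Average length of a Shannon code, l(x) = ceil(-log2 p(x)),
   compared with the entropy of the same PMF, both in bits. */
static void shannon_vs_entropy(const double *p, size_t n)
{
    double avg = 0.0, h = 0.0;
    for (size_t i = 0; i < n; i++) {
        avg += p[i] * ceil(-log2(p[i]));
        h   -= p[i] * log2(p[i]);
    }
    printf("average = %f, entropy = %f\n", avg, h);
}

int main(void)
{
    const double dyadic[]     = { 0.5, 0.25, 0.125, 0.125 }; /* exact match  */
    const double non_dyadic[] = { 0.4, 0.3, 0.2, 0.1 };      /* within 1 bit */

    shannon_vs_entropy(dyadic, 4);     /* average = 1.750000, entropy = 1.750000 */
    shannon_vs_entropy(non_dyadic, 4); /* average = 2.400000, entropy = 1.846439 */
    return 0;
}
```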

Algorithms that can assign a fractional number of bits to a symbol, such as ANS or arithmetic coding, are said to make use of fractional bits. This technique allows for the construction of codes that come arbitrarily close to the Shannon limit. In practice, integer implementations of arithmetic coding are not exactly optimal, but they are still more efficient (in terms of compression ratio) than Huffman coding.
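
The benefit of fractional bits can be illustrated without writing a full arithmetic coder. If a whole message is treated as a single unit, its cost is roughly the self-information of the entire message rounded up once (the exact overhead depends on the coder), so the per-symbol rounding loss vanishes as the message grows. A hypothetical C99 sketch with a heavily biased binary source:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Heavily biased binary source: P(0) = 0.9, P(1) = 0.1.
       Any per-symbol integer-length code needs at least 1 bit per symbol,
       while H(p) is only about 0.469 bits per symbol. */
    const double p0 = 0.9, p1 = 0.1;
    const size_t n = 1000, ones = 100; /* a typical 1000-symbol message */

    double h = -(p0 * log2(p0) + p1 * log2(p1));

    /* Self-information of the whole message; an ideal fractional-bit coder
       (e.g. an arithmetic coder) spends roughly this many bits in total,
       paying the rounding loss only once per message. */
    double msg_bits = -((double)(n - ones) * log2(p0) + (double)ones * log2(p1));

    printf("entropy          : %f bits/symbol\n", h);
    printf("fractional bits  : %f bits/symbol\n", ceil(msg_bits) / (double)n);
    printf("1 bit per symbol : %f bits/symbol\n", 1.0);
    return 0;
}
```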