I recently added a quote randomizer to my website’s homepage. Upon each page load, the randomizer chooses a quote to display from a pool of about 35 options—usually from literature, but sometimes from the band Geese (which counts as literature, let’s be honest).

As I was testing the randomizer, repeatedly refreshing locally to see how the different lengths were getting formatted, I started to wonder about the average number of page reloads before a user sees all 35 quotes. Is it 35 times, 70 times, 100, 200, 500? It turns out that this maps exactly to a well-known problem in probability called the coupon collector’s problem—so called because you calculate how many cereal boxes you need to buy on average before you’ve collected all the different types of coupons within. Or yet another formulation: How many McDonald’s Happy Meals did you need to buy in the early ’90s before you collected all the Super Mario 3 toys? (I never did. Also this assumes that all four were on hand at the drive-thru, and you were randomly assigned one with equal probabilities for each. I just kept getting Luigi on a cloud.)

McDonald’s placemat for the Super Mario 3 Happy Meal. Source: kidsmeal.fandom.com

In this post, I cover the expectation calculation of this special random variable with its limiting behavior, as well as the CDF with its limiting behavior. The outline:

(1) The exact formula for the expectation of the coupon-collector random variable

(2) The $O(n \log n)$ approximation of (1)

(3) The exact formula for the CDF (“What is the probability that we’ve covered all 35 quotes after $t$ refreshes?”)

(4) The approximation of (3) with the Gumbel distribution (I finally encountered this guy organically in the wild!)

(1) The coupon collector’s problem

Let $T$ be the random variable we’ve introduced: either the number of cereal boxes or Happy Meals to buy before collecting all the things, or the number of refreshes to hit before all $n = 35$ quotes are seen. $T$’s distribution actually doesn’t have its own name! I wish I could say that “$T$ is distributed coupon-collector with parameter $n = 35$,” but alas. Let’s just call it the CC distribution.

I’ll stick with page reloads as my example from here on out. What we can say about $T$ is that it can be decomposed as the sum of $n$ independent, geometrically distributed random variables—but *not* with identical parameters. So we can write $T = \tau_1 + \cdots + \tau_n$, where the individual $\tau_i$ terms are all differently parameterized geometric random variables.

Now let’s say we’ve seen 20 unique quotes so far. We’re claiming that the r.v. $\tau_{21}$ is geometric, so let’s define it as “the number of page reloads up to and including the first success—i.e., the 21st unique quote.” The probability of success on each individual Bernoulli trial is $p_{21} = 1 - 20/35 = 15/35$, because a “success” in this setup has to be one of the 15 quotes you have left to see. The general formula for $p_i$ is then

[ p_i = 1 - \frac{i - 1}{n} = \frac{n - i + 1}{n} ]
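As a sanity check on this decomposition, here is a minimal simulation sketch (mine, not part of the original setup) that builds one draw of $T$ stage by stage, drawing each $\tau_i$ as a Geometric($p_i$):

```python
import random

def simulate_T_stages(n):
    """One draw of T as a sum of n geometric stage lengths."""
    total = 0
    for i in range(1, n + 1):
        p = (n - i + 1) / n  # chance a refresh shows a not-yet-seen quote
        # Geometric(p): number of refreshes up to and including the first success
        trials = 1
        while random.random() >= p:
            trials += 1
        total += trials
    return total
```

Averaging many draws of `simulate_T_stages(35)` should land near the exact expectation derived below.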

The original thing we wanted to calculate was $\mathbf{E}[T]$. So we write, using linearity of expectations as well as the mean of a geometric r.v.:

[ \begin{aligned} \mathbf{E}[T] & = \mathbf{E}[\tau_1 + \cdots + \tau_n] \ & = \mathbf{E}[\tau_1] + \cdots + \mathbf{E}[\tau_n] \ & = \frac{1}{p_1} + \cdots + \frac{1}{p_n} \ & = \sum_{i=1}^n \frac{n}{n - i + 1} \ & = n \left(\frac{1}{n - 1 + 1} + \cdots + \frac{1}{n - n + 1} \right) \ & = n \sum_{i=1}^n \frac{1}{i} \ & = n H_n \end{aligned} ]

The sum of reciprocals of the first $n$ positive integers has a special name: the $n$th harmonic number, $H_n$. The sequence of harmonic numbers begins:

[ 1, \frac{3}{2}, \frac{11}{6}, \frac{25}{12}, \frac{137}{60}, \frac{49}{20}, \frac{363}{140}, \ldots ]

Even at the low value of $n = 35$, the exact fraction of $H_{35}$ is unwieldy.

[ \mathbf{E}[T] = 35 \cdot \frac{54437269998109}{13127595717600} \approx 145.13735 ]

So you’d have to refresh my homepage 145 times on average to see all 35 quotes. And I had to look up $H_{35}$ in a series table. But if I didn’t have that table, and we wanted to calculate $\mathbf{E}[T]$ in constant time, that’s when we can turn to asymptotics.
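For what it’s worth, no lookup table is strictly needed if an $O(n)$ loop is acceptable. A short sketch with Python’s fractions module (exact, but not constant time):

```python
from fractions import Fraction

n = 35
H_n = sum(Fraction(1, i) for i in range(1, n + 1))  # exact n-th harmonic number
expected_T = n * H_n
print(expected_T)         # exact rational value of E[T]
print(float(expected_T))  # ≈ 145.1373...
```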

(2) Continuous approximation of $\mathbf{E}[T]$

The harmonic numbers can be approximated by the (natural) log function, since summing up $1/i$ terms to get $H_n$ is like integrating $\int_{1}^{n} dx/x$, which gives you $\log n$. In fact, it can be shown that the error between the harmonic numbers and this log approximation converges to a constant called $\gamma$.

[ \lim_{n\rightarrow\infty} \left(\sum_{i=1}^n \frac{1}{i} - \log n \right) = \gamma ]

The details are mechanical, but you just have to show that the sequence is monotonically decreasing and bounded below—Baby Rudin stuff that no one wants to revisit anymore at this point in life. This $\gamma$ limit has a special name: the Euler constant or the Euler–Mascheroni constant, approximately equal to 0.577. Therefore, a simple and accurate approximation for $\mathbf{E}[T]$ is

[ \begin{aligned} \mathbf{E}[T] & = n H_n \ & \approx n(\log n + \gamma) \end{aligned} ]

We can calculate this approximation at $n = 35$ using a package like numpy, which implements the Euler–Mascheroni constant.

import numpy as np

print(35 * (np.log(35) + np.euler_gamma))

We get 144.6397, which rounds to the same whole number as the exact solution above, 145 refreshes—but without any table lookups and in constant time.
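Incidentally, most of the gap between 144.6397 and the exact 145.1373 is the next term in the asymptotic expansion of $H_n$: adding the standard $1/(2n)$ refinement to $H_n$ contributes $n \cdot \frac{1}{2n} = \frac{1}{2}$ to $\mathbf{E}[T]$. A quick check (the $1/(2n)$ term is a well-known refinement, not something derived above):

```python
import math

n = 35
gamma = 0.5772156649  # Euler–Mascheroni constant
refined = n * (math.log(n) + gamma) + 0.5  # E[T] ≈ n(log n + γ + 1/(2n))
print(refined)  # ≈ 145.1397, much closer to the exact 145.1373
```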

(3) CDF of the CC r.v.

The cumulative distribution function of $T$ is the function $\Pr(T \leq t)$. For example, if I’ve refreshed $t = 100$ times, what’s the probability that I’ve seen all 35 quotes by that point?

The exact form of this CDF has to be derived from the inclusion–exclusion principle. (Remember that $T$ is an odd construction: a sum of geometric r.v.’s with different parameters. It makes sense that the CDF feels “custom” and combinatoric.) Let $A_i$ be the event that we never draw a specific quote $i$ after $t$ refreshes. If you union all the $A_i$ events, that is saying “never draw quote 1 OR never draw quote 2 OR … never draw quote $n$.” That union is exactly the complement of “I have seen all the quotes,” because the negation of “I have seen all the quotes” means that at least one of those OR conditions is true.

Thus, the CDF of $T$ is 1 minus the probability of unioning all the $A_i$ events.

[ \Pr(T \leq t) = 1 - \Pr \left(\bigcup_{i=1}^{n}A_i \right) ]

Now we can directly apply the inclusion–exclusion principle.

1. Inclusion: The first step is to add up all the $\Pr(A_i)$ terms. The probability that you never see quote $i$ over 100 refreshes is $\left(\frac{35 - 1}{35}\right)^{100}$, and there are 35 instances of this. The sum of all $\Pr(A_i)$ terms is therefore

[ 35 \cdot \left(\frac{35 - 1}{35}\right)^{100} ]

2. Exclusion: We now have to subtract all the two-way intersections. The probability that you never see two specific quotes $i$ and $j$ over 100 refreshes is $\left(\frac{35 - 2}{35}\right)^{100}$, and there are $\binom{35}{2}$ combinations of $i$ and $j$. So from the running total, we will be subtracting (excluding)

[ \binom{35}{2} \left(\frac{35 - 2}{35}\right)^{100} ]

This is a clean pattern. I will replace 35 with $n$ and 100 with $t$ for the general case now, which we can write as

[ \begin{aligned} \Pr(T \leq t) & = 1 - \Pr \left(\bigcup_{i=1}^{n}A_i \right) \ & = 1 - \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} \left(\frac{n - k}{n}\right)^{t} \ & = 1 + \sum_{k=1}^{n} (-1)^{k} \binom{n}{k} \left(\frac{n - k}{n}\right)^{t} \ & = \sum_{k=0}^{n} (-1)^{k} \binom{n}{k} \left(\frac{n - k}{n}\right)^{t} \end{aligned} ]

This is a very compact expression for the CDF!

Unfortunately, this very compact expression hides a lot of arithmetic: it’s an alternating sum of $n + 1$ terms with enormous binomial coefficients, so calculating the exact solution gets into numerical stability issues for large $n$ and $t$ that are beyond my expertise. But a naïve Python implementation would go:

import math

def inclusion_exclusion(n, t):
    """Exact CDF Pr(T <= t): probability all n quotes are seen within t refreshes."""
    total = 0.0
    for k in range(n + 1):
        # k-th term: C(n, k) ways to pick k quotes that all go unseen for t refreshes
        total += ((-1) ** k) * math.comb(n, k) * ((n - k) / n) ** t
    return total

inclusion_exclusion(35, 100) returns 0.1133. So there is an 11% chance that you’ve seen all the quotes after 100 refreshes. The lesson is to keep visiting my website.

Are you curious what happens if instead of 100 refreshes, we plug in the $\mathbf{E}[T]$ value of 145 that we got earlier? Maybe about 50% probability that you’ve seen everything by this mean of 145 refreshes? inclusion_exclusion(35, 145) returns 0.58. The CC distribution is actually right-skewed, so the mean is higher than the 50th percentile. Moreover, that 0.58 is tantalizingly close to $\gamma$ 👀👀
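Both of those numbers are easy to sanity-check with a quick Monte Carlo run (a sketch; the sample size and seed are arbitrary):

```python
import random

def simulate_T(n):
    """Refresh until all n quotes have been seen; return the refresh count."""
    seen = set()
    refreshes = 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        refreshes += 1
    return refreshes

random.seed(0)
samples = [simulate_T(35) for _ in range(20_000)]
print(sum(samples) / len(samples))                     # near E[T] ≈ 145.1
print(sum(s <= 100 for s in samples) / len(samples))   # near 0.1133
print(sum(s <= 145 for s in samples) / len(samples))   # near 0.58
```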

(4) Gumbel distribution

I want to find the continuous, limiting case of our CC random variable, $T$, as $n \rightarrow \infty$ so that we’re not dealing with numerical instability. Basically I want to approximate inclusion_exclusion() with a constant-time expression. In reading about this topic, I learned about the Fisher–Tippett–Gnedenko theorem. Amazingly, the central limit theorem has a kind of analogue when you are looking at the maximum of a set of random variables instead of the sum of them. Slightly edited from Wikipedia:

Let $X_1, X_2, \ldots, X_n$ be an $n$-sized sample of independent and identically-distributed random variables. Suppose that there exist two sequences of real numbers $a_n > 0$ and $b_n \in \mathbb{R}$ such that the following limit converges to a non-degenerate distribution function:

[ \lim_{n \rightarrow \infty} \mathbb{P} \left(\frac{\max \{ X_1, \ldots, X_n \} - b_n}{a_n} \leq x \right) = G(x) ]

In such circumstances, the limiting function $G$ is the cumulative distribution function of a distribution belonging to either the Gumbel, the Fréchet, or the Weibull distribution family.

The idea is that, if you take our CC-distributed $T$, subtract out some centering, and then divide by some scale, this will converge in the limit to a Gumbel distribution in our case. I won’t get into the differences with the Fréchet and Weibull cases. (All of this is analogous to the CLT, where you take a sum of i.i.d. r.v.’s, subtract out $n\mu$, and finally divide by $\sigma\sqrt{n}$ to get the convergence in distribution to the Gaussian.)

The thing is, we don’t have i.i.d. geometric r.v.’s in our original setup, and we haven’t introduced a max of r.v.’s yet. I have to get a little hand-wavy here, but we can adapt our setup to a Poisson approximation without changing $T$’s limiting Gumbel distribution. So let’s go about this section in two ways: (a) doing the math directly and manipulating expressions from our original setup (which might seem poorly motivated but is analytically clearer), and (b) changing to a setup with the Poisson and exponential distributions (which might seem hand-wavy by appearing to change the problem, but ultimately has a nice story).

(4a) Directly manipulating the exact CDF

The CDF of the standard Gumbel is $F(x) = e^{-e^{-x}}$ (YEAH THAT’S RIGHT). We can get to this by centering and scaling our $T$ and then finding the limiting distribution. Instead of $\Pr(T \leq t)$, we are going to recenter and rescale this as

[ \Pr\left(\frac{T - n \log n}{n} \leq x \right) ]

where $x = (t - n \log n)/n$ is our new, “standardized” variable. This substitution isn’t arbitrary: notice that it’s kind of like getting a $z$-score when working with the CLT. Furthermore, $n \log n$ is also specially chosen: remember that it’s the leading term in the asymptotic expression for $\mathbf{E}[T]$. It’s so relevant that it happens to make the following math work out perfectly 😅

Look back again at our original, exact CDF:

[ \begin{aligned} \Pr(T \leq t) & = \sum_{k=0}^{n} (-1)^{k} \binom{n}{k} \left(\frac{n - k}{n}\right)^{t} \ & = \sum_{k=0}^{n} (-1)^{k} \binom{n}{k} \left(1 - \frac{k}{n}\right)^{t} \ \end{aligned} ]

Substituting $t = n \log n + nx$, we get

[ \Pr\left(\frac{T - n \log n}{n} \leq x \right) = \sum_{k=0}^{n} (-1)^{k} \binom{n}{k} \left(1 - \frac{k}{n}\right)^{n \log n + nx} ]

Let’s work with the binomial coefficient first. Rewrite it as

[ \begin{aligned} \binom{n}{k} & = \frac{n(n - 1)\cdots(n-k+1)}{k!} \ & = \frac{n \cdot n(1 - 1/n) \cdots n\left(1-\frac{k-1}{n}\right)}{k!} \end{aligned} ]

For large $n$, this behaves like $n^{k}/k!$, as everything in parentheses goes to 1.

Now let’s rewrite the $(1 - k/n)^{n \log n + nx}$ term. If we log and exponentiate, we have

[ \exp \left((n \log n + nx) \log \left(1 - \frac{k}{n} \right) \right) ]

For the $\log(1 - k/n)$ factor, the Taylor series expansion of $\log(1-u)$ with $u = k/n$ gives $-k/n + O(1/n^2)$. Multiplying through by $n \log n + nx$, we get

[ \exp \left(-k \log n -kx + O\left(\frac{\log n}{n} \right) \right) \rightarrow n^{-k} e^{-kx} ]

Putting this all together with the limit of the binomial coefficient, we finally obtain

[ \begin{aligned} \lim_{n \rightarrow \infty} \Pr\left(\frac{T - n \log n}{n} \leq x \right) & = \sum_{k=0}^{\infty} (-1)^{k} \frac{n^{k}}{k!} \cdot n^{-k} e^{-kx} \ & = \sum_{k=0}^{\infty} \frac{(-1)^{k} e^{-kx}}{k!} \ & = \sum_{k=0}^{\infty} \frac{(-e^{-x})^{k}}{k!} \ & = e^{-e^{-x}} \end{aligned} ]

where the last step is from the Taylor series of the exponential function, but for $-e^{-x}$ itself. We’ve finally gotten to the Gumbel CDF, even if the original substitution for centering and scaling seemed out of the blue.

Let’s now use this as an approximation of our previous inclusion_exclusion(). If you want to keep the original args of $n$ and $t$, you have to calculate the change of variable within the function.

import math

def gumbel_cdf_approx(n, t):
    x = (t - n * math.log(n)) / n
    return math.exp(-math.exp(-x))

gumbel_cdf_approx(35, 100) returns 0.1340. A little off from the exact solution of 0.1133 that we got earlier, but the approximation gets better as $n \rightarrow \infty$, and this is numerically stable.

For the final subsection, let’s approach the problem from an alternative angle: less brute-force analytical, and arguably more elegant by altering the distributions at play.

(4b) Poisson and exponential story

A continuous-time version of the coupon collector’s problem makes these alterations:

  • The refreshes are now a rate-1 Poisson counting process.
  • Instead of the geometric r.v.’s $\tau_1, \ldots, \tau_n$, we’re going to consider a continuous quantity. Let’s redefine these r.v.’s as “the time until quote $i$ is first seen.” Call them $\xi_1, \ldots, \xi_n$ to emphasize the change. The individual quotes have independent exponential arrival times with rate parameter $\lambda = 1/n$.
  • We are interested in the slowest quote to show up, which happens to be the max of all the exponential random variables. Thus, let $\xi = \max \{\xi_1, \ldots, \xi_n \}$ be this max exponential time of the slowest quote.
    • By the Fisher–Tippett–Gnedenko theorem, this will converge to one of the three extreme-value distributions—in our case, the Gumbel.

This was a conceptual leap for me, but it works and makes the original setup a little more elegant. Instead of a discrete-time setup with geometric r.v.’s of different parameters, the story gets cleaner and we have i.i.d. exponential r.v.’s. And in this adapted version, we’re now looking at the problem through the lens of a max of i.i.d. random variables (slowest individual arrival time) instead of a sum of non-i.i.d. ones (sum total of refreshes to see all the quotes).
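Before doing the math, here is a quick numerical sketch of this setup (mine; the sample count and seed are arbitrary): simulate $\xi$ as the max of $n$ i.i.d. exponential draws and estimate $\Pr(\xi \leq n \log n)$, which should already be close to the Gumbel limit $e^{-e^{0}} = e^{-1} \approx 0.368$:

```python
import math
import random

n = 35
random.seed(1)
# xi = arrival time of the slowest quote: the max of n i.i.d.
# Exponential(rate = 1/n) draws (random.expovariate takes the rate)
draws = [max(random.expovariate(1 / n) for _ in range(n)) for _ in range(10_000)]
frac = sum(d <= n * math.log(n) for d in draws) / len(draws)
print(frac)  # should be close to the Gumbel limit e^{-1} ≈ 0.368
```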

The CDF of the exponential distribution is wonderfully simple: $F(t) = 1 - e^{-\lambda t}$. With a rate of $\lambda = 1/n$ and with $n$ independent r.v.’s, we can take the product of that CDF, $n$ times.

[ \begin{aligned} \Pr(\xi \leq t) & = \prod_{i=1}^{n} \Pr(\xi_i \leq t) \ & = \left(1 - e^{-t/n} \right)^{n} \end{aligned} ]

With the same $t = n \log n + nx$ substitution that we made in the previous subsection, we get

[ \begin{aligned} \Pr(\xi \leq n \log n + nx) & = \left(1 - e^{-(n \log n + nx)/n} \right)^{n} \ & = \left(1 - e^{-x - \log n} \right)^{n} \ & = \left(1 - e^{-x}/n \right)^{n} \ & \rightarrow e^{-e^{-x}} \end{aligned} ]

We once again obtain the standard Gumbel CDF. In the previous subsection, the Taylor series of the exponential function (of $-e^{-x}$) gave us the Gumbel CDF. Here, it’s that other fundamental definition of the exponential function that makes the last step work: $\lim_{n \rightarrow \infty} \left(1 + \frac{u}{n}\right)^{n} = e^{u}$, here with $u = -e^{-x}$, a.k.a. continuous compounding for the finance bros.
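The product-form CDF above can itself be used as a numerically stable approximation at finite $n$, if we treat $t$ refreshes as $t$ units of Poisson time (a sketch; that identification of discrete refreshes with continuous time is itself an approximation):

```python
import math

def poisson_model_cdf(n, t):
    """Pr(max of n i.i.d. Exponential(rate 1/n) arrival times <= t)."""
    return (1 - math.exp(-t / n)) ** n

print(poisson_model_cdf(35, 100))  # ≈ 0.126, vs the exact 0.1133 and the Gumbel 0.1340
```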

(5) Maxmaxxing

Section 4 shows two ways of looking at the limiting case of the CDF of the coupon-collector random variable. Who knew that maxes of random variables had their own convergence theorem? And all of this because I wanted people to read all the quotes on my website. Call me Melville, who put all those Extracts at the beginning of Moby-Dick.