ExRandom 3.0
Description of the algorithm

Here we give a brief description of Algorithm N for sampling exactly from the unit normal distribution. For details, see Karney (2016). The description incorporates two improvements given by Du et al. (2020).

Here is the improved Algorithm N.

  • N1. [Sample square of integer part of deviate \(n = k^2\).] Select integer \(n \ge 0\) with probability \(\exp(-\frac12 n) (1 - 1/\sqrt e)\).
  • N2. [Test whether \(n\) is a perfect square.] If \(n\) is a perfect square set \(k \leftarrow \sqrt n\); otherwise go to step N1.
  • N3. [Sample fractional part of deviate \(x\).] Set \(x \leftarrow U\).
  • N4. [First adjustment of the relative probability of \(x\).] Accept \(x\) with probability \(\exp(-kx)\); otherwise go to step N1.
  • N5. [Second adjustment of the relative probability of \(x\).] Accept \(x\) with probability \(\exp(-\frac12 x^2)\); otherwise go to step N1.
  • N6. [Combine integer and fraction and assign a sign.] Set \(y \leftarrow k + x\); then with probability \( \frac12 \), set \(y \leftarrow -y\).
  • N7. [Return result.] Set \(N \leftarrow y\).

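As a guide to the control flow, here is a minimal C++ sketch of the improved Algorithm N. It replaces the exact u-rand based Bernoulli trials with ordinary double-precision arithmetic and explicit probabilities, so unlike exrandom it is not an exact sampler; the function name is ours.

```cpp
#include <cmath>
#include <random>

// A floating-point sketch of the improved Algorithm N; the Bernoulli
// trials are done with explicitly computed probabilities instead of the
// exact trials E(x) and B(x) described below.
double unit_normal_sketch(std::mt19937& g) {
  std::uniform_real_distribution<double> U(0.0, 1.0);
  const double p = std::exp(-0.5);          // 1/sqrt(e)
  for (;;) {
    int n = 0;                              // N1: P(n) = (1 - p) p^n
    while (U(g) < p) ++n;
    int k = int(std::lround(std::sqrt(double(n))));
    if (k * k != n) continue;               // N2: require n = k^2
    double x = U(g);                        // N3: fractional part
    if (U(g) > std::exp(-k * x - 0.5 * x * x))
      continue;                             // N4 + N5: accept with exp(-x(2k+x)/2)
    double y = k + x;                       // N6: combine ...
    return U(g) < 0.5 ? -y : y;             //     ... and attach a sign; N7: return
  }
}
```
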
There are two changes compared to Karney (2016).

  • Steps N1 and N2 sample \(k\) with relative probability \( \exp(-\frac12 k^2) \). This change, a reformulation of the change proposed by Du et al. (2020), provides an improvement of steps N1 and N2 in Karney (2016).
  • In steps N4 and N5, the pair \((k,x)\) is accepted with probability \( \exp(-kx) \exp(-\frac12 x^2) = \exp\bigl(-\frac12 x(2k+x)\bigr) \). This change, also suggested by Du et al., performs the same function as step N4 in Karney (2016), but more efficiently. Step N6 combines steps N5 and N6 in Karney (2016).

The new step N4 is simple to implement:

  • N4. [Break N4 into \(k\) steps.] Perform up to \(k\) Bernoulli trials, \(E(x)\); accept \(x\) if they all return \(\mathrm{true}\); otherwise go to step N1.

The Bernoulli trial \(E(x)\) returns \(\mathrm{true}\) with probability \(\exp(-x)\) for \(x \in (0,1]\). It is implemented by Algorithm E:

  • E1. [Generate a run.] Sample a sequence of uniform deviates \(U_1, U_2, \ldots\) and determine the maximum value \(n \ge 0\) such that \(x > U_1 > U_2 > \ldots > U_n\).
  • E2. [Test length of run.] Set \(E(x) \leftarrow (\text{$n$ is even}).\)

Here \(U\) is a uniform deviate \(U \in (0,1)\).
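
In code, Algorithm E is a simple counting loop. Here is a sketch with doubles standing in for u-rands (exrandom's actual implementation instead works digit by digit; the function name is ours):

```cpp
#include <random>

// Returns true with probability exp(-x) for x in (0, 1]
// (von Neumann, 1951).
bool trial_E(double x, std::mt19937& g) {
  std::uniform_real_distribution<double> U(0.0, 1.0);
  double prev = x;
  int n = 0;                    // E1: length of the run x > U_1 > ... > U_n
  for (double u = U(g); u < prev; u = U(g)) {
    prev = u;
    ++n;
  }
  return n % 2 == 0;            // E2: true if the run length is even
}
```

Since the probability that \(x > U_1 > \cdots > U_n\) is \(x^n/n!\), the probability that the run length is even telescopes to \(\sum_n (-x)^n/n! = \exp(-x)\).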

Step N5 is implemented as a specialization of step N4 in Karney (2016) for the case \(k = 0\). This is the Bernoulli trial \(B(x)\), which returns \(\mathrm{true}\) with probability \(\exp(-\frac12 x^2)\) for \(x \in (0,1)\); it is realized by Algorithm B, given below. Step B2(b) generates a sequence of uniform deviates \(U_1, U_2, \ldots\) such that \(x > U_1 > U_2 > \ldots\); similarly, step B2(c) generates a sequence of uniform deviates \(V_1, V_2, \ldots\) such that \(V_i < x\); at each step we also include a coin toss, step B2(a).

  • B1. [Initialize loop.] Set \(y \leftarrow x\), \(n \leftarrow 0\).
  • B2. [Generate and test next samples.]
    (a) [The coin toss] With probability \( \frac12 \), go to step B4.
    (b) [The \(U_i\) sequence] Sample \(z \leftarrow U\); go to step B4, unless \(z < y\).
    (c) [The \(V_i\) sequence] Sample \(r \leftarrow U\); go to step B4, unless \(r < x\).
  • B3. [Increment loop counter and repeat.] Set \(y \leftarrow z\), \(n \leftarrow n + 1\); go to step B2.
  • B4. [Test length of runs.] Set \(B \leftarrow (\text{$n$ is even})\).

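Algorithm B transcribes directly into code; again doubles stand in for u-rands and the coin toss (a sketch, with our naming):

```cpp
#include <random>

// Returns true with probability exp(-x^2/2) for x in (0, 1).
bool trial_B(double x, std::mt19937& g) {
  std::uniform_real_distribution<double> U(0.0, 1.0);
  double y = x;                  // B1
  int n = 0;
  for (;;) {                     // B2
    if (U(g) < 0.5) break;       // B2(a): the coin toss
    double z = U(g);             // B2(b): the U_i sequence
    if (!(z < y)) break;
    double r = U(g);             // B2(c): the V_i sequence
    if (!(r < x)) break;
    y = z; ++n;                  // B3
  }
  return n % 2 == 0;             // B4
}
```
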
The probability that step B2 is entered with a particular value of \(n\) is

\[ \frac1{2^n} \times \frac{x^n}{n!} \times x^n, \]

where the separate factors are due to the \( n \) previous invocations of steps B2(a), B2(b), and B2(c). This realization of step N5 is slightly different from that proposed by Du et al. (2020).

These improvements are easily migrated into the implementation of sampling from the discrete normal distribution, discrete_normal_dist.

Splitting the algorithm into stages

To analyze the algorithm further, let us regard the algorithm as successively generating the following distributions:

  • The input to step N1, a Bernoulli trial, \( f_0(k) \).
  • The output of step N1, an exponential distribution, \( f_1(k) \).
  • The output of step N2, a discrete normal distribution with positive support, \( f_2(k) \).
  • The output of step N3, \( f_2 \) extended to the reals, \( f_3(y) \).
  • The output of step N4, the first adjustment towards the normal distribution, \( f_4(y) \).
  • The output of step N5, the normal distribution with positive support, \( f_5(y) \).
  • The output of step N6, the normal distribution, \( f_6(y) \).

Explicit formulas for these distributions are

\[ \begin{align} f_0(k) &= \begin{cases} 1-p, & \text{if $k=0$,} \\ p, & \text{if $k=1$,} \end{cases}\\ f_1(k) &= \frac{p^k}{C_1}, \quad\text{for $ k \ge 0$,}\\ f_2(k) &= \frac{p^{k^2}}{C_2}, \quad\text{for $ k \ge 0$,}\\ f_3(y) &= f_2(k) = \frac{\exp(-\frac12 k^2)}{C_3}, \quad\text{for $ y \ge 0$,}\\ f_4(y) &= \frac{\exp(-\frac12 k^2) \exp(-kx)}{C_4}, \quad\text{for $ y \ge 0$,}\\ f_5(y) &= \frac{\exp(-\frac12 y^2)}{C_5}, \quad\text{for $ y \ge 0$,}\\ f_6(y) &= \frac{f_5(\left|y\right|)}2, \quad\text{for all $ y$.}\\ \end{align} \]

Here \( p = 1/\sqrt e \) and \( x = y - k \), with \( k = \lfloor y \rfloor \). We also define the constants,

\[ \begin{align} C_0 &= 1/(1-p)^2 = 6.459192,\\ C_1 &= 1/(1-p) = 2.541494,\\ C_2 &= \sum_{k=0}^\infty p^{k^2} = 1.753314,\\ C_3 &= C_2 ,\\ C_4 &= 1 + \sum_{k=1}^\infty p^{k^2} \frac{1-p^{2k}}k = 1.445512,\\ C_5 &= \sqrt{\pi/2} = 1.253314,\\ C_6 &= C_5 . \end{align} \]

Here \(C_i\) for \( i \in [1, 5] \) are the normalizing constants for the corresponding distributions \( f_i \). \(C_0\) and \(C_6\) are added so that the ratios \( C_i / C_j \) for \( i < j\) give the average number of samples from \( f_i \) needed to produce one sample from \( f_j \).

We see that \( C_2 \approx C_5 + \frac12 \) quite accurately. This reflects the fact that the rectangle rule yields a good result for the integral of the normal distribution even when the step size is 1. We also remark that the infinite sums for \(C_2\) and \(C_4\) and those given below converge very rapidly.

Later on, we will need the entropies of these distributions. The entropy, here measured in bits, is defined as the mean value of \(-\log_2(f_i)\). We have

\[ \begin{align} H_0 &= -p\log_2 p - (1-p)\log_2(1-p) = 0.967002,\\ H_1 &= H_0/(1-p) = 2.457630,\\ H_2 &= 1.325722,\\ H_3 &= H_2 = 1.325722,\\ H_4 &= 1.059279,\\ H_5 &= \log_2(\sqrt{\pi e/2}) = 1.047096,\\ H = H_6 &= \log_2(\sqrt{2\pi e}) = H_5 + 1 = 2.047096 .\\ \end{align} \]
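
The infinite sums here and above (for \(C_2\), \(C_4\), and \(H_2\)) converge so rapidly that a handful of terms gives full double precision; a throwaway numerical check:

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const double p = std::exp(-0.5);
  double C2 = 0, C4 = 1, H2 = 0;
  for (int k = 0; k < 10; ++k)                  // C_2 = sum p^(k^2)
    C2 += std::pow(p, double(k) * k);
  for (int k = 1; k < 10; ++k)                  // C_4 = 1 + sum p^(k^2)(1-p^(2k))/k
    C4 += std::pow(p, double(k) * k) * (1 - std::pow(p, 2.0 * k)) / k;
  for (int k = 0; k < 10; ++k) {                // H_2 = mean of -log2 f_2
    double f = std::pow(p, double(k) * k) / C2;
    H2 -= f * std::log2(f);
  }
  std::printf("C2 %.6f  C4 %.6f  H2 %.6f\n", C2, C4, H2);
  // C2 1.753314  C4 1.445512  H2 1.325722
}
```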

Coarse-grained analysis

Du et al. (2020) give a comprehensive analysis of the algorithm, accounting for the consumption of u-rands. Now, exrandom uses a u-rand to perform a coin toss, e.g., to select the sign of the normal deviate in step N6. Because such a coin toss uses only a single random digit when the base \(b\) is even (a "bit" when \(b = 2\)) and \(b/(b-1)\) digits, on average, when \(b\) is odd, we will account for this use of u-rands separately, as "coin tosses".
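
For illustration, here is a sketch (ours, not exrandom's actual code) of a coin toss built from base-\(b\) digits; for even \(b\) a single digit always suffices, while for odd \(b\) the top digit value must be rejected, giving \(b/(b-1)\) digits on average:

```cpp
#include <random>

// A coin toss implemented with base-b digits.
bool coin_toss(int b, std::mt19937& g) {
  std::uniform_int_distribution<int> digit(0, b - 1);
  for (;;) {
    int d = digit(g);
    if (b % 2 == 0 || d != b - 1)    // odd b: reject the top digit value
      return d % 2 == 0;             // the parity of the digit decides the toss
  }
}
```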

u-rands are used in steps N1, N3, N4, N5 (steps B2(b) and B2(c) in Algorithm B). Coin tosses are used in steps N5 (step B2(a) in Algorithm B) and N6.

Here is the accounting of coin tosses and u-rands:

  • N1: Step N1 uses u-rands through its sampling of \(f_0\). On average \(f_0\) is invoked \(C_0/C_1\) times by N1 and N1 is invoked \(C_1/C_6\) times to produce a normal deviate. Each invocation of \(f_0\) consumes \(u_0 = \sqrt e\) u-rands (von Neumann, 1951). So step N1 requires \(u_1 = u_0 C_0/C_1\) u-rands per invocation, or

    \[ U_1 = u_1 C_1/C_6 = 8.496998 \]

    u-rands per normal deviate.
  • N3: Step N3 uses \(u_3 = 1\) u-rand, which is returned as the fractional part of the normal deviate. N3 is called \(C_3/C_6\) times on average for each normal deviate. So the consumption of u-rands by N3 is

    \[ U_3 = u_3 C_3/C_6 = 1.398942 \]

    u-rands per normal deviate.
  • N4: In this step, \(E(x)\) is invoked up to \(k\) times. Each invocation of \(E(x)\) requires, on average, \(e^x\) u-rands (von Neumann, 1951). After \(n\) rounds, the (incoming) distribution of \(x\) is \(e^{-nx}\) and the test \(E(x)\) is applied (on the \((n+1)\)th round) for \(k \ge n+1\), to a fraction of samples \(F_{n+1}\), where

    \[ F_n = \sum_{k=n}^{\infty} f_2(k) \]

    is the complementary cumulative distribution function of \(f_2\), and the number of u-rands, averaged over \(x\), is \(e_{n-1}\), the average value of \(e^x \times e^{-nx}\), where

    \[ e_n = \int_0^1 e^{-nx}\, dx = \frac{1-e^{-n}}n, \]

    with the special case, \(e_0 = 1\). The total number of u-rands for each attempt at step N4 is then

    \[ u_4 = \sum_{n=0}^\infty e_{n-1} F_{n+1}. \]

    This quantity is the average number of u-rands needed for each attempt at step N4, i.e., per result from step N3. So the average number of u-rands consumed by step N4 per normal deviate is

    \[ U_4 = u_4 C_3/C_6 = 1.155795. \]

  • N5: In this step \(B(x)\) is invoked. In Algorithm B, we determine the average numbers of times steps B2(a), B2(b), B2(c) are run by summing up the probabilities that the steps are encountered at successive iterations of the main loop:

    \[ \begin{align} \text{B2(a):}\quad & \sum_{n=0}^\infty \frac1{2^n} \frac{x^n}{n!} x^n = \exp\biggl(\frac{x^2}2\biggr), \\ \text{B2(b):}\quad & \sum_{n=0}^\infty \frac1{2^{n+1}} \frac{x^n}{n!} x^n = \frac12\exp\biggl(\frac{x^2}2\biggr), \\ \text{B2(c):}\quad & \sum_{n=0}^\infty \frac1{2^{n+1}} \frac{x^{n+1}}{(n+1)!} x^n = \frac1x\biggl(\exp\biggl(\frac{x^2}2\biggr)-1\biggr). \end{align} \]

    The average number of coin tosses is given by the expression for B2(a), while the number of u-rands is given by summing the expressions for B2(b) and B2(c). These quantities need to be weighted by the distribution of \(x\) in the output from step N4, which is

    \[ g_4(x) = \sum_{k=0}^\infty f_4(k + x) = \frac1{C_4} \sum_{k=0}^\infty \exp(-\frac12 k^2) \exp(-kx). \]

    The mean number of coin tosses and u-rands per invocation of step N5 is thus

    \[ \begin{align} b_5 &= \int_0^1 \exp\biggl(\frac{x^2}2\biggr) g_4(x)\,dx, \\ u_5 &= \int_0^1 \biggl[\biggl(\frac12 + \frac1x\biggr)\exp\biggl(\frac{x^2}2\biggr) - \frac1x\biggr] g_4(x)\,dx. \end{align} \]

    The first integral can be performed in closed form (yielding an infinite sum over error functions); the second integral was performed numerically (setting the term in square brackets to \(\frac12\) in the limit \(x\rightarrow0\)). The mean number of coin tosses and u-rands used by step N5 per normal deviate is

    \[ \begin{align} B_5 &= b_5 C_4/C_6 = 1.358922,\\ U_5 &= u_5 C_4/C_6 = 0.987790. \end{align} \]

  • N6: This involves a single coin toss to determine the sign of the normal deviate, i.e.,

    \[ B_6 = b_6 = 1. \]

Combining these results, the average number of coin tosses required per random deviate is

\[ B_c = B_5 + B_6 = 2.358922, \]

and the average number of u-rands required

\[ U = U_1 + U_3 + U_4 + U_5 = 12.039525. \]

The subscript in \(B_c\) indicates this is the number of coin tosses.
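
The sums defining \(u_4\) and \(U_4\) are easily evaluated; here is a standalone numerical check (a sketch, independent of exrandom):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const double p = std::exp(-0.5);
  double C2 = 0;                                    // C_2 = C_3
  for (int k = 0; k < 20; ++k) C2 += std::pow(p, double(k) * k);
  const double C6 = std::sqrt(2 * std::atan(1.0)); // C_6 = sqrt(pi/2)
  // e_n = (1 - exp(-n))/n with e_0 = 1; the formula also gives e_{-1} = e - 1
  auto e = [](int n) {
    return n == 0 ? 1.0 : (1 - std::exp(-double(n))) / n;
  };
  double u4 = 0;
  for (int n = 0; n < 20; ++n) {
    double F = 0;                                   // F_{n+1} = sum_{k >= n+1} f_2(k)
    for (int k = n + 1; k < 40; ++k) F += std::pow(p, double(k) * k) / C2;
    u4 += e(n - 1) * F;
  }
  std::printf("u4 %.6f  U4 %.6f\n", u4, u4 * C2 / C6);
  // approximately: u4 0.826190  U4 1.155795
}
```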

It is easy to confirm these results by instrumenting the unit_normal_dist class to count the use of coin tosses and u-rands. Alternatively, unit_normal_dist can be run using random digits with a large base; the largest supported base is \(2^{32}\) (obtained by setting the template parameter b = 0 for rand_digit). With such a large base, unit_normal_dist will rarely (about 1 time in \(10^9\)) have to sample more than one digit of a u-rand; and because, in exrandom, coin tossing is implemented with u-rands, we expect the average number of random digits required to produce a normal deviate to be \(B_c + U\). However, this overlooks the fact that unit_normal_dist can return a result with no digits sampled. This happens when \(k = 0\) (so step N4 does nothing) and the first coin toss in step B2(a) causes Algorithm B to exit immediately, so that the u-rand \(\pm 0\ldots\) is returned as the normal deviate. The fraction of normal deviates with an "unsampled" fraction is \(1/\sqrt{2\pi} = 0.398942\) (a rather substantial fraction). So the number of random digits required to produce a normal deviate (as a u-rand) in the limit of a large base is about

\[ B_c + U - 1/\sqrt{2\pi} = 13.999505. \]

Here is a table comparing the current results (last column) with those of Karney (2016) and Du et al. (2020). The analysis of Karney (2016) was given in Du et al. (2020). The only change between Du et al. (2020) and the current method is in step N5.

Quantity    Karney, 2016    Du et al., 2020    current
\(B_5\)     0.953438        0                  1.358922
\(B_6\)     1               1                  1
\(B_c\)     1.953438        1                  2.358922
\(U_1\)     8.496998        8.496998           8.496998
\(U_2\)     2.635128        0                  0
\(U_3\)     1.398942        1.398942           1.398942
\(U_4\)     3.0695          1.155795           1.155795
\(U_5\)                     1.667251           0.987790
\(U\)       15.6005         12.718986          12.039525

The data for Karney (2016) are obtained as follows:

  • \(B_5\) is obtained using the same formula as above but including only the \(k=0\) term in the sum in the definition of \(g_4(x)\).
  • \(U_2\) is the number of u-rands consumed in the original step N2. This is given by \(U_2 = u_0 (C_1 - C_2) C_1/C_6\).
  • \(U_4\) is obtained from the analysis of Du et al., \(U_4 = 2.19414\, C_3/C_6\).

The data for Du et al. (2020) are found as follows:

  • \(U_5\) is obtained by replacing the factor \(\bigl(\frac12 + \frac1x\bigr)\) by \(\bigl(1 + \frac1x\bigr)\) in the expression for \(u_5\) above.

Bit-level analysis

The analysis in the previous section focused on measuring the consumption of u-rands. This is a reasonable tactic if randomness is obtained in big chunks, e.g., as a random unsigned int, which is equivalent to setting the base for random digits in exrandom to \(2^{32}\). However, the algorithm makes much more parsimonious use of random data when the base is 2 and the randomness is delivered as a stream of random bits. Unfortunately, the analysis of the algorithm at this level of detail is much more difficult. So here we just provide empirical data on the use of random bits. In the table below we compare the current version of the algorithm with Karney (2016) and Du et al. (2020). The quantities tabulated here are:

  • \(B_k\), the average number of bits used in steps N1 and N2 to sample the integer part of the normal deviate, \(k\).
  • \(B_x\), the average number of bits used in steps N3, N4, and N5 to sample the fractional part of the normal deviate, \(x\).
  • \(B_s = 1\), the number of bits used in step N6 to sample the sign of the normal deviate, \(s\).
  • \(B = B_k + B_x + B_s\), the total number of bits needed to produce a normal deviate, on average.
  • \(L\), the number of bits in the fraction of the u-rand representing the normal deviate.
  • \(C = B - L\), the "cost" of the algorithm. Adjusting \(B\) by \(L\) allows different algorithms for producing normal deviates to be compared fairly.
  • \(F = C - Q + p + 1\), the average number of bits needed to produce a correctly rounded double-precision representation of a normal deviate. Here \(Q = -0.416638\) is the mean floating-point exponent of a normal deviate and \(p = 53\) is the number of bits in the fractional part of a double-precision number.
  • \(T = C - H\), the "toll" of the algorithm, where \(H = 2.047096\) is the entropy (in bits) of the normal distribution. The adjustment by \(H\) allows algorithms for different random distributions to be compared.

Quantity    Karney, 2016    Du et al., 2020    current
\(B_k\)     19.2143         14.6660            14.6660
\(B_x\)      9.7858          8.7825             8.3523
\(B_s\)      1               1                  1
\(B\)       30.0002         24.4486            24.0183
\(L\)        1.5558          1.7770             1.4423
\(C\)       28.4443         22.6715            22.5760
\(F\)                                          76.9926
\(T\)                                          20.5289

We see that the cost has improved by about 5.87 bits compared to Karney (2016) with most of the improvement, 4.55 bits, coming from the increased efficiency of sampling \(k\). The improvement of the current method over Du et al. (2020) is modest, 0.10 bit. The values for \(B_k\) are simply related to \(U_{1,2}\) in the previous section by

\[ B_k = B_0 (U_1 + U_2) / u_0, \]

where the division by \(u_0 = \sqrt e\) gives the number of invocations of the Bernoulli trial \(f_0(k)\) and \(B_0 = 2.84574\) is the average number of bits per trial (determined empirically).

We can determine various contributions to the toll:

  • In sampling \(k\) from the discrete normal distribution with positive support:
    • the extrinsic toll from imperfect sampling from \(f_0(k)\);
    • the intrinsic toll from the rejection method to obtain \(f_2(k)\) from \(f_1(k)\).
  • In sampling \(x\):
    • the intrinsic toll from first sampling \(k\) and a uniform \(x\) and then applying a rejection method;
    • the extrinsic toll from imperfect Bernoulli sampling to implement the rejections.

We use the terminology "perfect sampling" to refer to sampling with no toll (i.e., the number of bits used equals the entropy gain). Obviously, the step of assigning the sign \(s\) does not contribute to the toll: it takes one bit, but the full normal distribution has 1 bit of additional entropy compared to the one-sided normal distribution.

The contributions to the toll in sampling \(k\): Each normal deviate requires \(C_0/C_6\) samples of \(f_0(k)\). The toll for each invocation of \(f_0(k)\) is \(B_0 - H_0 = 1.87874\). The extrinsic toll per normal deviate is then

\[ T_{ke} = B_k - H_0 C_0/C_6 = 9.6824. \]

The intrinsic toll is the difference in the entropy provided by \(f_0(k)\) and the entropy in the resulting distribution \(f_2(k)\). This toll is then

\[ T_{ki} = H_0 C_0/C_6 - H_2 C_2/C_6 = 3.1290. \]

Summing these terms, we have

\[ T_k = T_{ke} + T_{ki} = B_k - H_2 C_2/C_6 = 12.8114 \]

as the toll from sampling \(k\).

The contributions to the toll in sampling \(x\): In addition to the entropy provided by the sampled \(k\), we need to perform Bernoulli trials with probability \(p(k,x) = \exp(-kx - \frac12 x^2)\). Assuming that these were perfectly sampled, the resulting entropy input is obtained by averaging the entropy of a single trial,

\[ h_b(k,x) = -p(k,x) \log_2 p(k,x) - (1 -p(k,x)) \log_2(1-p(k,x)), \]

over \(k\) and \(x\), to obtain

\[ H_b = \int_0^1 dx \sum_{k=0}^\infty f_2(k) h_b(k,x) = 0.629964. \]
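
This integral is easily checked by quadrature; a sketch using Simpson's rule:

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const double p = std::exp(-0.5);
  double C2 = 0;
  for (int k = 0; k < 20; ++k) C2 += std::pow(p, double(k) * k);
  auto h = [](double q) {          // entropy of a Bernoulli trial, in bits
    return q <= 0 || q >= 1 ? 0.0
         : -q * std::log2(q) - (1 - q) * std::log2(1 - q);
  };
  const int N = 1000;              // Simpson's rule with N subintervals
  double Hb = 0;
  for (int i = 0; i <= N; ++i) {
    double x = double(i) / N, s = 0;
    for (int k = 0; k < 20; ++k)   // sum_k f_2(k) h_b(k, x)
      s += std::pow(p, double(k) * k) / C2 * h(std::exp(-k * x - 0.5 * x * x));
    Hb += (i == 0 || i == N ? 1 : i % 2 ? 4 : 2) * s;
  }
  Hb /= 3.0 * N;
  std::printf("H_b = %.6f\n", Hb); // approximately 0.629964
}
```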

The intrinsic toll is then

\[ T_{xi} = (H_2 + H_b) C_3/C_5 - H_5 = 1.6888. \]

The extrinsic toll due to the imperfect Bernoulli trials is then

\[ T_{xe} = B_x - L - H_b C_3/C_5 = 6.0287 \]

and the total toll from sampling \(x\) is

\[ T_x = T_{xe} + T_{xi} = B_x - L + H_2 C_3/C_5 - H_5 = 7.7175. \]

For completeness, let us add the toll due to selecting the sign of the normal deviate,

\[ T_s = B_s + H_5 - H_6 = 0; \]

and finally the total toll is

\[ \begin{align} T &= B_k + B_x + B_s - L - H_6\\ &= T_{ke} + T_{ki} + T_{xe} + T_{xi} + T_s \\ &= 9.6824 + 3.1290 + 6.0287 + 1.6888 + 0 \\ &= T_k + T_x \\ &= 12.8114 + 7.7175 \\ &= 20.5289. \end{align} \]

We can attempt to continue this analysis to divide \(T_x\) between steps N4 and N5. However this is difficult because step N4 partially samples the fraction \(x\) and this affects the performance of step N5. Here is the raw empirical data which might be useful:

\[ \begin{align} B_{x4} &= 3.8627, \\ B_{x5} &= 4.4896, \\ L_4 &= 0.7566, \\ L_4' &= 0.6618, \\ L_5 &= 0.7805. \end{align} \]

Here \(B_{x4}\) and \(B_{x5}\) are the average numbers of bits consumed in steps N4 and N5 per normal deviate (\(B_x = B_{x4} + B_{x5}\)); \(L_4 C_5/C_4 = 0.6560\) is the average number of bits in the fraction of \(x\) after step N4 concludes successfully; \(L_4'\) is the average number of bits in the fraction of \(x\) at the start of step N5, counting only those values which are accepted by this step; and \(L_5\) is the number of bits added to the fraction of \(x\) by step N5 (\(L = L_4' + L_5\)).

A deeper analysis of the algorithm is daunting. The analysis of the much simpler algorithm for sampling from the exponential distribution given by Flajolet and Saheb runs to 26 pages. However we note that the biggest contribution to the toll is the inefficiency of Bernoulli sampling with probability \(p\), i.e., sampling \(f_0(k)\). Flajolet and Saheb mention a suggestion by Mike Paterson for how the toll might be reduced. In the current implementation, when two u-rands are compared, \(x < y\), 2 bits are drawn sequentially, one each from \(x\) and \(y\), until a mismatch is found. Paterson suggested instead that 1 bit be used to decide whether the corresponding digits of \(x\) and \(y\) match, with an additional bit used, only if there is a mismatch, to decide which u-rand is smaller. (Note that this idea applies only for base \(b = 2\).) Implementing this for \(f_0(k)\) is rather simple: it results in a reduction of \(B_0\), the average number of bits needed to sample \(f_0(k)\), by \(\delta B_0 = 0.25085\) (from \(2.84574\) to \(2.59489\)). \(f_0(k)\) is invoked an average of \(C_0/C_6 = 5.153690\) times per normal deviate, so the corresponding reduction in the toll is \(\delta T_{ke} = \delta B_0\, C_0/C_6 = 1.2928\), reducing the total toll to \(T = 19.2361\). Similar changes could be made in the implementations of steps N4 and N5.
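
To make Paterson's suggestion concrete, here is a sketch (our naming) of the two ways of deciding \(x < y\) for two fresh u-rands with base \(b = 2\):

```cpp
#include <random>

// Standard comparison: draw one bit of x and one bit of y per round
// until the bits differ.
bool less_standard(std::mt19937& g, long& bits) {
  for (;;) {
    int bx = int(g() & 1), by = int(g() & 1);
    bits += 2;
    if (bx != by) return bx < by;
  }
}

// Paterson's variant: one bit decides whether the next digits of x and y
// agree; only on a mismatch is a second bit drawn to order them.
bool less_paterson(std::mt19937& g, long& bits) {
  for (;;) {
    ++bits;
    if (g() & 1) {                // the digits differ (probability 1/2)
      ++bits;
      return (g() & 1) != 0;      // decide which of the two is smaller
    }
  }
}
```

For two fresh u-rands the standard method consumes 4 bits on average and Paterson's method 3. The saving realized within \(f_0(k)\) is smaller (\(\delta B_0 = 0.25085\)) because there successive comparisons involve a u-rand whose leading digits have already been generated.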
