Tuesday, April 3, 2012

The Reflexive Density Constant

Before proceeding further, it's useful to introduce what I call the "reflexive density constant", or "R". This may sound totally unrelated to error detection and secure hashes, but bear with me.

What is R? The best way to define it is procedurally:


1. Set all N bits in a bitfield to 0.

2. N times, choose one of the N bits uniformly at (true) random and set it to 1, regardless of its current value.

3. Count all the 1s in the bitfield. Divide by N. R is the limit of this ratio as N approaches infinity.


There is an infinite expression tree for R, which I don't have time to investigate. So I approximated it using a true random number generator. It appears to be close to 0.6321...
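
If you'd rather run the procedure than read it, here's a minimal sketch in Python. It substitutes Python's built-in pseudorandom generator for a TRNG (and the name reflexive_density is just mine), so it's an approximation of an approximation, but it lands near the same value:


import random

def reflexive_density(n, rng=random):
    # One run of the three-step procedure above on an n-bit bitfield.
    bits = [0] * n                   # step 1: all bits start at 0
    for _ in range(n):               # step 2: n overwrites at randomly chosen positions
        bits[rng.randrange(n)] = 1
    return sum(bits) / n             # step 3: fraction of 1s

print(reflexive_density(10**6))      # typically prints something near 0.632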

This value appears to equal 1-1/e. (Isn't transcendental reverse approximation cool? Thank you, WolframAlpha!) [EDIT: This turns out to be correct, as proven here.] Granted, this could all just be an elegant coincidence, as I have no formal proof. Either way, I surmise that an infinite Bayesian logic tree is at the heart of R. [EDIT: Alternatively, one could use the Jytter userspace true random number generator to arrive at an empirical approximation of R.]
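
For what it's worth, here's the back-of-the-envelope limit that makes 1-1/e plausible (the linked proof is the rigorous version). A given bit is still 0 at the end only if all N random choices miss it, so:


P(bit still 0) = (1 - 1/N)^N -> 1/e as N approaches infinity


and therefore the expected density of 1s tends to 1 - 1/e = 0.6321...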

1-1/e also comes up in the online stochastic bipartite matching problem. Is this significant? I don't know and don't have time to find out, but maybe you do.

Anyway, what about the name? "Reflexive" refers to choosing N bits at random from a bitfield which contains equally many bits. "Density" refers to the fraction of 1s. "Constant" reflects the apparent constancy of R in the infinite limit of N.

Now, how does this relate to pseudorandom number generators (PRNGs), EDCs, and secure hashes? Well, it turns out that compliance with R is a very good measure of the quality of each of these:

R AND PSEUDORANDOM NUMBER GENERATORS

For a PRNG to look convincing, its outputs must mimic those of an unbiased true random number generator (TRNG). We can ascertain the quality of this mimicry by running the test above using the PRNG instead of a TRNG. The result should be close to R. If it's much less than R, then the PRNG is getting trapped in a small subset of possible outputs (a tight orbit). If it's much more than R -- in the worst case, 1 -- then it's starting to act like an out-of-order counter.
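
As a hypothetical version of this test, the Python sketch below runs the bitfield procedure against three generators: Python's own PRNG, a counter dressed up as a generator, and a deliberately tight-orbit generator. Only the first should land near R:


import math, random

def r_compliance(next_index, n):
    # Fill an n-bit field by setting n chosen bits; return the density of 1s.
    bits = [0] * n
    for _ in range(n):
        bits[next_index(n) % n] = 1
    return sum(bits) / n

N = 1 << 16
R = 1 - 1 / math.e                                        # ~0.6321

print(r_compliance(lambda n: random.getrandbits(16), N))  # decent PRNG: near R

counter = [0]
def out_of_order_counter(n):
    counter[0] += 17                                      # strides through every residue mod N
    return counter[0]
print(r_compliance(out_of_order_counter, N))              # counter-like: density 1

orbit = [123]
def tight_orbit(n):
    orbit[0] = (orbit[0] * 5 + 1) % 64                    # only 64 distinct outputs
    return orbit[0]
print(r_compliance(tight_orbit, N))                       # tight orbit: far below R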

R AND ERROR DETECTION CODES

A good EDC requires a PRNG that meets the above criteria. Otherwise, who knows what data resonances might occur to weaken said EDC. All multiply-with-carry generators in the LMD family appear to come within 1% of R.
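
I can't reproduce the LMD constants here, but as a sketch of what "adhering to R" means in practice, here's the same test run against a generic Marsaglia-style multiply-with-carry generator (the 16-bit multiplier 36969 is an illustrative textbook choice, not an LMD parameter). A healthy generator should print a density within about a percent of R:


import math

def mwc16(seed=12345):
    # Generic 16-bit multiply-with-carry: the state packs the low 16 bits and the carry.
    z = seed
    while True:
        z = 36969 * (z & 0xFFFF) + (z >> 16)
        yield z & 0xFFFF

N = 1 << 16
g = mwc16()
bits = [0] * N
for _ in range(N):
    bits[next(g)] = 1                  # outputs already lie in [0, N)
print(sum(bits) / N, "vs R =", 1 - 1 / math.e)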

R AND (SECURE) HASH FUNCTIONS

Consider the hash function H(L), where L is a message and H(L) is its hash. Given L', a derivative of L which is probably only slightly different, e.g. a single bit flipped, we can crack this hash by finding some integer U such that:


H(L)=H(L') xor U


Simply xor U into H(L'), and presto, L' appears to be legit. (The xor could be replaced by addition or some other operator, depending on the cryptographic context. The point is, if it's easy to find U, the "difference" of the hashes, then it's easy to crack the hash.)
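
As a toy illustration of the xor case, with CRC-32 standing in for H (it's conveniently available and conveniently weak; the messages are made up):


import zlib

L  = b"pay alice 100 dollars"
Lp = b"pay alice 900 dollars"          # L': a slightly different message

H_L, H_Lp = zlib.crc32(L), zlib.crc32(Lp)
U = H_L ^ H_Lp                         # the "difference" of the two hashes

assert H_Lp ^ U == H_L                 # xor U into H(L'), and L' now "matches" H(L)
print(hex(U))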

We can also crack the hash by finding L' such that:


H(L)=H(L')


but in practice, because U is generally much shorter than L', it's easier to take L' as a given, and solve for U.

Now... how do we know how easy it is to solve for U, in the general case of a long hash? We don't -- at least, not precisely. However, by gathering statistics from similar hashes of tractable length (perhaps 24 bits or less, given the limits of desktop computer power), we can often extrapolate to hashes of similar construction but greater length. Is this foolproof? No. But it's probably the best that we can do, absent more computer power. (And if we had enough computer power to do a complete analysis of the real hash, then, prima facie, the hash would be of insufficient length to be secure.) At least, if the short hash turns out to be weak, then we know that the long hash is in jeopardy. If the short hash turns out to be strong, then we can try as many samples as possible, in order to bolster our faith that the real hash is also strong.

How does all this relate to R? Well, think of U as an N-bit integer, where N is equal to the number of bits in the hash itself. Then R suggests that, for a well-behaved hash, ~63% of the 2^N possible U values should actually be roots of the hash with respect to L vs. L'. The number of actual roots, divided by 2^N, is what I call the "root density", D. If D is not close to R, then we have a weak (or, suboptimally strong) hash. Why?

Here's where your brain has to twist... fasten your seatbelt.

First of all, remember that we can't gather statistics on a hash of "interesting" length because, after all, the length of the hash is meant to hamper all attempts to solve for U. So, we're solving for U on hashes of tractable length, but similar construction, in the hopes that the statistics are extrapolatable to the actual hash in question. (It's easy to solve for U in the above case, for example. Just compute H(L) and H(L'), then xor them together. Depending on the cryptographic context, this process could be more complex, but the concept is the same.)

So just create 2^N versions of L', and count the populations of all associated values of U. What you get is a population histogram of U over all 2^N versions of L'. (The same process allows us to compare how H(L) differs from H'(L), where H and H' differ only by their pseudorandom seeds. Suffice to say, this technique can be applied to a wide variety of hash modifications.)

So the question is, how many roots (unique values of U) are there, compared to how many there could possibly be (2^N)? For example, if we find only 3 unique values of U, which occur over and over again, then the answer is 3/(2^N). If we find all 2^N values, then the answer is 1. I would suggest that we want the answer (D) to be as close to R as possible.
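
Here's a sketch of that counting procedure for a tractable N, with the hash left as a plug-in: toy_hash below is just a truncated CRC-32 stand-in, so swap in a short-width version of whatever hash you actually care about, since D depends entirely on that choice. (The last line also prints the root multiplicity ratio discussed below.)


import math, zlib
from collections import Counter

N = 16                                 # tractable hash width, in bits
MASK = (1 << N) - 1

def toy_hash(msg):
    # Stand-in N-bit hash; replace with a short version of the hash under test.
    return zlib.crc32(msg) & MASK

L = b"the quick brown fox"
H_L = toy_hash(L)

roots = Counter()
for i in range(1 << N):                # 2^N derivatives L' (here: L plus a 2-byte suffix)
    Lp = L + i.to_bytes(2, "big")
    roots[H_L ^ toy_hash(Lp)] += 1     # U = H(L) xor H(L')

D = len(roots) / (1 << N)              # root density: unique roots / possible roots
print("D =", D, "vs R =", 1 - 1 / math.e)
print("root multiplicity ratio =", max(roots.values()) / min(roots.values()))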

To understand why, first consider the extreme cases: 1 unique root, and 2^N unique roots. In the first case, the binary equation trees for deriving U from L and L' (and yes, there are such equations, despite the fact that you couldn't write them down in a lifetime) might have a hidden simplicity, which a smart computer could discover. In the second case, a comparable argument prevails: the set of solutions is "too perfect", such that every possible root is in fact a root. In this case, there might be a boring linear relationship between L, L', and U. In the absence of any other information, then, it's compelling to believe that the binary equations reach maximum complexity when D=R.

The point here is wider than simply creating memory capacity pains for classical computers attempting to subvert the hash. There's a quantum computer issue here, as well: if the roots appear to have been chosen at random, then it stands to reason that a quantum computer will behave as if they are in fact arranged at random, which is to say that they will represent (unstable) excited states for the wave function, eluding the quantum computer's attempts to discover the global minimum, i.e. the particular value of U which legitimizes a particular L'.

But there's one other consideration for the quantum case: root multiplicity. Essentially, the larger the ratio between maximum and (nonzero) minimum root multiplicity (the "root multiplicity ratio"), the greater the odds that a quantum computer will find the root with maximum multiplicity.

What distribution of root multiplicities would we expect if we chose 2^N values of U at random? That's a question related to Poisson distributions and beyond the scope of this blog. (I call it the "Poisson Criterion".) Suffice to say that if my extrapolations from shorter hashes are valid, then LMD4, LMD5, and LMD6 all exhibit root multiplicity ratios close enough to what a Poisson distribution would dictate, as to be unhelpful to the best known attack method, which is a birthday attack.
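
For completeness, here's what the "chosen at random" baseline looks like: if the 2^N values of U were drawn uniformly at random, each root's multiplicity would be approximately Poisson with mean 1 (and note that the multiplicity-0 probability, 1/e, is just R seen from the other side). This sketch compares a uniform-random stand-in against that baseline; in practice you'd feed in the roots counter from the previous sketch instead:


import math, random
from collections import Counter

def poisson_pmf(k, lam=1.0):
    return math.exp(-lam) * lam**k / math.factorial(k)

N = 16
bins = 1 << N
draws = [random.getrandbits(N) for _ in range(bins)]   # uniform stand-in for the U values
mult = Counter(Counter(draws).values())                # multiplicity -> number of roots

empty = bins - sum(mult.values())                      # roots that never occur at all
print(0, round(empty / bins, 4), round(poisson_pmf(0), 4))
for k in range(1, 6):
    print(k, round(mult.get(k, 0) / bins, 4), round(poisson_pmf(k), 4))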

Got it? Good. Now we can talk about secure hashes...
