Thursday, December 29, 2016

the digital library of babel

Jorge Luis Borges introduced the concept of the Library of Babel,  a "vast library containing all possible 410-page books of a certain format and character set." To further quote wikipedia,
Though the order and content of the books is random and apparently completely meaningless, the inhabitants believe that the books contain every possible ordering of just 25 basic characters (22 letters, the period, the comma, and the space). Though the vast majority of the books in this universe are pure gibberish, the library also must contain, somewhere, every coherent book ever written, or that might ever be written, and every possible permutation or slightly erroneous version of every one of those books.
The other day, my company CarGurus had a lunch and learn about the internals of git. I've always been impressed with how quick updates were once you've cloned a repository. In part that's because of how git stores an archive with a compressed version of every version of every file and folder your project has generated, and so chances are in doesn't have to pull down that much fresh data. What's really clever is how it stores them; each is in a physical file named after the SHA-1 hash of the file contents (each physical file sits in a folder named for the first two hex digits of the 40-hex-digit hash, you can see those folders in the .git/objects/ dir of your git project.)

SHA-1 is really amazing, because it's SO amazingly unlikely that two different files will generate the same hash. This page describes it as
Here’s an example to give you an idea of what it would take to get a SHA-1 collision. If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (3.6 million Git objects) and pushing it into one enormous Git repository, it would take roughly 2 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision. A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.
So. Many years ago, an awesome experimental site word.com (now sadly defunct, the domain bought out by Merriam-Webster) ran a subsite called Pixeltime - you can read about it at my tribute page, but the upshot was it was an online graphic editor slash contest with an emcee the Pixel Master whom I've described as "a cross of Mr. Rogers and Max Headroom via Blue Man Group". Each image was 45x45, with a palette of 16 colors. (I made some visual basic hacks that let me essentially upload photos in 5 shades of gray by grabbing the mouse and clicking each pixel)


That one on the right is a little joke - I realized there was a maximum number of images that could be made in that format... at first I badly underestimated how many that is, but it turns out it's 16^2025. (one square could be 16 colors, 2 squares would be 16 * 16, etc) Anyway, most calculators don't even try to figure out what that is, they just call it "infinity".

So here's the thing: that "infinity" is much, much, much bigger than the number of unique SHA-1 hashes. If you were to make a hash for each image, you would certainly get a large number of collisions. In fact, 45x45 is extravagant - by my reckoning you could flood SHA-1 with a simple 16 colors at 10x10, which gives you 2.6 * 10 ^ 120 pictures. (I encourage people to check my math - I've certainly got it wrong before.)

So SHA-1 hashspace is so much bigger than what humanity could conceivably generate, and yet the universe of everything - if you don't put many restrictions on the grammar of the everything you're generating - is so much larger than that.

I don't think our brains can even deal with a million, never mind billions or trillions. (My 6th grade math teacher had a book of a thousand pages of a thousand dots each, with certain amusing values labeled.) Hell, get a dollars worth of pennies, lay em out in an uneven sprawl on a flat surface, and I'll bet you think it looks more like 40 cents.

Or, just watch this:

No comments:

Post a Comment