There are 3 reactions to the title of this talk:
- What the heck’s a probabilistic data structure?
- UFO Sightings… wha?
- 112,092 is an oddly specific number.
This is simply a talk about the first bullet point with the second thrown in just for fun. I like weird stuff (UFOs, Bigfoot, peanut butter and bologna on toast). Maybe you do too? As for the third bullet point, well, that's how many sightings I have.

Now, if you're like most developers, you probably have no idea what probabilistic data structures are. In fact, I did a super-scientific poll on Twitter and found that out of 119 participants, 58% had never heard of them and 22% had heard the word but nothing more. I wonder what percent of that 22% heard the word for the first time in the poll. We're a literal-minded lot at times. Anyhow. That's 4 out of 5 developers or, as I like to call it, the Trident dentist ratio. (It's actually a manifestation of the Pareto principle, but I'm a 70s kid.) That's a lot of folks who need to be educated. So, let's do that.

A probabilistic data structure is, well, they're kind of like the TARDIS (bigger on the inside) and JPEG compression (a bit lossy). And, like both, they are fast, accurate enough, and can take you to interesting places of adventure. That last one might not be something a JPEG does. More technically speaking, most probabilistic data structures use hashes to give you faster and smaller data structures in exchange for precision. If you've got a mountain of data to process, this is super useful.

In this talk, we'll briefly go over some common probabilistic data structures; dive deep into a few (Bloom Filter, MinHash, and Top-K); and show a working application that uses Top-K to analyze the most commonly used words in all 112,092 of my UFO sightings. Once we're done, you'll be ready to start using some of these structures in your own applications. And, if you use the UFO data, maybe you'll discover that the truth really is out there.
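
For a taste of the "hashes in exchange for precision" trade, here is a minimal Bloom filter sketch in Python. It is an illustration only, not the code from the talk: the class name, the bit-array size, the number of hashes, and the salted SHA-256 trick are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter; the default sizes are arbitrary assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per "bit" keeps the sketch readable; a real one would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive several positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not added"; True means "probably added".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for word in ["light", "sky", "triangle"]:
    seen.add(word)

print(seen.might_contain("sky"))      # True: added items are always found
print(seen.might_contain("bigfoot"))  # usually False, but false positives can happen
```

The filter never stores the words themselves, only a fixed number of flags, which is where the space saving comes from. The price is the occasional false positive.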