When it comes to figuring out how similar various pieces of data are from one another (and which is the closest matching one in a large group of candidates), simhashing is one of my favourite algorithms. It’s somewhat simple, brilliant in its approach, but still not obvious enough for most people (myself included) to come up with it on their own. Readers may be familiar with hashing algorithms such as MD5 or SHA, which aim to very quickly create a unique signature (hash) of the data. These functions are built so that identical files or blobs of data share the same hash, so you can rapidly see whether two blobs are identical or not, or if a blob still has the same signature after transmission to see if it was corrupted or not. Then different blobs, even if mostly the same, get an entirely different signature.

While simhashes still aim to have unique signatures for documents, they also attempt to make sure that documents that look the same get very similar hashes. That way, you can look for similar hashes to figure out if the documents are closely related, without needing to compare them bit by bit. It’s a statistical tool to help us find near-duplicates faster.

That’s a bit vague, but that’s alright. I’ll try to explain things in a way that gives a good understanding of things.