April 23, 2019

SimHash Distributed Document Encoder (SHaDDE)

A version of the encoder that can encode documents is now available, using the same approach as the recent SimHash Scalar Encoder.

This encoder also takes a Locality-Sensitive Hashing (LSH) approach to encoding document text data into Sparse Distributed Representations, ready to be fed into a Hierarchical Temporal Memory, like NuPIC by Numenta. The SimHash algorithm does the encoding. LSH and SimHash come from the world of nearest-neighbor document similarity searching.

Document tokens are supplied with optional weighting values. We generate a SHA-3 hash digest for each word token (using SHAKE256 to get a variable-width digest output size). The hashes for a document are combined into a sparse SimHash. Documents with similar text will have similar encodings, and dissimilar documents will have very different encodings. Similarity here is defined as binary distance between strings; there is no linguistic semantic understanding. You’ll want http://cortical.io for that.
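The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the encoder's actual source: the function names (`token_bits`, `simhash_encode`, `hamming`), the `n_bits` and `sparsity` parameters, and especially the way the result is made sparse (keeping only the highest-scoring columns) are assumptions made for the example. Only the use of SHAKE256 for a variable-width digest and the weighted combination of token hashes come from the description above.

```python
import hashlib
from typing import Dict, List


def token_bits(token: str, n_bits: int) -> List[int]:
    """Hash one token with SHAKE256 and unpack the digest into n_bits bits."""
    n_bytes = (n_bits + 7) // 8  # round up to whole bytes
    digest = hashlib.shake_256(token.encode("utf-8")).digest(n_bytes)
    bits = []
    for byte in digest:
        for shift in range(8):
            bits.append((byte >> shift) & 1)
    return bits[:n_bits]


def simhash_encode(tokens: Dict[str, float], n_bits: int = 400,
                   sparsity: float = 0.33) -> List[int]:
    """Combine weighted token hashes into one sparse binary vector.

    Assumption: sparsity is reached by activating only the top-scoring
    columns; the real encoder may do this differently.
    """
    sums = [0.0] * n_bits
    for token, weight in tokens.items():
        for i, bit in enumerate(token_bits(token, n_bits)):
            # Standard SimHash step: a 1 bit adds the weight, a 0 bit subtracts it.
            sums[i] += weight if bit else -weight

    n_active = int(n_bits * sparsity)
    top = sorted(range(n_bits), key=lambda i: sums[i], reverse=True)[:n_active]
    encoding = [0] * n_bits
    for i in top:
        encoding[i] = 1
    return encoding


def hamming(a: List[int], b: List[int]) -> int:
    """Binary distance: the number of positions where two encodings differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical documents, given as token -> weight mappings.
doc_a = {"the": 1.0, "quick": 2.0, "brown": 2.0, "fox": 3.0, "jumps": 2.0}
doc_b = {"the": 1.0, "quick": 2.0, "brown": 2.0, "fox": 3.0, "leaps": 2.0}
doc_c = {"stock": 3.0, "markets": 3.0, "fell": 2.0, "sharply": 2.0, "today": 1.0}

enc_a, enc_b, enc_c = (simhash_encode(d) for d in (doc_a, doc_b, doc_c))
print("a vs b:", hamming(enc_a, enc_b))
print("a vs c:", hamming(enc_a, enc_c))
```

Documents that share most of their tokens push the column sums in the same direction, so their encodings tend to differ in fewer positions than those of documents with no tokens in common. Nothing here knows that "jumps" and "leaps" mean similar things; the two words simply hash to unrelated bit patterns.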

How It Works

Source Code

Next Steps

Learn More