www.luxrota.com

April 18, 2019

SimHash Distributed Scalar Encoder (SHaDSE)

A Locality-Sensitive Hashing approach towards encoding semantic data into Sparse Distributed Representations, ready to be fed into an Hierarchical Temporal Memory, like NuPIC by Numenta. This uses the SimHash algorithm to accomplish this. LSH and SimHash come from the world of nearest-neighbor document similarity searching.

This encoder is sibling with the original Scalar Encoder, and the Random Distributed Scalar Encoder (RDSE). The static bucketing strategy here is generally lifted straight from the RDSE, although the “contents” and representations are created differently.

Instead of creating a random hash for our target bucket, we first generate a SHA-3+SHAKE256 hash of the bucket index (the SHAKE extension provides a variable-length hash (n)). Using that hash, and hash values from nearby neighbor buckets (within bucketRadius), we then create a weighted SimHash for our target bucket index. This SimHash output will represent both individual bucket value, and represent the relations between nearby neighbor values in bucket space. A stable number of On bits (w) is achieved during final collapsing step of SimHashing.

Scalar Encoder Feature Comparison

  Original
Scalar
Random
Distributed
SimHash
Distributed
Collisions
Allowed
No Yes Yes
Semantic
Continuity
Physical
Adjacency
Custom
Random Map
SimHashing
Periodic
Wrapping
Yes No Yes1
Lookup
Tables
No Yes No
Topology
Supported
Yes No No
Required
Params
minval
maxval
resolution resolution
Other
Params
radius
resolution
periodic
  bucketRadius
periodic

   1Encodings for wrapped edge buckets will adapt with bucket growth

Scalar Encoder Performance Comparison

Tests were run against the simple model presented in the NuPIC Docs Algo Tutorial. The usual “HotGym” rec-center-hourly.csv data was used (first 3000 rows, as per tutorial code). The body of that article contains all settings used for the Spatial Pooler, Temporal Memory, and SDR Classifier. Any changes to Encoder settings are noted below.

  Original
Scalar
Random
Distributed
SimHash
Distributed
Settings minval=0
maxval=100
resolution=0.4 resolution=0.25
MAPE 0.327 0.300 0.294
MAE 5.599 5.438 5.954
RMSE 9.257 8.838 10.296
NLL 1k 3.396 2.626 3.368
Time 45.32s 46.06s 65.71s1
Distance2 2 (stable) 2 (stable) 4-10 (variable)

   1Speed can easily be brought up with internal lookup tables
   2Average Hamming distance between two adjacent buckets

How It Works

Step 1 - Input some scalar values:
Time Input
0 27.7
1 10.2
2 16.3
3 16.4
4 7.6
5 22.9
Step 2 - Map inputs to buckets:

Map input values to bucket indexes, using the formula from the RDSE.

  bucketIndex = (
    (bucketsWidth / 2) + int(round(
      (input - bucketOffset) / resolution)
    )
  )
Input Bucket
27.7 5
10.2 2
16.3 3
16.4 3
7.6 1
22.9 4
Step 3 - Hash bucket index

Hash bucket index value of a target bucket (3), and neighbors (bucketRadius=2).

Bucket Hash
1 00110101
2 11011010
3 01110111
4 00011011
5 10101100
Step 4 - Convert binary 0’s in hashes to integer -1:
Bucket Hash Columns
1 -1 -1 +1 +1 -1 +1 -1 +1
2 +1 +1 -1 +1 +1 -1 +1 -1
3 -1 +1 +1 +1 -1 +1 +1 +1
4 -1 -1 -1 +1 +1 -1 +1 +1
5 +1 -1 +1 -1 +1 +1 -1 -1
Step 5 - Weight bucket hashes:

Weight bucket hashes. The target bucket (3) in center of bucketRadius is heaviest.

Bucket Weight Hash Columns
1 1 -1 -1 +1 +1 -1 +1 -1 +1
2 2 +2 +2 -2 +2 +2 -2 +2 -2
3 3 -3 +3 +3 +3 -3 +3 +3 +3
4 2 -2 -2 -2 +2 +2 -2 +2 +2
5 1 +1 -1 +1 -1 +1 +1 -1 -1
Step 6 - Sum weighted binary columns:
Bucket Hash Column Summations
3 -3 1 1 7 1 1 5 3
Step 7 - Collapse integer sums back to binary:

Collapse the sums back to binary for a final SimHash value for our target bucket (3).

A regular SimHash will change all sums >= 0 to a binary 1, while all sums < 0 are changed to a binary 0. This SimHash will usually result in about 50% sparsity.

We can further sparsify a SimHash for our needs by collapsing down based on a certain number of the highest sums (here, w=3):

Bucket Sparse SimHash
3 0 0 0 1 0 0 1 1

This SimHash encoded representation will identify both our specific target bucket, and the relation of our target bucket to it’s near neighbors (descending outward with bucketRadius).

Source Code

Next Steps

Learn More