April 18, 2019
A LocalitySensitive Hashing approach towards encoding semantic data into Sparse Distributed Representations, ready to be fed into an Hierarchical Temporal Memory, like NuPIC by Numenta. This uses the SimHash algorithm to accomplish this. LSH and SimHash come from the world of nearestneighbor document similarity searching.
This encoder is sibling with the original Scalar Encoder, and the Random Distributed Scalar Encoder (RDSE). The static bucketing strategy here is generally lifted straight from the RDSE, although the “contents” and representations are created differently.
Instead of creating a random hash for our target bucket, we first
generate a SHA3+SHAKE256 hash of the bucket index (the
SHAKE extension provides a variablelength hash (n
)). Using that
hash, and hash values from nearby neighbor buckets (within
bucketRadius
), we then create a weighted SimHash for our target
bucket index. This SimHash output will represent both individual bucket
value, and represent the relations between nearby neighbor values in
bucket space. A stable number of On bits (w
) is achieved during final
collapsing step of SimHashing.
Original Scalar 
Random Distributed 
SimHash Distributed 


Collisions Allowed 
No  Yes  Yes 
Semantic Continuity 
Physical Adjacency 
Custom Random Map 
SimHashing 
Periodic Wrapping 
Yes  No  Yes^{1} 
Lookup Tables 
No  Yes  No 
Topology Supported 
Yes  No  No 
Required Params 
minval maxval 
resolution  resolution 
Other Params 
radius resolution periodic 
bucketRadius periodic 
^{1}Encodings for wrapped edge buckets will adapt with bucket growth
Tests were run against the simple model presented in the
NuPIC Docs Algo Tutorial. The usual “HotGym”
reccenterhourly.csv
data was used (first 3000 rows, as
per tutorial code). The body of that article contains all settings
used for the Spatial Pooler, Temporal Memory, and SDR Classifier.
Any changes to Encoder settings are noted below.
Original Scalar 
Random Distributed 
SimHash Distributed 


Settings  minval=0 maxval=100 
resolution=0.4  resolution=0.25 
MAPE  0.327  0.300  0.294 
MAE  5.599  5.438  5.954 
RMSE  9.257  8.838  10.296 
NLL 1k  3.396  2.626  3.368 
Time  45.32s  46.06s  65.71s^{1} 
Distance^{2}  2 (stable)  2 (stable)  410 (variable) 
^{1}Speed can easily be brought up with
internal lookup tables
^{2}Average Hamming distance between two
adjacent buckets
Time  Input 

0  27.7 
1  10.2 
2  16.3 
3  16.4 
4  7.6 
5  22.9 
Map input values to bucket indexes, using the formula from the RDSE.
bucketIndex = (
(bucketsWidth / 2) + int(round(
(input  bucketOffset) / resolution)
)
)
Input  Bucket 

27.7  5 
10.2  2 
16.3  3 
16.4  3 
7.6  1 
22.9  4 
Hash bucket index value of a target bucket (3), and
neighbors (bucketRadius
=2).
Bucket  Hash 

1  00110101 
2  11011010 
3  01110111 
4  00011011 
5  10101100 
Bucket  Hash Columns 

1  1 1 +1 +1 1 +1 1 +1 
2  +1 +1 1 +1 +1 1 +1 1 
3  1 +1 +1 +1 1 +1 +1 +1 
4  1 1 1 +1 +1 1 +1 +1 
5  +1 1 +1 1 +1 +1 1 1 
Weight bucket hashes. The target bucket (3) in center of
bucketRadius
is heaviest.
Bucket  Weight  Hash Columns 

1  1  1 1 +1 +1 1 +1 1 +1 
2  2  +2 +2 2 +2 +2 2 +2 2 
3  3  3 +3 +3 +3 3 +3 +3 +3 
4  2  2 2 2 +2 +2 2 +2 +2 
5  1  +1 1 +1 1 +1 +1 1 1 
Bucket  Hash Column Summations 

3  3 1 1 7 1 1 5 3 
Collapse the sums back to binary for a final SimHash value for our target bucket (3).
A regular SimHash will change all sums >= 0
to a binary 1
, while
all sums < 0
are changed to a binary 0
. This SimHash will usually
result in about 50% sparsity.
We can further sparsify a SimHash for our needs by collapsing down
based on a certain number of the highest sums (here, w
=3):
Bucket  Sparse SimHash 

3  0 0 0 1 0 0 1 1 
This SimHash encoded representation will identify both our specific
target bucket, and the relation of our target bucket to it’s
near neighbors (descending outward with bucketRadius
).