Locality Sensitive Hashing - Java

Introduction

This library implements various locality sensitive hashing schemes, many of which are described in the book Mining of Massive Datasets [mmd]. The more similar the documents being hashed, the more likely their hash values are to be the same.

One use case is efficiently finding the top-k most similar documents within a huge collection.

The project is located at https://gitlab.com/bmahe/locality-sensitive-hashing-java and is licensed under the Apache License v2.0.

Quickstart

Dependencies

Let’s add the repository:

<repositories>
    <repository>
        <id>gitlab-maven</id>
        <url>https://gitlab.com/api/v4/projects/12813983/packages/maven</url>
    </repository>
</repositories>

Then the dependency:

<dependency>
    <groupId>net.bmahe.utils</groupId>
    <artifactId>core</artifactId>
    <version>1.4</version>
</dependency>

Algorithms

LSH - MinHash

Introduction

LSH MinHash enables finding the top-k most similar documents based on the Jaccard similarity metric. The flow goes as follows:

(Flow diagram: document → MinHash signature → split into bands → one bucket per band)

We observe three main parameters:

  • Hash size: how long the vector generated by MinHash will be

  • Number of bands: how many parts the MinHash vector is split into

  • Number of buckets: how many buckets each band is hashed into

The combination of these parameters shapes the likelihood that two documents generate the same final hash, as a function of their Jaccard similarity.
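
As a rough guide for choosing these parameters, the standard banding analysis from [mmd] gives the probability that two documents with Jaccard similarity s share a bucket in at least one band. The sketch below is plain Java and not part of this library's API; it assumes the hash is split evenly, i.e. hash size = number of bands × rows per band:

// Illustrative only, not part of the library API.
// Probability that two documents with Jaccard similarity `s` land in the same bucket
// in at least one band, given `numBands` bands of `rowsPerBand` signature entries each.
static double collisionProbability(final double s, final int numBands, final int rowsPerBand) {
    final double pSameBand = Math.pow(s, rowsPerBand); // every row of one band agrees
    return 1.0d - Math.pow(1.0d - pSameBand, numBands); // at least one band agrees
}

// e.g. collisionProbability(0.8, 5, 4) is roughly 0.93, while collisionProbability(0.3, 5, 4) is roughly 0.04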

Note on MinHash: the Jaccard similarity of two hashed vectors approximates the Jaccard similarity of the original documents
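
As an illustration of that note, the fraction of positions at which two MinHash signatures agree estimates the Jaccard similarity of the underlying documents. A minimal, library-independent sketch:

// Illustrative only: estimate the Jaccard similarity of two documents from their
// MinHash signatures. Both signatures must have the same length and have been
// generated with the same hash functions.
static double estimateJaccard(final int[] signatureA, final int[] signatureB) {
    int matches = 0;
    for (int i = 0; i < signatureA.length; i++) {
        if (signatureA[i] == signatureB[i]) {
            matches++;
        }
    }
    return (double) matches / signatureA.length;
}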

How to use

Let’s import the required classes:

import java.util.Random;
import java.util.Set;
import net.bmahe.lsh.core.LSHMinHashFactory;
import net.bmahe.lsh.core.UnorderedLSH;

We can then build an instance of UnorderedLSH based on the MinHash algorithm, which requires a few parameters:

final int numBands = 5; (1)
final int numBuckets = 10; (2)
final Random seed = new Random(); (3)

final UnorderedLSH lshMinHash = LSHMinHashFactory.build(numBands, numBuckets, seed);
1 Number of bands
2 Number of buckets for each band
3 Instance of java.util.Random, or a long to be used as a seed, for generating the random hash functions. Providing the seed is required in order to reproduce hashes across executions

We can now hash a document provided as an int[]:

final int[] docArray = { 1, 150, 200, 6 };
final int[] hashDocArray = lshMinHash.hash(docArray);

We could also hash a document provided as a Set<Integer>:

final Set<Integer> docSet = Set.of(300, 6, 200, 150);
final int[] hashDocSet = lshMinHash.hash(docSet);
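
To turn these per-document hashes into candidate near-duplicates, a common pattern is to index documents by their per-band buckets and treat documents sharing a bucket in any band as candidates to verify with the exact Jaccard similarity. The sketch below is an assumption about usage, not documented library behavior; in particular it assumes each element of the returned int[] is the bucket assigned to one band:

// Requires java.util.{ArrayList, HashMap, List, Map}.
final Map<String, List<Integer>> buckets = new HashMap<>();
final int[][] documents = { { 1, 150, 200, 6 }, { 1, 150, 200, 7 }, { 300, 400, 500, 600 } };

for (int docId = 0; docId < documents.length; docId++) {
    final int[] hash = lshMinHash.hash(documents[docId]);
    for (int band = 0; band < hash.length; band++) {
        // Assumption: hash[band] is the bucket id for band `band`
        final String key = band + ":" + hash[band];
        buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(docId);
    }
}
// Any bucket holding more than one docId yields candidate pairs to check with the exact metric.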

LSH - Random Hyperplanes

Introduction

LSH with Random Hyperplanes enables finding the top-k most similar documents based on the cosine similarity metric. The flow goes as follows:

(Flow diagram: document vector → random hyperplane signature → split into bands → one bucket per band)

We observe three main parameters:

  • Hash size: how long the vector generated by the random hyperplanes will be. Note: this is the number of hyperplanes used

  • Number of bands: how many parts the output vector of the random hyperplanes is split into

  • Number of buckets: how many buckets each band is hashed into

The combination of these parameters shapes the likelihood that two documents generate the same final hash, as a function of their cosine similarity.
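
For intuition about these parameters, a single random hyperplane assigns the same bit to two vectors with probability 1 - θ/π, where θ is the angle between them; banding then amplifies this probability exactly as in the MinHash case. A small illustrative sketch, not part of the library API:

// Illustrative only: probability that two vectors at angle `theta` (in radians) share a
// bucket in at least one band, with `numBands` bands of `rowsPerBand` hyperplane bits each.
static double hyperplaneCollisionProbability(final double theta, final int numBands, final int rowsPerBand) {
    final double pSameBit = 1.0d - theta / Math.PI; // a single hyperplane agrees
    final double pSameBand = Math.pow(pSameBit, rowsPerBand); // a whole band agrees
    return 1.0d - Math.pow(1.0d - pSameBand, numBands); // at least one band agrees
}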

Notes on Random hyperplanes:

  • The fraction of matching entries of two hashed vectors approximates the cosine similarity of the original documents. The more they match, the more the original vectors point in the same direction (see the sketch after this list).

  • Contrary to the MinHash case, the input vector size is fixed and the order of its elements does matter, as we are computing the cosine similarity of two vectors.
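
As a library-independent illustration of the first note, the angle between the original vectors can be estimated from the fraction of matching signature entries, and the cosine similarity follows from it:

// Illustrative only: estimate the cosine similarity of two documents from their
// random hyperplane signatures (same length, same hyperplanes).
static double estimateCosine(final int[] signatureA, final int[] signatureB) {
    int matches = 0;
    for (int i = 0; i < signatureA.length; i++) {
        if (signatureA[i] == signatureB[i]) {
            matches++;
        }
    }
    // P(matching bit) = 1 - theta / pi, hence theta = pi * (1 - fraction of matches)
    final double theta = Math.PI * (1.0d - (double) matches / signatureA.length);
    return Math.cos(theta);
}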

How to use

Let’s import the required classes:

import java.util.Random;
import net.bmahe.lsh.core.LSH;
import net.bmahe.lsh.core.LSHRandomHyperPlanesHashFactory;

We can then build an instance of LSH based on the Random Hyperplanes algorithm, which requires a few parameters:

final int inputSize = 4; (1)
final int hashSize = 3; (2)
final int numBands = 5; (3)
final int numBuckets = 10; (4)
final Random seed = new Random(); (5)

final LSH lshRandomHyperPlanes = LSHRandomHyperPlanesHashFactory.build(inputSize, hashSize, numBands, numBuckets,
                seed);
1 Vector size of input documents
2 Hash size generated by the Random Hyperplanes hashing
3 Number of bands
4 Number of buckets for each band
5 Instance of java.util.Random, or a long to be used as a seed, for generating the random hash functions. Providing the seed is required in order to reproduce hashes across executions

We can now hash a document provided as an int[]:

final int[] docArray = { 1, 150, 200, 6 };
final int[] hashDocArray = lshRandomHyperPlanes.hash(docArray);

We could also hash a document provided as a double[]:

final double[] docDoubleArray = { 1.0d, 150.0d, 200.0d, 6.0d };
final int[] hashDoubleDocArray = lshRandomHyperPlanes.hash(docDoubleArray);
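
As with MinHash, the resulting arrays can then be used to spot candidate pairs: two documents are worth comparing exactly if they fall into the same bucket in at least one band. A hedged sketch, assuming each element of the hash array is the bucket id for one band:

// Illustrative only: returns true if the two documents share a bucket in at least one band,
// assuming each array element is the bucket id for one band.
static boolean isCandidatePair(final int[] hashA, final int[] hashB) {
    for (int band = 0; band < hashA.length; band++) {
        if (hashA[band] == hashB[band]) {
            return true;
        }
    }
    return false;
}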

References

  • [mmd] Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets. http://www.mmds.org/