Questions tagged [lsh]

Locality-sensitive hashing

Locality-sensitive hashing reduces the dimensionality of high-dimensional data. LSH hashes input items so that similar items map to the same “buckets” with high probability (the number of buckets being much smaller than the universe of possible input items). LSH differs from conventional and cryptographic hash functions because it aims to maximize the probability of a “collision” for similar items. Locality-sensitive hashing has much in common with data clustering and nearest neighbor search.
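The bucketing idea described above can be sketched with random-hyperplane LSH for cosine similarity (a minimal, illustrative Python sketch; the vectors and the number of hyperplanes are made-up example values):

```python
import random

def signature(vec, planes):
    # One bit per hyperplane: the sign of the dot product with that plane.
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0) for plane in planes)

random.seed(0)
dim, n_planes = 5, 64
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

a = [1.0, 0.9, 0.0, 0.2, 0.1]
b = [1.0, 1.0, 0.0, 0.1, 0.1]    # nearly parallel to a
c = [-1.0, 0.0, 1.0, -0.5, 0.9]  # points away from a

def agree(s, t):
    return sum(x == y for x, y in zip(s, t))

# Similar vectors agree on far more signature bits, so grouping items by
# chunks of their signatures puts them in the same bucket far more often.
print(agree(signature(a, planes), signature(b, planes)))  # high agreement
print(agree(signature(a, planes), signature(c, planes)))  # low agreement
```

The probability that a single hyperplane separates two vectors grows with the angle between them, which is what makes the collision probability similarity-sensitive.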

48 questions
7 votes, 0 answers

How to reduce the shuffle write caused by approxSimilarityJoin in spark?

I am using approxSimilarityJoin to find Jaccard similarity between two sets. val dfA = hashtoseq.toDF("id","values") //values is a set of string val hashingTF = new…
Rajjat Dadwal
6 votes, 0 answers

Apply LSH approxNearestNeighbors on all rows of a dataframe

I am trying to apply BucketedRandomProjectionLSH's function model.approxNearestNeighbors(df, key, n) on all the rows of a dataframe in order to approx-find the top n most similar items for every item. My dataframe has 1 million rows. My problem is…
confused_pandas
6 votes, 1 answer

Strange performance issue Spark LSH MinHash approxSimilarityJoin

I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, but I'm seeing some strange behaviour. After the (inner) join the dataset is a bit skewed; however, every time one or more tasks take an inordinate amount of time to…
Tom Lous
5 votes, 1 answer

Pyspark LSH Followed by Cosine Similarity

I have many users and each user has an associated vector. I would like to compute the cosine similarity between each user. This is prohibitive based on the size. It seems LSH is a good approximation step, which I understand will create buckets where…
B_Miner
2 votes, 0 answers

How to select in SQL by LSH distance?

Say, I have a table in Postgres (or MySQL), one of the columns called the_lsh holds a 70 char LSH in hex. And now I'd like to select. I have my 70 char LSH. Can I select all rows where the distance from each the_lsh to my LSH is no farther than 11…
shal
2 votes, 0 answers

How to perform text similarity using BERT on a 10M+ corpus? Using LSH / ANNOY / faiss or sklearn?

My idea is to extract the CLS token for all the text in the DB and save it in CSV or somewhere else. So when a new text comes in, instead of using the cosine similarity/Jaccard/Manhattan/Euclidean or other distances, I have to use some approximation…
Deshwal
2 votes, 1 answer

Is there a way to have fast boolean operations on scipy.sparse matrices?

I have to solve an XOR operation on very high dimensional (~30'000) vectors to compute the Hamming distance. For example, I need to compute the XOR operation between one vector full of False with 16 sparsely located True with each row of a…
2 votes, 1 answer

datasketch: MinHash LSH Forest

I'm trying to create a forest for nearest neighbor searching but I'm not sure I'm doing it right or even if MinHash / LSH is an appropriate choice for my data. I ask this because the results are not usable. I'm trying to follow the example in the…
beginner_
1 vote, 0 answers

LSH and minhasing - Why does hashing the signature matrix make sense?

I'm learning about LSH and minhashing and I'm trying to understand the rationale of hashing the signature matrix: We divide the signature matrix into bands and we hash (using which hash function?) every portion of a column to k buckets. Why would it make…
Elimination
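The banding step this question asks about can be sketched in plain Python (a minimal sketch; the document ids and MinHash values are made up, and Python's built-in `hash` stands in for the band hash function): the signature matrix is split column-wise into b bands of r rows, each band tuple is hashed to a bucket, and two columns become a candidate pair if any band collides.

```python
from collections import defaultdict

def band_buckets(signatures, bands, rows):
    """signatures: {doc_id: tuple of minhash values}, length == bands * rows."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            # Identical band tuples hash identically, so near-duplicate
            # columns collide in at least one band with high probability.
            buckets[(b, hash(band))].add(doc_id)
    return buckets

sigs = {
    "doc1": (3, 1, 4, 1, 5, 9),
    "doc2": (3, 1, 4, 1, 2, 6),  # shares two bands with doc1
    "doc3": (2, 7, 1, 8, 2, 8),
}
candidates = {frozenset(ids)
              for ids in band_buckets(sigs, bands=3, rows=2).values()
              if len(ids) > 1}
print(candidates)  # doc1 and doc2 are the only candidate pair
```

Banding makes sense because it turns per-row agreement probability s into a candidate probability of 1 - (1 - s^r)^b, an S-curve that sharply separates similar pairs from dissimilar ones.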
1 vote, 0 answers

Give a hex color to a string so that a string similar to another string will get a similar color

I have a list of multiple strings, I want to get a hex color (like hash) for each string so that similar string will get similar colors. Example: "read register 1" -> #242fff "read register 2" -> #242fa5 "print hello" -> #990000 "read reg 2" ->…
Hanna
1 vote, 0 answers

How to work with BucketedRandomProjectionLSH

I have two datasets dfA (5M) and dfB (6K). I train the LSH on spark 2.2: val brp = new BucketedRandomProjectionLSH() .setBucketLength(2.0) .setNumHashTables(3) .setInputCol("features") .setOutputCol("hashes") val brp_model =…
belz
1 vote, 1 answer

Spark LSH pipeline, performance issues when increasing text length

Starting from this example, I used a Locality-Sensitive Hashing (LSH) on Pyspark in order to find duplicated documents. Some notes about my DB: I have 4M text files. Each file on average has 20K chars. Currently, I’m considering only the first…
Ire00
1 vote, 2 answers

Number of pairs when calculating Jaccard distance using PySpark is less than it should be

I am trying to calculate Jaccard distance between certain ids with their attributes in the form of SparseVectors. from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col from…
secretive
1 vote, 0 answers

How to find nearby duplicates using LSH in a dataframe?

I have a few columns like Filename, size of the file, and the date and I want to find the nearby duplicates by considering all parameters. Id Name Size Date 1 lib_mysqludf_sys.html 8934 …
sheel
1 vote, 1 answer

Why does the textreuse package in R make LSH buckets way larger than the original minhashes?

As far as I understand one of the main functions of the LSH method is data reduction even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates.…
retrography