Questions tagged [lsh]

Locality-sensitive hashing

Locality-sensitive hashing reduces the dimensionality of high-dimensional data. LSH hashes input items so that similar items map to the same “buckets” with high probability (the number of buckets being much smaller than the universe of possible input items). LSH differs from conventional and cryptographic hash functions because it aims to maximize the probability of a “collision” for similar items. Locality-sensitive hashing has much in common with data clustering and nearest neighbor search.
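The bucketing idea described above can be sketched with random-hyperplane LSH for cosine similarity (a minimal, illustrative Python sketch; the vectors and the number of hyperplanes are made-up example values):

```python
import random

def signature(vec, planes):
    # One bit per hyperplane: the sign of the dot product with that plane.
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0) for plane in planes)

random.seed(0)
dim, n_planes = 5, 64
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

a = [1.0, 0.9, 0.0, 0.2, 0.1]
b = [1.0, 1.0, 0.0, 0.1, 0.1]    # nearly parallel to a
c = [-1.0, 0.0, 1.0, -0.5, 0.9]  # points away from a

def agree(s, t):
    return sum(x == y for x, y in zip(s, t))

# Similar vectors agree on far more signature bits, so grouping items by
# chunks of their signatures puts them in the same bucket far more often.
print(agree(signature(a, planes), signature(b, planes)))  # high agreement
print(agree(signature(a, planes), signature(c, planes)))  # low agreement
```

The probability that a single hyperplane separates two vectors grows with the angle between them, which is what makes the collision probability similarity-sensitive.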

48 questions
7 votes, 0 answers

How to reduce the shuffle write caused by approxSimilarityJoin in spark?

I am using approxSimilarityJoin to find Jaccard similarity between two sets. val dfA = hashtoseq.toDF("id","values") //values is a set of string val hashingTF = new…
Rajjat Dadwal
6 votes, 0 answers

Apply LSH approxNearestNeighbors on all rows of a dataframe

I am trying to apply BucketedRandomProjectionLSH's function model.approxNearestNeighbors(df, key, n) on all the rows of a dataframe in order to approx-find the top n most similar items for every item. My dataframe has 1 million rows. My problem is…
confused_pandas
6 votes, 1 answer

Strange performance issue Spark LSH MinHash approxSimilarityJoin

I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, but I'm seeing some strange behaviour. After the (inner) join the dataset is a bit skewed; however, every time one or more tasks take an inordinate amount of time to…
Tom Lous
5 votes, 1 answer

Pyspark LSH Followed by Cosine Similarity

I have many users and each user has an associated vector. I would like to compute the cosine similarity between each user. This is prohibitive based on the size. It seems LSH is a good approximation step, which I understand will create buckets where…
B_Miner
2 votes, 0 answers

How to select in SQL by LSH distance?

Say, I have a table in Postgres (or MySQL), one of the columns called the_lsh holds a 70 char LSH in hex. And now I'd like to select. I have my 70 char LSH. Can I select all rows where the distance from each the_lsh to my LSH is no farther than 11…
shal
2 votes, 0 answers

How to perform text similarity using BERT on a 10M+ corpus? Using LSH / ANNOY / faiss or sklearn?

My idea is to extract the CLS token for all the text in the DB and save it in CSV or somewhere else. So when a new text comes in, instead of using the cosine similarity/Jaccard/Manhattan/Euclidean or other distances, I have to use some approximation…
Deshwal
2 votes, 1 answer

Is there a way to have fast boolean operations on scipy.sparse matrices?

I have to solve an XOR operation on very high dimensional (~30'000) vectors to compute the Hamming distance. For example, I need to compute the XOR operation between one vector full of False with 16 sparsely located True with each row of a…
2 votes, 1 answer

datasketch: MinHash LSH Forest

I'm trying to create a forest for nearest neighbor searching but I'm not sure I'm doing it right or even if MinHash / LSH is an appropriate choice for my data. I ask this because the results are not usable. I'm trying to follow the example in the…
beginner_
1 vote, 0 answers

LSH and minhasing - Why does hashing the signature matrix make sense?

I'm learning about LSH and minhashing and I'm trying to understand the rationale of hashing the signature matrix: We divide the signature matrix into bands and we hash (using which hash function?) every portion of a column to k buckets. Why would it make…
Elimination
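The banding step this question asks about can be sketched in plain Python (a minimal sketch; the document ids and MinHash values are made up, and Python's built-in `hash` stands in for the band hash function): the signature matrix is split column-wise into b bands of r rows, each band tuple is hashed to a bucket, and two columns become a candidate pair if any band collides.

```python
from collections import defaultdict

def band_buckets(signatures, bands, rows):
    """signatures: {doc_id: tuple of minhash values}, length == bands * rows."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            # Identical band tuples hash identically, so near-duplicate
            # columns collide in at least one band with high probability.
            buckets[(b, hash(band))].add(doc_id)
    return buckets

sigs = {
    "doc1": (3, 1, 4, 1, 5, 9),
    "doc2": (3, 1, 4, 1, 2, 6),  # shares two bands with doc1
    "doc3": (2, 7, 1, 8, 2, 8),
}
candidates = {frozenset(ids)
              for ids in band_buckets(sigs, bands=3, rows=2).values()
              if len(ids) > 1}
print(candidates)  # doc1 and doc2 are the only candidate pair
```

Banding makes sense because it turns per-row agreement probability s into a candidate probability of 1 - (1 - s^r)^b, an S-curve that sharply separates similar pairs from dissimilar ones.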
1 vote, 0 answers

Give a hex color to a string so that a string similar to another string will get a similar color

I have a list of multiple strings, I want to get a hex color (like hash) for each string so that similar string will get similar colors. Example: "read register 1" -> #242fff "read register 2" -> #242fa5 "print hello" -> #990000 "read reg 2" ->…
Hanna
1 vote, 0 answers

How to work with BucketedRandomProjectionLSH

I have two datasets dfA (5M) and dfB (6K). I train the LSH on spark 2.2: val brp = new BucketedRandomProjectionLSH() .setBucketLength(2.0) .setNumHashTables(3) .setInputCol("features") .setOutputCol("hashes") val brp_model =…
belz
1 vote, 1 answer

Spark LSH pipeline, performance issues when increasing text length

Starting from this example, I used a Locality-Sensitive Hashing (LSH) on Pyspark in order to find duplicated documents. Some notes about my DB: I have 4M text files. Each file on average has 20K chars. Currently, I’m considering only the first…
Ire00
1 vote, 2 answers

Number of pairs when calculating Jaccard distance using PySpark is less than it should be

I am trying to calculate Jaccard distance between certain ids with their attributes in the form of SparseVectors. from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col from…
secretive
1 vote, 0 answers

How to find nearby duplicates using LSH in a dataframe?

I have a few columns like Filename, size of the file, and the date and I want to find the nearby duplicates by considering all parameters. Id Name Size Date 1 lib_mysqludf_sys.html 8934 …
sheel
1 vote, 1 answer

Why does the textreuse package in R make LSH buckets way larger than the original minhashes?

As far as I understand one of the main functions of the LSH method is data reduction even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates.…
retrography