Compute similarity in pyspark

Asked Jul 05 '22 at 07:22

Active Jul 05 '22 at 07:22

Viewed 158 times

I have a CSV file containing some data, and I want to select the rows that are similar to a given input. My data looks like this:

H1      | H2      | H3
--------+---------+----------
A       | 1       | 7
B       | 5       | 3
C       | 7       | 2

The data point for which I want to find similar rows in my CSV is, for example, [6, 8].

Specifically, I want to find the rows whose H2 and H3 values are similar to the input, and return the H1 value of those rows.

I want to use PySpark with a similarity measure such as Euclidean distance, Manhattan distance, or cosine similarity, or with a machine learning algorithm.
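
To make the goal concrete, here is a minimal sketch of the kind of computation I have in mind. It is not a tested solution. It assumes the file is a plain comma-separated CSV with a header row `H1,H2,H3` and numeric H2 and H3; the function name `rank_by_similarity` and the score column names are made up for illustration. The three metrics are plain Python functions applied through `F.udf`, which is simple but probably slower than native column expressions on large data.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; 0 means the points are identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return float(sum(abs(x - y) for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between the vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(spark, csv_path, query):
    """Hypothetical helper: H1 plus the three scores, smallest Euclidean distance first."""
    # Imported here so the metric functions above work without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Assumption: comma-separated file with header row H1,H2,H3.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    features = F.array(F.col("H2").cast("double"), F.col("H3").cast("double"))

    def as_udf(metric):
        # Wrap a plain Python metric so Spark applies it to each row's [H2, H3].
        return F.udf(lambda row: float(metric(row, query)), DoubleType())

    return (
        df.withColumn("euclidean", as_udf(euclidean)(features))
        .withColumn("manhattan", as_udf(manhattan)(features))
        .withColumn("cosine", as_udf(cosine_similarity)(features))
        .select("H1", "euclidean", "manhattan", "cosine")
        .orderBy("euclidean")
    )
```

With an existing `SparkSession` this would be called as `rank_by_similarity(spark, "data.csv", [6.0, 8.0]).show()`, where `data.csv` is a placeholder path. For the sample rows and the input [6, 8], A and B tie on Euclidean distance (both sqrt(26), about 5.10), while B has the highest cosine similarity (about 0.926), so the choice of measure changes which row counts as "most similar".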

Tavakoli

could you please explain a bit more? what outcome are you looking for with `[6, 8]`? – samkart Jul 05 '22 at 09:38
I want to calculate the cosine similarity between the CSV rows and my input. For the example above, the result would be 3 numbers that show the similarity. – Tavakoli Jul 05 '22 at 09:43
have a look at [this Q](https://stackoverflow.com/q/46725290/8279585) – samkart Jul 05 '22 at 09:55
@samkart thanks a lot. I think that link will help me, but I have a problem with [BucketedRandomProjectionLSH](https://stackoverflow.com/questions/72897923/what-do-fit-in-bucketedrandomprojectionlsh-in-spark) – Tavakoli Jul 08 '22 at 07:13

0 Answers