
I'm getting started with the Hadoop ecosystem and have a few questions I need help with.

I have two HDFS files and need to compute the Levenshtein distance between a group of columns from the first file and another group from the second.

This process will run every day on a considerable amount of data (150M rows in the first file vs. 11M rows in the second).

I would appreciate some guidance (code examples, references, etc.) on how to read the two files from HDFS, compute the Levenshtein distance as described (using Spark?), and save the results to a third HDFS file.

Thank you very much in advance.

1 Answer


I guess you have CSV files, so you can read them directly into a DataFrame:

val df1 = spark.read.option("header","true").csv("hdfs:///pathtoyourfile_1")
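
You can read the second file the same way (the path here is just a placeholder following the same pattern):

val df2 = spark.read.option("header","true").csv("hdfs:///pathtoyourfile_2")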

The `org.apache.spark.sql.functions` object contains a `def levenshtein(l: Column, r: Column): Column` function, so you need to pass it DataFrame columns of String type. If you want to use a group of columns, you can concatenate them with `concat('col1, 'col2, ...)` and pass the result to the previous function. If you have two or more DataFrames, you have to join them into a single DataFrame and then perform the distance calculation. Finally, you can save your results to CSV using `df.write.csv("path")`.
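
For illustration, a minimal sketch of the above, assuming the two DataFrames share a join key `id` and that the name columns are `col1`/`col2` and `colA`/`colB` (all of these names are placeholders, not from the question):

import org.apache.spark.sql.functions.{col, concat, levenshtein}

// Concatenate the name columns on each side, join on the shared key,
// then compute the Levenshtein distance between the concatenated strings.
val result = df1.withColumn("name1", concat(col("col1"), col("col2")))
  .join(df2.withColumn("name2", concat(col("colA"), col("colB"))), "id")
  .withColumn("lev_distance", levenshtein(col("name1"), col("name2")))

result.write.csv("hdfs:///path_to_results")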

chlebek
  • Thank you very much for this interesting approach. If I concat columns, how will the Levenshtein algorithm be impacted? In the first file I have a single column containing customer names (first and last name in any order), and in the second I need to concat first and last name to build the same customer name; depending on the order, won't the Levenshtein algorithm produce different scores? If so, how can I prevent this? Thank you in advance! – hadoop master Feb 06 '21 at 22:57
  • If you don't know which value is the first/last name, you can concat them, sort the characters alphabetically, and then compute the distance. For example, `split` will create an array of characters, then `array_sort` will sort the array, and `array_join` will build a single sorted string: `array_join(array_sort(split('value,"")),"")`, so "JohnSmith" => "JShhimnot" (see the sketch after this thread). – chlebek Feb 06 '21 at 23:29
  • Great! Thanks a lot. Just one last question: I don't actually have any key to join the two files; each one comes from a different source system without a common key. The goal is to parse them and calculate the similarity between all the entries in the first file and all the rows in the second one. How can I create a single DataFrame if there is no common key? – hadoop master Feb 07 '21 at 18:37
  • You can add an id column to both dataframes and use it to join them: `df.coalesce(1).withColumn("idx", monotonically_increasing_id())`. You should use `coalesce(1)` to keep the order of rows; otherwise Spark could mix the row order. More here: https://stackoverflow.com/questions/48209667/using-monotonically-increasing-id-for-assigning-row-number-to-pyspark-datafram – chlebek Feb 07 '21 at 20:59
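
Putting the pieces from this thread together, a rough, untested sketch of the full pipeline. Column names such as `name`, `first_name`, and `last_name` are assumptions, since the real schemas weren't posted:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Sort the characters of a name so that the same first/last names
// concatenated in either order normalize to one string, e.g.
// "JohnSmith" and "SmithJohn" both become "JShhimnot".
def sortedChars(c: Column): Column = array_join(array_sort(split(c, "")), "")

// No common key: give each dataframe a synthetic row id.
// coalesce(1) keeps the row order stable, as discussed above.
val left = df1.coalesce(1)
  .withColumn("idx", monotonically_increasing_id())
val right = df2.coalesce(1)
  .withColumn("name2", concat(col("first_name"), col("last_name")))
  .withColumn("idx", monotonically_increasing_id())

// Join on the synthetic id and score each pair of normalized names.
val scored = left.join(right, "idx")
  .withColumn("lev", levenshtein(sortedChars(col("name")), sortedChars(col("name2"))))

scored.write.csv("hdfs:///path_to_output")

Note that joining on a synthetic row id pairs rows positionally; comparing every row of one file against every row of the other would instead require a `crossJoin`, which at 150M x 11M rows would be very expensive.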