I'm just getting started with the Hadoop ecosystem and have a question I need help with.
I have two HDFS files and need to compute the Levenshtein distance between a group of columns in the first file and a group of columns in the second.
This process will run every day on a considerable amount of data (150M rows in the first file vs. 11M rows in the second).
I would appreciate some guidance (code examples, references, etc.) on how to read my two files from HDFS, compute the Levenshtein distance as described (using Spark?), and save the results to a third HDFS file.
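To make the question concrete, here is a rough PySpark sketch of the kind of job I have in mind. Almost everything specific in it is an assumption on my part: the CSV format with a header row, the paths and column names passed in, the distance threshold, and above all the "blocking" step that only compares rows sharing the same first character. Some such restriction seems necessary, since comparing every row with every row would mean roughly 1.65 × 10^15 pairs (150M × 11M), but I don't know whether first-character blocking is a sensible choice for my data. The small `levenshtein_py` helper is only there so I can check Spark's built-in `levenshtein` function against a few hand-computed values.

```python
def levenshtein_py(a: str, b: str) -> int:
    """Plain-Python edit distance, used only to sanity-check a small sample."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def run_job(left_path, right_path, out_path, left_cols, right_cols, max_distance=2):
    """Hypothetical job: all arguments are placeholders for my real data."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("levenshtein-match").getOrCreate()

    # Assumption: both inputs are CSV files with a header row.
    left = spark.read.csv(left_path, header=True)
    right = spark.read.csv(right_path, header=True)

    # Collapse each group of columns into one comparison string per row.
    left = left.select(F.concat_ws(" ", *left_cols).alias("l_key"))
    right = right.select(F.concat_ws(" ", *right_cols).alias("r_key"))

    # Assumed blocking rule: only pair rows whose strings share a first character.
    left = left.withColumn("block", F.substring("l_key", 1, 1))
    right = right.withColumn("block", F.substring("r_key", 1, 1))

    result = (
        left.join(right, on="block")
        .withColumn("distance", F.levenshtein("l_key", "r_key"))
        .filter(F.col("distance") <= max_distance)
        .select("l_key", "r_key", "distance")
    )

    # Write the matched pairs back to HDFS as a third dataset.
    result.write.mode("overwrite").csv(out_path, header=True)
```

I would call it with something like `run_job("hdfs:///path/a", "hdfs:///path/b", "hdfs:///path/out", ["name", "city"], ["name", "city"])`, where the paths and column names are made up. Is this a reasonable direction, or is there a better way to limit the number of pairs being compared?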
Thank you very much in advance.