I need your help designing a code template (if possible) for the task below, in Scala + Spark.
File 1: a CSV file with "id", "name", and "description" columns:

id,name,description
1,a,first post in stack overflow
2,b,second post in stack overflow
3,c,third post in stack overflow
2a,b,seconds post in stack overflow
4,b,never posted in stack overflow
File 2: also a CSV file, with "keyword" and "value" columns:

keyword,value
second,1234
never post,0000
Requirement:
Search for each file2 keyword in file1's description; on a match, concatenate all file1 columns with file2's keyword, a computed matching_percentage, and file2's value, then write the result to an output CSV.
Output: a new CSV with columns id,name,description,keyword,match_percentage,value:

id,name,description,keyword,match_percentage,value
1,a,first post in stack overflow,none,00,none
2,b,second post in stack overflow,second,x%,1234
3,c,third post in stack overflow,none,00,none
2a,b,seconds post in stack overflow,second,x-%,1234
4,b,never posted in stack overflow,never post,y%,0000
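Since I haven't settled on a definition of matching_percentage yet, here is a minimal sketch of one possible (hypothetical) definition as a Spark UDF: the share of the description's characters covered by the matched keyword. The formula is an assumption, not a requirement:

// Hypothetical metric: keyword length as a percentage of description length.
// Any other similarity (e.g. token overlap, Levenshtein) could be swapped in.
import org.apache.spark.sql.functions.udf

val matchPct = udf { (desc: String, kw: String) =>
  if (desc == null || kw == null || desc.isEmpty) 0.0
  else math.round(kw.length.toDouble / desc.length * 10000) / 100.0
}

It would be applied with something like `.withColumn("match_percentage", matchPct($"description", $"keyword"))` after the join.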
I tried a substring match in Spark SQL:

select a.*, b.*
from file1 a
left join file2 b
  on a.description like concat('%', b.keyword, '%')
but it is taking a long time (more than 50 minutes) due to the Cartesian join: file1 has 10 million records and file2 has only 500. I also have not figured out how to calculate matching_percentage yet.
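Since file2 is tiny (500 rows), one approach I'm considering is broadcasting it so Spark uses a broadcast nested-loop join instead of shuffling a Cartesian product. A minimal sketch, assuming an existing SparkSession named `spark` and the file paths shown:

// Sketch: broadcast the small keyword table to avoid a shuffled Cartesian join.
import org.apache.spark.sql.functions._

val file1 = spark.read.option("header", "true").csv("file1.csv")
val file2 = spark.read.option("header", "true").csv("file2.csv")

val joined = file1.join(
  broadcast(file2),                              // hint: replicate file2 to all executors
  col("description").contains(col("keyword")),   // substring match, same as the LIKE condition
  "left"                                         // keep unmatched file1 rows
)

val result = joined
  .withColumn("keyword", coalesce(col("keyword"), lit("none")))
  .withColumn("value", coalesce(col("value"), lit("none")))

result.write.option("header", "true").csv("output")

Each file1 row is still compared against all 500 keywords, but the comparison happens locally per executor with no shuffle, which should be far cheaper than the current plan.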
Any help is highly appreciated. Thanks in advance.