
I need your help designing an approach (and a code template, if possible) for the task below in Scala + Spark.

  • file 1: a CSV file with "id", "name", and "description" columns.

           id,name,description
           1,a,first post in stack overflow
           2,b,second post in stack overflow 
           3,c,third post in stack overflow
           2a,b,seconds post in stack overflow 
           4,b,never posted in stack overflow
    
  • file 2: another CSV file, with "keyword" and "value" columns.

           keyword,value
           second,1234
           never post,0000
    

Requirement:

Search for each file2:keyword in file1:description. On a match, concatenate all file1 columns with file2:"keyword", a computed "matching_percentage", and file2:"value", then write the result as a CSV.

  • output: a new CSV with columns id,name,description,keyword,match_percentage,value

           1,a,first post in stack overflow,none,00,none
           2,b,second post in stack overflow,second,x%,1234
           3,c,third post in stack overflow,none,00,none
           2a,b,seconds post in stack overflow,second,x-%,1234 
           4,b,never posted in stack overflow,never post,y%,0000
    
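The question leaves the matching_percentage formula open, so one possible definition (an assumption, not part of the original spec) is the Levenshtein similarity between the keyword and the closest same-length window of the description; that way "seconds" still scores high against "second". A minimal pure-Scala sketch (the helper names `levenshtein` and `matchPercentage` are hypothetical) that could later be wrapped in a Spark UDF:

```scala
// Classic dynamic-programming edit distance between two strings.
def levenshtein(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(
      math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1),
      dp(i - 1)(j - 1) + cost
    )
  }
  dp(a.length)(b.length)
}

// Slide a window of the keyword's length over the description and score
// the best window: 100% means an exact substring match.
def matchPercentage(description: String, keyword: String): Double = {
  val k = keyword.length
  if (k == 0 || description.length < k) 0.0
  else {
    val best = (0 to description.length - k)
      .map(i => levenshtein(description.substring(i, i + k), keyword))
      .min
    100.0 * (1.0 - best.toDouble / k)
  }
}
```

With the sample data, "second post in stack overflow" scores 100% against "second", while "first post in stack overflow" scores below 100%. Spark SQL also ships a built-in `levenshtein` function if you prefer to stay in SQL.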

I tried a substring match in Spark SQL:

select a.*, b.* from file1 a left join file2 b on a.description like concat('%', b.keyword, '%')

but it is taking a long time (more than 50 minutes) because the non-equi join degenerates into a Cartesian join. file1 has 10 million records and file2 has 500 records. I have also not figured out the matching_percentage calculation yet.
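Since file2 has only 500 rows it fits comfortably in memory, so one option (a sketch under that assumption, not tuned on your data) is to broadcast it, e.g. via `org.apache.spark.sql.functions.broadcast` on the join or `sparkContext.broadcast` of a collected list, so each executor scans its partition of the 10M rows against an in-memory keyword list instead of shuffling for a Cartesian join. The per-row logic is just a linear scan, shown here in plain Scala (the `Keyword` case class and `matchRow` name are hypothetical):

```scala
// Hypothetical row type for the small, broadcast side (file2).
case class Keyword(keyword: String, value: String)

// For one description, scan the broadcast keyword list and produce the
// extra output columns: matched keyword, match percentage, value.
// "none"/"00" mirror the sample output for non-matching rows; `contains`
// is an exact substring test, so a hit is reported as 100%.
def matchRow(description: String, keywords: Seq[Keyword]): (String, String, String) =
  keywords.find(k => description.contains(k.keyword)) match {
    case Some(k) => (k.keyword, "100", k.value)
    case None    => ("none", "00", "none")
  }
```

In Spark this would run inside a UDF or `mapPartitions` over file1; swapping the exact `contains` for a fuzzy score is the remaining open piece.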

Any help is highly appreciated. Thanks in advance.

  • Written in Python but exactly the same rules apply http://stackoverflow.com/q/33168970/1560062 – zero323 Oct 01 '16 at 16:29
  • Hello, many thanks for the reference. Tokenizing the document_text splits the text into words, but when the keyword column contains multiple words in a row, it does not match. – Spark_explorer Oct 02 '16 at 07:55

0 Answers