
I need your help designing an approach (and a code template, if possible) for the task below in Scala + Spark.

  • file 1: a CSV file with "id", "name", and "description" columns.

           id,name,description
           1,a,first post in stack overflow
           2,b,second post in stack overflow 
           3,c,third post in stack overflow
           2a,b,seconds post in stack overflow 
           4,b,never posted in stack overflow
    
  • file 2: another CSV file, with "keyword" and "value" columns.

           keyword,value
           second,1234
           never post,0000
    

Requirement:

Search for each file2:keyword in file1:description. On a match, concatenate all file1 columns with file2:"keyword", a computed "matching_percentage", and file2:"value", then write the result as a CSV.

  • output: a new CSV with columns id,name,description,keyword,match_percentage,value

           1,a,first post in stack overflow,none,00,none
           2,b,second post in stack overflow,second,x%,1234
           3,c,third post in stack overflow,none,00,none
           2a,b,seconds post in stack overflow,second,x-%,1234 
           4,b,never posted in stack overflow,never post,y%,0000
    
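The question leaves the matching_percentage formula open, so one possible definition (an assumption, not part of the original spec) is the Levenshtein similarity between the keyword and the closest same-length window of the description; that way "seconds" still scores high against "second". A minimal pure-Scala sketch (the helper names `levenshtein` and `matchPercentage` are hypothetical) that could later be wrapped in a Spark UDF:

```scala
// Classic dynamic-programming edit distance between two strings.
def levenshtein(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(
      math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1),
      dp(i - 1)(j - 1) + cost
    )
  }
  dp(a.length)(b.length)
}

// Slide a window of the keyword's length over the description and score
// the best window: 100% means an exact substring match.
def matchPercentage(description: String, keyword: String): Double = {
  val k = keyword.length
  if (k == 0 || description.length < k) 0.0
  else {
    val best = (0 to description.length - k)
      .map(i => levenshtein(description.substring(i, i + k), keyword))
      .min
    100.0 * (1.0 - best.toDouble / k)
  }
}
```

With the sample data, "second post in stack overflow" scores 100% against "second", while "first post in stack overflow" scores below 100%. Spark SQL also ships a built-in `levenshtein` function if you prefer to stay in SQL.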

I tried a substring match in Spark SQL:

select a.*, b.* from file1 a left join file2 b on a.description like concat('%', b.keyword, '%')

but it is taking a long time (more than 50 minutes) because the non-equi join degenerates into a Cartesian join. file1 has 10 million records and file2 has 500 records. I have also not figured out the matching_percentage calculation yet.
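Since file2 has only 500 rows it fits comfortably in memory, so one option (a sketch under that assumption, not tuned on your data) is to broadcast it, e.g. via `org.apache.spark.sql.functions.broadcast` on the join or `sparkContext.broadcast` of a collected list, so each executor scans its partition of the 10M rows against an in-memory keyword list instead of shuffling for a Cartesian join. The per-row logic is just a linear scan, shown here in plain Scala (the `Keyword` case class and `matchRow` name are hypothetical):

```scala
// Hypothetical row type for the small, broadcast side (file2).
case class Keyword(keyword: String, value: String)

// For one description, scan the broadcast keyword list and produce the
// extra output columns: matched keyword, match percentage, value.
// "none"/"00" mirror the sample output for non-matching rows; `contains`
// is an exact substring test, so a hit is reported as 100%.
def matchRow(description: String, keywords: Seq[Keyword]): (String, String, String) =
  keywords.find(k => description.contains(k.keyword)) match {
    case Some(k) => (k.keyword, "100", k.value)
    case None    => ("none", "00", "none")
  }
```

In Spark this would run inside a UDF or `mapPartitions` over file1; swapping the exact `contains` for a fuzzy score is the remaining open piece.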

Any help is highly appreciated. Thanks in advance.

  • Written in Python but exactly the same rules apply http://stackoverflow.com/q/33168970/1560062 – zero323 Oct 01 '16 at 16:29
  • Hello, many thanks for the reference. Tokenizing the document_text splits the text into words, but when the keyword column contains multiple words in a row, it does not match. – Spark_explorer Oct 02 '16 at 07:55

0 Answers