Hi I have two dataframes.
Input data Frame:-
id number idsc mfd
738 as6812 *fage abc van brw amz
745 786-151 *glaeceau smt sp amz
759 b0nadum ankush 574415 mo... admz
764 fdad3-al-c lib anvest-al... amz
887 rec-2s-5 or abc sur... c
64 00954 ankush pure g... amz
8 0000686 dor must die a
3 000adf623 bsc test 10-pi... amz
Check Conditions Data Frame:-
condition destinationfield expression b_id
True idsc [idsc].lower() 1
[mfd]=="amz" idsc re.sub(r'\abc\b','a',[idsc]) 1
[mfd]=="admz" idsc re.sub(r'and \d+ other item', '', [idsc]) 1
True idsc re.sub(r'[^a-z0-9 ]',r'',[idsc]) 1
True idsc [idsc].strip() 1
[mfd] == "c" idsc re.sub(r'\ankush\b','ank',[idsc]) 1
True number re.sub(r'[^0-9]',r'',[number]) 1
True number [number].strip() 1
I am looking for applying each and every rule of condition dataframe on input dataframe and obtain a new dataframe.
If my condition is true then I need to apply it on all the rows. if there is any specific value apart from true then I need to apply that condition for a specific record.
Is there any better approach to do this in pyspark as the regular expressions are related to python. Instead of running it in for loops.
id number idsc mfd
738 as6812 *fage a van brw amz
745 786-151 *glaeceau smt sp amz
759 b0nadum ank 574415 mo... admz
764 fdad3-al-c lib anvest-al... amz
887 rec-2s-5 or a sur... c
64 00954 ank pure g... amz
8 0000686 dor must die a
3 000adf623 bsc test 10-pi... amz
input data pipe separated
id| number|idsc|mfd
738|as6812|*fage abc van brw|amz
745|786-151|*glaeceau smt sp|amz
759|b0nadum|ankush 574415 mo...|admz
764|fdad3-al-c|lib anvest-al...|amz
887|rec-2s-5|or abc sur...|c
64| 00954|ankush pure g...|amz
8|0000686|dor must die a
3|000adf623|bsc test 10-pi...|amz
Conditions Data pipe separated
condition|destinationfield|expression|b_id|
True|idsc|[idsc].lower()|1
[mfd]=="amz"|idsc|re.sub(r'\abc\b','a',[idsc])|1
[mfd]=="admz"|idsc|re.sub(r'and \d+ other item', '', [idsc])|1
True|idsc|re.sub(r'[^a-z0-9 ]',r'',[idsc])|1
True|idsc|[idsc].strip()|1
[mfd] == "c"|idsc|re.sub(r'\ankush\b','ank',[idsc])|1
True|number|re.sub(r'[^0-9]',r'',[number])|1
True|number|[number].strip()|1
Thanks, Ankush Reddy