
Hi, I have two DataFrames.

Input DataFrame:

id   number      idsc                 mfd 
738  as6812      *fage abc van brw    amz 
745  786-151     *glaeceau smt sp     amz 
759  b0nadum     ankush 574415 mo...  admz 
764  fdad3-al-c  lib anvest-al...     amz 
887  rec-2s-5    or abc sur...        c 
64   00954       ankush pure g...     amz 
8    0000686     dor must die         a 
3    000adf623   bsc test 10-pi...    amz 

Conditions DataFrame:

condition      destinationfield expression                                b_id          
True           idsc             [idsc].lower()                            1 
[mfd]=="amz"   idsc             re.sub(r'\babc\b','a',[idsc])             1 
[mfd]=="admz"  idsc             re.sub(r'and \d+ other item', '', [idsc]) 1 
True           idsc             re.sub(r'[^a-z0-9 ]',r'',[idsc])          1 
True           idsc             [idsc].strip()                            1 
[mfd] == "c"   idsc             re.sub(r'\bankush\b','ank',[idsc])        1 
True           number           re.sub(r'[^0-9]',r'',[number])            1
True           number           [number].strip()                          1 

I am looking to apply each rule in the conditions DataFrame to the input DataFrame and obtain a new DataFrame.

If a rule's condition is True, it should be applied to all rows; if the condition is anything other than True, it should be applied only to the rows that satisfy it.

Is there a better approach to do this in PySpark than running the rules in for loops, given that the regular expressions are Python ones? Expected output:

id   number     idsc              mfd 
738  as6812     *fage a van brw   amz 
745  786-151    *glaeceau smt sp  amz 
759  b0nadum    ank 574415 mo...  admz 
764  fdad3-al-c lib anvest-al...  amz 
887  rec-2s-5   or a sur...       c 
64   00954      ank pure g...     amz 
8    0000686    dor must die      a 
3    000adf623  bsc test 10-pi... amz 

Input data, pipe-separated:

id|number|idsc|mfd 
738|as6812|*fage abc van brw|amz 
745|786-151|*glaeceau smt sp|amz 
759|b0nadum|ankush 574415 mo...|admz 
764|fdad3-al-c|lib anvest-al...|amz 
887|rec-2s-5|or abc sur...|c 
64| 00954|ankush pure g...|amz 
8|0000686|dor must die|a 
3|000adf623|bsc test 10-pi...|amz 

Conditions data, pipe-separated:

condition|destinationfield|expression|b_id
True|idsc|[idsc].lower()|1 
[mfd]=="amz"|idsc|re.sub(r'\babc\b','a',[idsc])|1 
[mfd]=="admz"|idsc|re.sub(r'and \d+ other item', '', [idsc])|1 
True|idsc|re.sub(r'[^a-z0-9 ]',r'',[idsc])|1 
True|idsc|[idsc].strip()|1 
[mfd] == "c"|idsc|re.sub(r'\bankush\b','ank',[idsc])|1 
True|number|re.sub(r'[^0-9]',r'',[number])|1
True|number|[number].strip()|1 

Thanks, Ankush Reddy


1 Answer

You could try to make your conditions DataFrame evaluable.

If it is evaluable, you can call eval() on the conditions and expressions.

def apply_condition(df, df_condition):
    # Write a helper get_df_evaluable_condition which both
    # replaces "[any_column]" with "df['any_column']" in the condition, and
    # replaces "[destinationfield]" with "element" in the expression.
    df_evaluable_condition = get_df_evaluable_condition(df_condition)

    for index, row in df_evaluable_condition.iterrows():
        condition = row['condition']
        destinationfield = row['destinationfield']
        expression = row['expression']
        # only apply the expression where the condition holds
        df.loc[eval(condition), destinationfield] = \
            df.loc[eval(condition), destinationfield].apply(lambda element: eval(expression))
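As a self-contained illustration, here is a minimal sketch of what the helper and the loop could look like, using two regex substitutions to rewrite the placeholders. The helper name, sample data, and the scalar-mask handling are my assumptions, not a definitive implementation:

```python
import re
import pandas as pd

def get_df_evaluable_condition(df_condition):
    # Hypothetical helper: rewrite "[col]" placeholders into evaluable Python.
    out = df_condition.copy()
    # In the condition, "[mfd]" becomes df['mfd'], so comparisons yield a
    # boolean mask; the literal string "True" selects every row.
    out['condition'] = out['condition'].str.replace(
        r'\[(\w+)\]', r"df['\1']", regex=True)
    # In the expression, the placeholder refers to the single cell value.
    out['expression'] = out['expression'].str.replace(
        r'\[(\w+)\]', 'element', regex=True)
    return out

# Hypothetical sample data mirroring the question's layout.
df = pd.DataFrame({'idsc': ['*Fage abc van brw', 'or abc sur'],
                   'mfd': ['amz', 'c']})
df_condition = pd.DataFrame({
    'condition': ['True', '[mfd]=="amz"'],
    'destinationfield': ['idsc', 'idsc'],
    'expression': ["[idsc].lower()", r"re.sub(r'\babc\b','a',[idsc])"]})

for _, row in get_df_evaluable_condition(df_condition).iterrows():
    mask = eval(row['condition'])
    if isinstance(mask, bool):          # "True" evaluates to a scalar
        mask = pd.Series(mask, index=df.index)
    # rewrite only the matching rows of the destination column
    df.loc[mask, row['destinationfield']] = \
        df.loc[mask, row['destinationfield']].apply(
            lambda element: eval(row['expression']))
```

After both rules run, the first row is lower-cased and has "abc" replaced by "a", while the second row (mfd == "c") only gets the lower-casing rule.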

By the way, if an expression references a column other than the destination column, the last line of code won't work: the lambda only sees the destination cell's value. You will need something more complex to achieve that.
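One sketch for that case (my assumption: the placeholders are rewritten to "row['col']" instead of "element") is to evaluate the expression row-wise, so it can read any column at the cost of a slower apply:

```python
import pandas as pd

# Hypothetical sample: an expression that mixes two columns.
df = pd.DataFrame({'number': [' 00954 '], 'idsc': ['ank']})

# With "[any_column]" rewritten to "row['any_column']", the expression can
# reference any column of the current row:
expression = "row['idsc'] + ' ' + row['number'].strip()"
df['idsc'] = df.apply(lambda row: eval(expression), axis=1)
```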

I don't advise using this method if you don't control what your conditions DataFrame contains. Never call eval() on unknown strings!

S. Lundy