
Greetings! I need to rewrite the following code from Pandas to PySpark. I'm comfortable with PySpark, but I have no experience with Pandas. Could you please tell me what this code does?

potent_cases.loc[potent_cases['status'] == 2, 'is_too_old'] = (
    potent_cases.loc[potent_cases['status'] == 2, :]
    .apply(lambda x: x['close_date'] < dt.now() - timedelta(2), axis=1)
)

cases_to_create = potent_cases.loc[
    ((potent_cases['status'] == 2)
     & ((potent_cases['is_too_old'] == True) | (potent_cases['manual'] == False)))
    | (pd.isnull(potent_cases['status'])),
    ['shop_id', 'plu', 'last_shelf_datetime']]
Anton Bondar
    I think you forgot to include [any attempts](http://idownvotedbecau.se/noattempt/) (it looks like a basic `filter` and `select`; you could probably copy and paste half of that into PySpark and run it) and a [reproducible example](https://stackoverflow.com/q/48427185/8371915). Would you mind correcting this? :) – Alper t. Turker Jan 26 '18 at 12:03
  • @user8371915 Not really; I can't figure out what `.loc[potent_cases['status']==2,:]` means, and I cannot run the Pandas code (I wish I could). P.S. Of course I'm totally OK with lambda – Anton Bondar Jan 26 '18 at 12:20
    @AntonBondar - Really? You can't find any documentation on `loc`? – Andrew L Jan 26 '18 at 12:24

1 Answer


You have a DataFrame named `potent_cases`. `.loc` selects rows by label or by boolean mask. Here, for every row where the `status` column equals 2, the value in that row's `is_too_old` column is set to `True` or `False` according to the condition `x['close_date'] < dt.now() - timedelta(2)`, i.e. whether `close_date` lies more than two days in the past. Rows where `status` is not 2 are left untouched (their `is_too_old` stays NaN).
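To make that concrete, here is a minimal runnable sketch. Only the column names `status` and `close_date` come from your code; the toy values are made up for illustration:

```python
from datetime import datetime as dt, timedelta

import pandas as pd

# Hypothetical stand-in for potent_cases: two status-2 rows
# (one closed 3 days ago, one 1 day ago) and one with no status.
potent_cases = pd.DataFrame({
    'status': [2, 2, None],
    'close_date': [dt.now() - timedelta(3),
                   dt.now() - timedelta(1),
                   dt.now()],
})

# Same conditional assignment as in the question: only rows whose
# status equals 2 receive an is_too_old flag; the rest stay NaN.
mask = potent_cases['status'] == 2
potent_cases.loc[mask, 'is_too_old'] = potent_cases.loc[mask, :].apply(
    lambda x: x['close_date'] < dt.now() - timedelta(2), axis=1)
```

After this runs, `is_too_old` holds `True` for the row closed three days ago, `False` for the one closed yesterday, and NaN for the row whose `status` is missing.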

The first statement exists to prepare the `is_too_old` flag that the second one uses. The second statement selects the rows that satisfy the conditions (status 2 and either too old or not manual, or a missing status) and creates the DataFrame `cases_to_create` from the column subset `'shop_id'`, `'plu'`, `'last_shelf_datetime'`.

Joe