I have been building my application in Python, but now I need to move it to a distributed environment, so I'm trying to rebuild the application using Spark. However, I'm unable to come up with code as fast as shift
in Pandas.
mask = (df['name_x'].shift(0) == df['name_y'].shift(0)) & \
       (df['age_x'].shift(0) == df['age_y'].shift(0))
df = df[~mask]
Where
mask.tolist()
gives
[True, False, True, False]
The final result df
will contain only two rows (the 2nd and the 4th).
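For concreteness, here is a minimal, self-contained reproduction with made-up sample data (the values are hypothetical, not from my real dataset; note that shift(0) is a no-op, so the comparison is purely row-wise):

import pandas as pd

# Hypothetical sample data chosen so that mask comes out [True, False, True, False].
df = pd.DataFrame({
    'name_x': ['a', 'b', 'c', 'd'],
    'age_x':  [10, 20, 30, 40],
    'name_y': ['a', 'x', 'c', 'y'],
    'age_y':  [10, 99, 30, 40],
})

mask = (df['name_x'] == df['name_y']) & (df['age_x'] == df['age_y'])
print(mask.tolist())  # [True, False, True, False]
df = df[~mask]        # keeps only the 2nd and 4th rows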
Basically, I'm trying to remove rows where the [name_x, age_x] columns duplicate the [name_y, age_y] columns in the same row.
The above code operates on a Pandas DataFrame. What would be the closest PySpark code that is just as efficient, without importing Pandas?
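My best guess so far is something like the following sketch (with the same hypothetical sample data as above); I'm not sure whether this is the idiomatic or most efficient approach:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('dedupe').getOrCreate()

# Hypothetical sample data mirroring the Pandas example above.
sdf = spark.createDataFrame(
    [('a', 10, 'a', 10),
     ('b', 20, 'x', 99),
     ('c', 30, 'c', 30),
     ('d', 40, 'y', 40)],
    ['name_x', 'age_x', 'name_y', 'age_y'],
)

# Keep only the rows where the (name, age) pairs differ.
result = sdf.filter(
    ~((F.col('name_x') == F.col('name_y')) &
      (F.col('age_x') == F.col('age_y')))
)
result.show()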
I have also checked Window
on Spark but am not sure whether it applies here.
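From what I understand, a Window would only matter if I actually needed shift with a non-zero offset, since shift(0) never crosses rows. For reference, this is what I think the shift(1) analogue would look like, assuming a hypothetical 'id' column that defines the row order (Spark rows have no inherent order the way Pandas rows do):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with an explicit 'id' ordering column.
sdf = spark.createDataFrame(
    [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')],
    ['id', 'name_y'],
)

# lag over a Window ordered by 'id' plays the role of shift(1).
# Note: a Window without partitionBy pulls all rows into one partition.
w = Window.orderBy('id')
sdf = sdf.withColumn('prev_name_y', F.lag('name_y', 1).over(w))
sdf.show()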