
I am new to PySpark and was wondering how method chaining works there. In pandas I would use assign with a lambda, for example:

import pandas as pd

df = pd.DataFrame({'number': [1, 2, 3], 'date': ['31-dec-19', '02-jan-18', '14-mar-20']})

df = (df.assign(number_plus_one = lambda x: x.number + 1)
        .assign(date = lambda x: pd.to_datetime(x.date))
        .loc[lambda x: x.number_plus_one.isin([2,3])]
        .drop(columns=['number','number_plus_one'])
      )

How would you write the same code in PySpark without converting it to a pandas dataframe? I guess you could use filter, withColumn and drop, but how exactly would you do it with method chaining?

corianne1234

1 Answer


You can do the same thing in Spark by chaining calls in a similar way. Here's an example:

from pyspark.sql import Row
from pyspark.sql import functions as f

sc.parallelize([Row(number=1, date='31-dec-19'),
                Row(number=2, date='02-jan-18'),
                Row(number=3, date='14-mar-20')])\
.toDF()\
.withColumn('number_plus_one', f.col('number') + 1)\
.filter(f.col('number_plus_one').isin(2, 3))\
.drop('number', 'number_plus_one')\
.show()

Result

+---------+
|     date|
+---------+
|31-dec-19|
|02-jan-18|
+---------+
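
If you want method chaining that looks closer to the pandas style in the question, the same pipeline can also be written as a single parenthesized expression instead of backslash continuations. A minimal sketch, assuming an active SparkSession named `spark`:

from pyspark.sql import functions as f

df = spark.createDataFrame(
    [(1, '31-dec-19'), (2, '02-jan-18'), (3, '14-mar-20')],
    ['number', 'date'])

result = (df
          .withColumn('number_plus_one', f.col('number') + 1)  # like .assign
          .filter(f.col('number_plus_one').isin(2, 3))         # like .loc
          .drop('number', 'number_plus_one'))                  # like .drop
result.show()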
ernest_k
  • Thanks a lot. I do not really understand the `f` you are using and I get an error about it: `An error was encountered: name 'f' is not defined`. What can I do about this? – corianne1234 Apr 28 '20 at 13:14
  • 1
    Sorry, it's an alias: `from pyspark.sql import functions as f` – ernest_k Apr 28 '20 at 13:15
  • great, thanks. Just one more question, how would you do the `pd.to_datetime(x.date)`? – corianne1234 Apr 28 '20 at 13:28
  • There's no direct way to do that with column values. You can do something like `spark.createDataFrame([[1,'31-dec-19'],[2,'02-jan-18'],[3,'14-mar-20']], ['number', 'date'])`. So just zip your columns into rows. – ernest_k Apr 28 '20 at 13:38
  • But then I still have date strings? I have a column with dates, but they are strings, not in date format. How would I convert it to a date then? – corianne1234 Apr 28 '20 at 13:44
  • You can convert to date time. You can find answers here: https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format – ernest_k Apr 28 '20 at 13:53
  • How would you do it though if you had `df.columns` like for example `df.[...some stuff before].select([f.col(c).cast("string") for c in df.columns])`? Obviously df.columns refers to the dataframe before it was modified, so it is wrong. How can you do this? – corianne1234 May 04 '20 at 12:11
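
Following up on the string-to-date question in the comments: the usual Spark counterpart to `pd.to_datetime` is `f.to_date` with an explicit format string. A minimal sketch, assuming the column is named `date` and holds values like '31-dec-19' (the 'dd-MMM-yy' pattern is inferred from the sample data; depending on your Spark version you may need to adjust it or set spark.sql.legacy.timeParserPolicy):

from pyspark.sql import functions as f

# Convert the string column to DateType; 'dd-MMM-yy' is an assumed
# pattern matching sample values like '31-dec-19'.
df = df.withColumn('date', f.to_date(f.col('date'), 'dd-MMM-yy'))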
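
As for referencing `df.columns` in the middle of a chain (the last comment above): one option, assuming Spark 3.0+, is `DataFrame.transform`, which passes the current state of the chain into a function, so the columns you see inside it include everything added so far. A sketch:

from pyspark.sql import functions as f

result = (df
          .withColumn('number_plus_one', f.col('number') + 1)
          # d is the DataFrame as of this point in the chain, so
          # d.columns already includes 'number_plus_one'
          .transform(lambda d: d.select([f.col(c).cast('string')
                                         for c in d.columns])))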