
I am new to PySpark and was wondering how method chaining works there. In pandas I would use assign with a lambda, for example:

import pandas as pd

df = pd.DataFrame({'number': [1, 2, 3], 'date': ['31-dec-19', '02-jan-18', '14-mar-20']})

df = (df.assign(number_plus_one = lambda x: x.number + 1)
        .assign(date = lambda x: pd.to_datetime(x.date))
        .loc[lambda x: x.number_plus_one.isin([2,3])]
        .drop(columns=['number','number_plus_one'])
      )

How would you write the same code in PySpark without converting it to a pandas dataframe? I guess you could use filter, withColumn and drop, but how exactly would you do it with method chaining?

corianne1234

1 Answer


You can do the same thing in Spark by chaining calls in a similar way. Here's an example:

from pyspark.sql import Row
from pyspark.sql import functions as f

sc.parallelize([Row(number=1, date='31-dec-19'),
                Row(number=2, date='02-jan-18'),
                Row(number=3, date='14-mar-20')])\
.toDF()\
.withColumn('number_plus_one', f.col('number') + 1)\
.filter(f.col('number_plus_one').isin(2, 3))\
.drop('number', 'number_plus_one')\
.show()

Result

+---------+
|     date|
+---------+
|31-dec-19|
|02-jan-18|
+---------+
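
If you want method chaining that looks closer to the pandas style in the question, the same pipeline can also be written as a single parenthesized expression instead of backslash continuations. A minimal sketch, assuming an active SparkSession named `spark`:

from pyspark.sql import functions as f

df = spark.createDataFrame(
    [(1, '31-dec-19'), (2, '02-jan-18'), (3, '14-mar-20')],
    ['number', 'date'])

result = (df
          .withColumn('number_plus_one', f.col('number') + 1)  # like .assign
          .filter(f.col('number_plus_one').isin(2, 3))         # like .loc
          .drop('number', 'number_plus_one'))                  # like .drop
result.show()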
ernest_k
  • Thanks a lot. I do not really understand the `f` you are using and I get an error about it: `An error was encountered: name 'f' is not defined`. What can I do about this? – corianne1234 Apr 28 '20 at 13:14
  • 1
    Sorry, it's an alias: `from pyspark.sql import functions as f` – ernest_k Apr 28 '20 at 13:15
  • great, thanks. Just one more question, how would you do the `pd.to_datetime(x.date)`? – corianne1234 Apr 28 '20 at 13:28
  • There's no direct way to do that with column values. You can do something like `spark.createDataFrame([[1,'31-dec-19'],[2,'02-jan-18'],[3,'14-mar-20']], ['number', 'date'])`. So just zip your columns into rows. – ernest_k Apr 28 '20 at 13:38
  • But then I still have date strings? I have a column with dates, but they are strings, not in date format. How would I convert it to a date then? – corianne1234 Apr 28 '20 at 13:44
  • You can convert to date time. You can find answers here: https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format – ernest_k Apr 28 '20 at 13:53
  • How would you do it though if you had `df.columns` like for example `df.[...some stuff before].select([f.col(c).cast("string") for c in df.columns])`? Obviously df.columns refers to the dataframe before it was modified, so it is wrong. How can you do this? – corianne1234 May 04 '20 at 12:11
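
Following up on the string-to-date question in the comments: the usual Spark counterpart to `pd.to_datetime` is `f.to_date` with an explicit format string. A minimal sketch, assuming the column is named `date` and holds values like '31-dec-19' (the 'dd-MMM-yy' pattern is inferred from the sample data; depending on your Spark version you may need to adjust it or set spark.sql.legacy.timeParserPolicy):

from pyspark.sql import functions as f

# Convert the string column to DateType; 'dd-MMM-yy' is an assumed
# pattern matching sample values like '31-dec-19'.
df = df.withColumn('date', f.to_date(f.col('date'), 'dd-MMM-yy'))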
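
As for referencing `df.columns` in the middle of a chain (the last comment above): one option, assuming Spark 3.0+, is `DataFrame.transform`, which passes the current state of the chain into a function, so the columns you see inside it include everything added so far. A sketch:

from pyspark.sql import functions as f

result = (df
          .withColumn('number_plus_one', f.col('number') + 1)
          # d is the DataFrame as of this point in the chain, so
          # d.columns already includes 'number_plus_one'
          .transform(lambda d: d.select([f.col(c).cast('string')
                                         for c in d.columns])))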