
I see a couple of posts, post1 and post2, which are relevant to my question. However, while following post1's solution I run into the error below.

joinedDF = df.join(df_agg, "company")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 1050, in join
    jdf = self._jdf.join(other._jdf, on, how)
AttributeError: 'NoneType' object has no attribute '_jdf'

The entire code snippet:

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")

df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending=False).show()

joinedDF = df.join(df_agg, "company")
1 Answer
On the second line you have .show() at the end:

df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending=False).show()

Remove it, like this:

df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending=False)

and your code should work.

You called an action (.show()) on that DataFrame and assigned its result to the df_agg variable. show() prints the rows and returns nothing, which is why your variable ends up as NoneType (in Python) or Unit (in Scala).
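The same pitfall exists in plain Python: any method that works by side effect and returns None breaks the chain as soon as you assign its result. A minimal sketch using the built-in list.sort() (not Spark, just an analogue):

```python
values = [3, 1, 2]
result = values.sort()  # sort() mutates the list in place and returns None

print(values)   # the list itself is sorted: [1, 2, 3]
print(result)   # None - just like df_agg after .show()

# Calling a method on `result` now fails the same way the join did:
#   result.append(4)  -> AttributeError: 'NoneType' object has no attribute 'append'
```

In Spark terms: transformations (groupBy, agg, orderBy, join) return a new DataFrame you can keep chaining, while actions like show() or count() trigger execution and return a result (or nothing), not a DataFrame.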

  • Thank you so much - I don't run into any error anymore - however, post join I lose the groupBy and aggregation which happened in df_agg – oneday Jan 16 '20 at 10:34
  • After the join you lose the TotalRaised column? Please mark the answer – M. Alexandru Jan 16 '20 at 11:03
  • Nope, I lose the groupBy and orderBy operations performed – oneday Jan 16 '20 at 11:08
  • Looking at your code, that behavior is normal. You have the "df" variable, which is raw data, and then df_agg, which is aggregated. You are joining the first one ("df", non-aggregated) with df_agg (aggregated), so for each row of your "df" you are bringing in the TotalRaised column. If you want only the aggregated data, use df_agg – M. Alexandru Jan 16 '20 at 15:23
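The join semantics described in the last comment can be sketched quickly with pandas standing in for Spark (an assumption made only so the example runs locally; the DataFrame names mirror the question, the data is made up):

```python
import pandas as pd

# Raw rows, like "df" in the question: two rows for company A, one for B.
df = pd.DataFrame({"company": ["A", "A", "B"],
                   "raisedAmt": [10, 20, 5]})

# One row per company with the total, like df_agg.
df_agg = (df.groupby("company", as_index=False)["raisedAmt"].sum()
            .rename(columns={"raisedAmt": "TotalRaised"}))

# Joining the raw rows onto the aggregate keeps one row per ORIGINAL
# row, each annotated with its company's TotalRaised - the groupBy is
# not "lost", it is simply broadcast back onto the detail rows.
joined = df.merge(df_agg, on="company")
```

Here df_agg has 2 rows (one per company) while joined has 3 (one per original row), which is exactly the behavior the comment calls normal: use df_agg alone if you only want the aggregated data.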