
I see a couple of posts, post1 and post2, which are relevant to my question. However, while following post1's solution I run into the error below.

joinedDF = df.join(df_agg, "company")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 1050, in join
    jdf = self._jdf.join(other._jdf, on, how)
AttributeError: 'NoneType' object has no attribute '_jdf'

The entire code snippet:

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")

df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending=False).show()

joinedDF = df.join(df_agg, "company")
1 Answer
On the second line you have .show() at the end:

df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending=False).show()

Remove it, like this:

df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending=False)

and your code should work.

You called an action (.show()) on that DataFrame and assigned its result to the df_agg variable. show() prints the rows and returns nothing, which is why your variable ends up as NoneType (in Python) or Unit (in Scala).
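The same pitfall exists in plain Python: any method that works by side effect and returns None breaks the chain as soon as you assign its result. A minimal sketch using the built-in list.sort() (not Spark, just an analogue):

```python
values = [3, 1, 2]
result = values.sort()  # sort() mutates the list in place and returns None

print(values)   # the list itself is sorted: [1, 2, 3]
print(result)   # None - just like df_agg after .show()

# Calling a method on `result` now fails the same way the join did:
#   result.append(4)  -> AttributeError: 'NoneType' object has no attribute 'append'
```

In Spark terms: transformations (groupBy, agg, orderBy, join) return a new DataFrame you can keep chaining, while actions like show() or count() trigger execution and return a result (or nothing), not a DataFrame.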

  • Thank you so much - I don't run into any error anymore - however, post join I lose the groupBy and aggregation which happened in df_agg – oneday Jan 16 '20 at 10:34
  • After the join you lose the TotalRaised column? Please mark the answer – M. Alexandru Jan 16 '20 at 11:03
  • Nope, I lose the groupBy and orderBy operations performed – oneday Jan 16 '20 at 11:08
  • Looking at your code, that behavior is normal. You have the "df" variable, which is raw data, and then df_agg, which is aggregated. You are joining the first one ("df", non-aggregated) with df_agg (aggregated), so for each row of your "df" you are bringing in the TotalRaised column. If you want only the aggregated data, use df_agg – M. Alexandru Jan 16 '20 at 15:23
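The join semantics described in the last comment can be sketched quickly with pandas standing in for Spark (an assumption made only so the example runs locally; the DataFrame names mirror the question, the data is made up):

```python
import pandas as pd

# Raw rows, like "df" in the question: two rows for company A, one for B.
df = pd.DataFrame({"company": ["A", "A", "B"],
                   "raisedAmt": [10, 20, 5]})

# One row per company with the total, like df_agg.
df_agg = (df.groupby("company", as_index=False)["raisedAmt"].sum()
            .rename(columns={"raisedAmt": "TotalRaised"}))

# Joining the raw rows onto the aggregate keeps one row per ORIGINAL
# row, each annotated with its company's TotalRaised - the groupBy is
# not "lost", it is simply broadcast back onto the detail rows.
joined = df.merge(df_agg, on="company")
```

Here df_agg has 2 rows (one per company) while joined has 3 (one per original row), which is exactly the behavior the comment calls normal: use df_agg alone if you only want the aggregated data.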