Adding a new column to my Spark DataFrame and calculating the sum()

Question

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

score 0 · Answer 1 · answered May 20 '19 at 15:14

First of all: It is really important that you give us a reproducible example of your dataframe. Nobody likes to look at screenshots to identify an error.

Your code is not working because Spark can't determine how the rows of your groupby result and your initial dataframe should be merged. It isn't aware that NUM_TIERS is some kind of key. Therefore you have to tell Spark which column(s) should be used to merge the groupby result with the initial dataframe.

import pyspark.sql.functions as F
from pyspark.sql import Window

l = [('OBAAAA7K2KBBO', 34),
     ('OBAAAA878000K', 138),
     ('OBAAAA878A2A0', 164),
     ('OBAAAA7K2KBBO', 496),
     ('OBAAAA878000K', 91)]

columns = ['NUM_TIERS', 'MONTAN_TR']

# `spark` is the active SparkSession (predefined in the pyspark shell and most notebooks)
df = spark.createDataFrame(l, columns)

You have two options to do that. You can use a join:

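# Alias the aggregate so the new column is named SUM rather than sum(MONTAN_TR)
totals = df.groupby('NUM_TIERS').agg(F.sum('MONTAN_TR').alias('SUM'))
df = df.join(totals, 'NUM_TIERS')
df.show()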

Or a window function:

w = Window.partitionBy('NUM_TIERS')

df = df.withColumn('SUM', F.sum('MONTAN_TR').over(w))

Both approaches produce the same output (the row order may differ):

+-------------+---------+---+ 
|    NUM_TIERS|MONTAN_TR|SUM| 
+-------------+---------+---+ 
|OBAAAA7K2KBBO|       34|530| 
|OBAAAA7K2KBBO|      496|530| 
|OBAAAA878000K|      138|229| 
|OBAAAA878000K|       91|229| 
|OBAAAA878A2A0|      164|164| 
+-------------+---------+---+
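
If you want to sanity-check these numbers without a Spark session, the per-key logic can be sketched in plain Python. This only illustrates what both approaches compute (one total per NUM_TIERS, attached to every row with that key), not how Spark executes them. The names `rows`, `totals` and `with_sum` are made up for this sketch.

```python
from collections import defaultdict

# Same sample rows as above: (NUM_TIERS, MONTAN_TR)
rows = [
    ('OBAAAA7K2KBBO', 34),
    ('OBAAAA878000K', 138),
    ('OBAAAA878A2A0', 164),
    ('OBAAAA7K2KBBO', 496),
    ('OBAAAA878000K', 91),
]

# Step 1 (the groupby): one total per NUM_TIERS key
totals = defaultdict(int)
for num_tiers, montan_tr in rows:
    totals[num_tiers] += montan_tr

# Step 2 (the join on the key): attach each key's total to every original row
with_sum = [(num_tiers, montan_tr, totals[num_tiers])
            for num_tiers, montan_tr in rows]

for row in sorted(with_sum):
    print(row)
```

The key lookup in step 2 is the part your original code was missing: without a key, there is no way to know which total belongs to which row.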

Thank you, that was very helpful. I used the join to handle it. - Youssef Assouli, May 23 '19 at 15:20
