
I have seen multiple posts, but in them the aggregation is done over multiple columns; I want the aggregation based on the column OPTION_CD, subject to the condition below. I have conditions attached to the DataFrame query, and it is giving me the error 'DataFrame' object has no attribute '_get_object_id'.

IFNULL(STRING_AGG(OPTION_CD, '' ORDER BY OPTION_CD), ''). What I understand from this is: if the OPTION_CD column is null, place a blank; otherwise append the OPTION_CD values into one row, separated by a blank. Following is the sample table:

(image: original CSV on which the df is created)

First there is a filter to keep only 1 and 2 from COL1; then the result should look like this: (image: expected result)

Following is the query that I am writing against my DataFrame:

df_result = df.filter((df.COL1 == 1) | (df.COL1 == 2)).select(
    df.COL1,
    df.COL2,
    (when(df.OPTION_CD == "NULL", " ")
        .otherwise(df.groupBy(df.OPTION_CD))).agg(collect_list(df.OPTION_CD))
)

But I am not getting the desired results. Can anyone help with this? I am using PySpark.

tarun kumar Sharma
  • Please write your code in the appropriate format and not only parts. This makes little sense because it seems you are trying to group by and aggregate on the same column (option_cd) – Michail N Jun 26 '18 at 09:48
  • Maybe you want to group by Col1 and Col2 and aggregate with collect_list on the option_cd column? – Michail N Jun 26 '18 at 09:51
  • I have updated the code for your reference. Yes @Michail, if COL1 and COL2 are the same then the option_cd values must be appended, separated by a blank; else there is no change – tarun kumar Sharma Jun 26 '18 at 09:54
  • What data type do you want the option column in the final dataframe to be? It seems that you want to mix a list that comes from an aggregate with a string. Also, does the first dataframe have spaces or null values? – Michail N Jun 26 '18 at 10:29
  • Possible duplicate of [SPARK SQL replacement for mysql GROUP\_CONCAT aggregate function](https://stackoverflow.com/questions/31640729/spark-sql-replacement-for-mysql-group-concat-aggregate-function) – Alper t. Turker Jun 26 '18 at 10:55
  • @Michail, the data type for all the cols is Integer, and I was trying to append the OPTION_CD values separated by a blank, for example: 4 5 2 – tarun kumar Sharma Jun 26 '18 at 11:21
  • If it is an integer, why do you check with a space ' '? In the final dataframe it cannot be an integer because you aggregate to a list – Michail N Jun 26 '18 at 11:47
  • @Michail, the final aggregated col is a string; maybe there is typecasting. I tried the following code: (when(df.OPTION_CD == "NULL", " ").otherwise(df.groupBy(df.COL1).agg(collect_list(df.OPTION_CD)))) – tarun kumar Sharma Jun 26 '18 at 12:03
  • I tried the code from the above comment but am getting the following error: 'DataFrame' object has no attribute '_get_object_id' – tarun kumar Sharma Jun 26 '18 at 12:04
  • @user8371915 I have tried some solutions from the link that you provided, but I am getting the error 'DataFrame' object has no attribute '_get_object_id' when I run the code: (when(df.OPTION_CD == "NULL", " ").otherwise(df.groupBy(df.COL1).agg(collect_list(df.OPTION_CD)))) – tarun kumar Sharma Jun 26 '18 at 12:07

1 Answer


You do not express your question clearly enough, but I will try to answer it.

You need to understand that a DataFrame column can have only one data type for all rows. If your initial data are integers, then you cannot check for equality with the empty string; you have to check against null values instead, for example as sketched below.
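For illustration, here is a minimal sketch of that point (df_blank and option_or_blank are illustrative names, not part of your code): comparing with the literal string "NULL", as in df.OPTION_CD == "NULL", will not match real nulls, whereas isNull() will.

from pyspark.sql import functions as F

# Illustrative only: replace real nulls with a blank, keep other values as strings.
df_blank = df.withColumn(
    "option_or_blank",
    F.when(F.col("OPTION_CD").isNull(), F.lit(""))      # null -> blank
     .otherwise(F.col("OPTION_CD").cast("string"))      # otherwise keep the value, as a string
)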

Also, collect_list returns an array of integers, so you cannot have [7, 5] in one row and an empty string '' in another. In any case, does this work for you?

from pyspark.sql.functions import col, collect_list

listOfTuples = [(1, 3, 1),(2, 3, 2),(1, 4, 5),(1, 4, 7),(5, 5, 8),(4, 1, 3),(2,4,None)]
df = spark.createDataFrame(listOfTuples , ["A", "B", "option"])
df.show()
>>>
+---+---+------+
|  A|  B|option|
+---+---+------+
|  1|  3|     1|
|  2|  3|     2|
|  1|  4|     5|
|  1|  4|     7|
|  5|  5|     8|
|  4|  1|     3|
|  2|  4|  null|
+---+---+------+


dfFinal = df.filter((df.A == 1)|(df.A == 2)).groupby(['A','B']).agg(collect_list(df['option']))
dfFinal.show()
>>>   
+---+---+--------------------+
|  A|  B|collect_list(option)|
+---+---+--------------------+
|  1|  3|                 [1]|
|  1|  4|              [5, 7]|
|  2|  3|                 [2]|
|  2|  4|                  []|
+---+---+--------------------+
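If, as discussed in the comments, you want a space-separated string such as 4 5 2 rather than an array, a possible follow-up is to cast the column to string and join the collected list with concat_ws (a sketch only, not tested against your actual data; dfString and options are illustrative names):

from pyspark.sql.functions import collect_list, concat_ws

# Cast to string so concat_ws can join the collected values;
# an empty list collapses to an empty string, which matches the "blank" requirement.
dfString = (df.filter((df.A == 1) | (df.A == 2))
              .groupby(['A', 'B'])
              .agg(concat_ws(' ', collect_list(df['option'].cast('string'))).alias('options')))
dfString.show()

Note that collect_list does not guarantee element order, so if you need the ORDER BY behaviour of your original STRING_AGG expression you would have to sort explicitly (for example with sort_array on the collected array).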
Michail N