
I have a DataFrame OwnerMaster:

accountMasterId | OwnerMasterId | Owner name
----------------|---------------|-----------
123             | ABC           | Jack
456             | DEF           | Amy
789             | ABC           | Rach

I want a new DataFrame that has data like:

accoutMasterIdArray | OwnerMasterId
--------------------|--------------
{123,789}           | ABC
{456}               | DEF

The accoutMasterIdArray field should be of ArrayType. Any suggestions?
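For reproducibility, here is a minimal sketch that builds this sample DataFrame (the local SparkSession setup is an assumption for illustration; column names follow the tables above):

```scala
import org.apache.spark.sql.SparkSession

// assumption: a local SparkSession purely for illustration
val spark = SparkSession.builder()
  .appName("ownerMaster")
  .master("local[1]")
  .getOrCreate()
import spark.implicits._

// sample rows matching the table above
val ownerMaster = Seq(
  (123, "ABC", "Jack"),
  (456, "DEF", "Amy"),
  (789, "ABC", "Rach")
).toDF("accountMasterId", "OwnerMasterId", "Ownername")
```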

– Shruti Gusain (edited by ForeverLearner)

1 Answer


Use `.groupBy` with the `collect_list` function to create the array column.

// sample dataframe
ownerMaster.show()
//+---------------+-------------+---------+
//|accountMasterId|OwnerMasterId|Ownername|
//+---------------+-------------+---------+
//|            123|          ABC|     Jack|
//|            456|          DEF|      Amy|
//|            789|          ABC|     Rach|
//+---------------+-------------+---------+

import org.apache.spark.sql.functions.{col, collect_list}

ownerMaster.groupBy("OwnerMasterId")
  .agg(collect_list(col("accountMasterId")).alias("accoutMasterIdArray"))
  .show()

// cast the array to string type, then write it out as a CSV file
ownerMaster.groupBy("OwnerMasterId")
  .agg(collect_list(col("accountMasterId")).cast("string").alias("accoutMasterIdArray"))
  .show()
//+-------------+-------------------+
//|OwnerMasterId|accoutMasterIdArray|
//+-------------+-------------------+
//|          DEF|              [456]|
//|          ABC|         [123, 789]|
//+-------------+-------------------+

// schema
ownerMaster.groupBy("OwnerMasterId")
  .agg(collect_list(col("accountMasterId")).alias("accoutMasterIdArray"))
  .printSchema()
//root
// |-- OwnerMasterId: string (nullable = true)
// |-- accoutMasterIdArray: array (nullable = true)
// |    |-- element: integer (containsNull = true)
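If a downstream CSV writer rejects both the raw array and the bracketed string form, another option is to join the collected values with `concat_ws`, which yields a plain comma-separated string. This is a sketch reusing the ownerMaster DataFrame from above; note that `concat_ws` expects string elements, hence the extra cast:

```scala
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// join the collected ids into one comma-separated string per owner;
// cast each id to string first because concat_ws rejects integer arrays
val flat = ownerMaster.groupBy("OwnerMasterId")
  .agg(concat_ws(",", collect_list(col("accountMasterId").cast("string")))
    .alias("accoutMasterIdArray"))

flat.show()  // e.g. "123,789" for ABC and "456" for DEF (row order may vary)
```

The resulting column is a plain StringType with no brackets, which CSV sinks accept directly.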
– notNull
  • I am further converting the DataFrame into a DynamicFrame and then using it in my Glue job: `glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://apexfiledrop/entities"}"""), transformationContext = "datasink", format = "csv").writeDynamicFrame(joinDF)`. I get this error: com.amazonaws.services.glue.util.SchemaException: Cannot write array field associatedAccounts to CSV – Shruti Gusain May 11 '20 at 18:50
  • @ShrutiGusain, cast the `array` column to `string`, then write it as a CSV file: `ownerMaster.groupBy("OwnerMasterId").agg(collect_list(col("accountMasterId")).cast("string").alias("accoutMasterIdArray"))` https://stackoverflow.com/questions/40426106/spark-2-0-x-dump-a-csv-file-from-a-dataframe-containing-one-array-of-type-string – notNull May 11 '20 at 20:01
  • Can you please help me with the post below? It would be a great help. Thanks in advance. https://stackoverflow.com/questions/61815514/remove-null-array-field-from-dataframe-while-converting-it-to-json – Shruti Gusain May 16 '20 at 07:08