I have a large CSV file with data in the following format:
cityId1,name,address,.......,zip
cityId2,name,address,.......,zip
cityId1,name,address,.......,zip
........
cityIdN,name,address,.......,zip
I am performing the following operations on the above CSV file:
Group by cityId as the key, with the list of resources as the value:
df1.groupBy($"cityId").agg(collect_list(struct(cols.head, cols.tail: _*)) as "resources")
Convert the result to a JSON RDD:
val jsonDataRdd2 = df2.toJSON.rdd
Iterate through each partition and upload to S3 per key (a rough sketch of the whole flow follows below).
- I cannot use the DataFrame partitionBy write because of business-logic constraints (how other services read from S3).
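For context, here is a self-contained sketch of the flow I have in mind. The input path, output bucket, and object-key layout are placeholders, and the per-key upload uses the AWS SDK v1 putObject call as just one possible approach, not necessarily the final implementation:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct, to_json}
import com.amazonaws.services.s3.AmazonS3ClientBuilder

object UploadByCityId {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("upload-by-cityId").getOrCreate()

    // Step 1: read the CSV and group all resource columns per cityId.
    val df1  = spark.read.option("header", "true").csv("s3://my-bucket/input/cities.csv") // placeholder path
    val cols = df1.columns.filter(_ != "cityId").map(col)

    val df2 = df1
      .groupBy(col("cityId"))
      .agg(collect_list(struct(cols.head, cols.tail: _*)).as("resources"))

    // Steps 2-3: keep the key next to its JSON payload so each partition
    // can write one S3 object per cityId without re-parsing the JSON.
    val keyed = df2.select(
      col("cityId").cast("string"),
      to_json(struct(col("cityId"), col("resources"))).as("json")
    )

    keyed.rdd.foreachPartition { rows =>
      // Build one S3 client per partition; clients are not serializable from the driver.
      val s3 = AmazonS3ClientBuilder.defaultClient()
      rows.foreach { row =>
        val cityId = row.getString(0)
        val json   = row.getString(1)
        // Placeholder bucket/key layout; the real layout is dictated by the downstream services.
        s3.putObject("my-output-bucket", s"resources/cityId=${cityId}.json", json)
      }
    }

    spark.stop()
  }
}
```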
My questions:
- What is the default size of a Spark partition?
- Suppose the default partition size is X MB and the DataFrame contains one large record whose key holds Y MB of data (Y > X). What would happen in this scenario?
- Do I need to worry about the same key ending up in different partitions in that case?
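To make the partition question concrete, this is a small diagnostic I assume I could run on the jsonDataRdd2 from step 2 above, to see how many partitions there are and how many bytes of records land in each one:

```scala
// How many partitions did the grouped/JSON RDD end up with?
println(s"partitions: ${jsonDataRdd2.getNumPartitions}")

// Approximate per-partition size by summing the UTF-8 byte length of each JSON record.
val sizesPerPartition = jsonDataRdd2
  .mapPartitionsWithIndex { (idx, records) =>
    val bytes = records.map(_.getBytes("UTF-8").length.toLong).sum
    Iterator((idx, bytes))
  }
  .collect()

sizesPerPartition.foreach { case (idx, bytes) =>
  println(f"partition $idx%4d -> ${bytes / (1024.0 * 1024.0)}%.2f MB")
}
```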