Found a solution, not sure if it's efficient though ;)
Define a function to remove duplicates:
def remove_repeat(dupRDD):
    # Dicts aren't hashable, so convert each one to a sorted tuple of its
    # items, dedupe via set, then rebuild the dicts
    return [dict(tup) for tup in set(tuple(sorted(item.items())) for item in dupRDD)]
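A quick sanity check on a plain list of dicts (the sample records here are made up for illustration):

dups = [{"AAA": 111, "BBB": 222}, {"AAA": 111, "BBB": 222}, {"AAA": 111, "BBB": 333}]
print(remove_repeat(dups))
# [{'AAA': 111, 'BBB': 222}, {'AAA': 111, 'BBB': 333}]  (set order is arbitrary)

Then key each record on the chosen field and group: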
# Assuming 'AAA' is main key (can pick any)
sampleRDD_2 = sampleRDD.map(lambda snap: (snap['AAA'], snap)).groupByKey()
This creates an RDD like [(111, <pyspark.resultiterable.ResultIterable object>)]. The ResultIterable values can be materialized with list() and passed to the remove_repeat function:
sampleRDD_2.map(lambda x: (x[0], remove_repeat(list(x[1])))).collect()
This returns the deduped list of dicts at the key level: [(111, [{"AAA": 111, "BBB": 222, "CCC": 333}])]
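Putting it all together, here is a minimal end-to-end sketch (the sample records and the SparkContext setup are my own assumptions, not from the original data):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Made-up sample: two exact duplicates under key 111
sampleRDD = sc.parallelize([
    {"AAA": 111, "BBB": 222, "CCC": 333},
    {"AAA": 111, "BBB": 222, "CCC": 333},
    {"AAA": 444, "BBB": 555, "CCC": 666},
])

result = (sampleRDD
          .map(lambda snap: (snap['AAA'], snap))
          .groupByKey()
          .map(lambda x: (x[0], remove_repeat(list(x[1]))))
          .collect())
# [(111, [{'AAA': 111, 'BBB': 222, 'CCC': 333}]),
#  (444, [{'AAA': 444, 'BBB': 555, 'CCC': 666}])]

The same step could also be written with .mapValues(lambda vals: remove_repeat(list(vals))). Note that a plain sampleRDD.distinct() isn't an option here, since dicts aren't hashable.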