
My file contains duplicate records, which I have loaded as a list of dictionaries. Here is the content of my sampleRDD variable, which is a pyspark.rdd.RDD object:

[{"A": 111, "B": 222, "C": 333}
,{"A": 111, "B": 222, "C": 333}]

I would like to keep only one copy of each record, as follows:

[{"A": 111, "B": 222, "C": 333}]
FunkySayu
Prasanna GR
  • Try the `distinct` method of `RDD` – Ton Torres Jan 18 '16 at 02:00
  • I believe that, because the elements in my list are dictionaries, the distinct method returns the following error: in mergeValues d[k] = comb(d[k], v) if k in d else creator(v) TypeError: unhashable type: 'dict' – Prasanna GR Jan 18 '16 at 02:14
  • To add more context: {"A": 111, "B": 222, "C": 333} comes from a JSON file; in this case it just appears twice :( – Prasanna GR Jan 18 '16 at 02:22

2 Answers


PySpark's distinct() fails on an RDD of dictionaries because dicts are unhashable. One way around it is to convert each dict to a tuple of its items first:

temp = sc.parallelize([{"A": 111, "B": 222, "C": 333},
                       {"A": 111, "B": 222, "C": 333}])

# Sort the items so equal dicts map to the same tuple regardless of key order
print(temp.map(lambda x: tuple(sorted(x.items()))).distinct().collect())
    >>[(('A', 111), ('B', 222), ('C', 333))]

Or if you need it back in dictionary form:

print(temp.map(lambda x: tuple(sorted(x.items()))).distinct().map(dict).collect())
    >>[{'A': 111, 'B': 222, 'C': 333}]
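To see why the tuple conversion helps, here is a minimal pure-Python sketch that runs without Spark. The helper name to_hashable is made up for illustration, and it assumes flat dicts whose values are hashable and whose keys can be sorted against each other:

```python
def to_hashable(record):
    """Return an order-independent, hashable key for a flat dict (hypothetical helper)."""
    return tuple(sorted(record.items()))

records = [
    {"A": 111, "B": 222, "C": 333},
    {"C": 333, "B": 222, "A": 111},  # same content, different key order
]

# Dicts cannot go into a set directly, which is what distinct() runs into.
try:
    set(records)
    dict_is_hashable = True
except TypeError:
    dict_is_hashable = False

# Deduplicate via the hashable keys, then rebuild the dicts.
unique = [dict(key) for key in {to_hashable(r) for r in records}]

print(dict_is_hashable)  # False
print(unique)            # [{'A': 111, 'B': 222, 'C': 333}]
```

Sorting the items matters: without it, two equal dicts built in a different key order could produce different tuples and both survive the deduplication.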
user3689574

I found a solution, though I'm not sure whether it's efficient ;)

Define function to remove duplicates:

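def remove_repeat(dup_records):
    # dup_records is a plain list of dicts; sort the items so equal dicts yield the same tuple
    return [dict(tup) for tup in set(tuple(sorted(item.items())) for item in dup_records)]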

# Assuming 'A' is the main key (any key can be picked)
sampleRDD_2 = sampleRDD.map(lambda snap: (snap['A'], snap)).groupByKey()

This creates an RDD like [(111, <pyspark.resultiterable.ResultIterable object>)]

The ResultIterable values can be converted with list() and passed to the remove_repeat function:

sampleRDD_2.map(lambda x: (x[0], remove_repeat(list(x[1])))).collect()

This returns a deduplicated list of dicts for each key: [(111,[{"A": 111, "B": 222, "C": 333}])]
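The group-then-dedupe idea can be checked locally without a Spark context. The sketch below is only an approximation of what groupByKey does: dedupe_by_key is a hypothetical helper, the third sample record is invented for illustration, and remove_repeat is redefined here (with sorted items) so the block runs on its own:

```python
from collections import defaultdict

def remove_repeat(dup_records):
    """Drop duplicate flat dicts from a list by hashing their sorted items."""
    return [dict(tup) for tup in {tuple(sorted(item.items())) for item in dup_records}]

def dedupe_by_key(records, key):
    """Local stand-in for map + groupByKey + remove_repeat (hypothetical helper)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)  # group records sharing the same key value
    return [(k, remove_repeat(v)) for k, v in groups.items()]

sample = [
    {"A": 111, "B": 222, "C": 333},
    {"A": 111, "B": 222, "C": 333},  # exact duplicate
    {"A": 999, "B": 222, "C": 333},  # invented extra record with a different key
]

result = dedupe_by_key(sample, "A")
print(result)
```

Grouping first only removes duplicates within each key, so every group's records are held in memory at once; for large groups the tuple-plus-distinct approach is likely the lighter option.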

FunkySayu
Prasanna GR