I have a list of around 30,000 JSON objects and would like to remove the duplicates. I consider two objects duplicates if their ModuleCode is the same. Below is an example of one object.
[{"AveragePoints": "4207",
  "ModuleTitle": "Tool Engineering",
  "Semester": "2",
  "ModuleCode": "ME4261",
  "StudentAcctType": "P",
  "AcadYear": "2013/2014"}]
I plan to do this by hashing, following the example given here. After some experimentation, I'm still unsure how to correctly use the overridden methods __eq__ and __hash__. Do I create a new class and define the two methods inside it?
Below is my attempt at a solution. It raises NameError: name 'obj' is not defined, which I suspect is caused by my incorrect usage of class.
import json

json_data = open('small.json')
data = json.load(json_data)

class Module(obj):
    def __eq__(self, other):
        return self.ModuleCode == other.ModuleCode
    def __hash__(self):
        return hash(('ModuleCode', self.ModuleCode))

hashtable = {}  # python's dict is implemented as a hashtable
for item in data:
    cur = Module(item)
    if hashtable[hash(cur)] == item.ModuleCode:
        print "duplicate" + item.ModuleCode
    else:
        hashtable[hash(cur)] = item.ModuleCode

json_data.close()
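
For comparison, here is a minimal sketch of how the class-based approach could look. It is one possible arrangement under stated assumptions, not a definitive answer. It assumes that the base class was meant to be the built-in object rather than obj, that each parsed item is a plain dict (so the field is read as item["ModuleCode"], not item.ModuleCode), and that the first occurrence of each ModuleCode is the one to keep. The helper name remove_duplicates, the record attribute, and the inline sample rows are hypothetical stand-ins for the contents of small.json.

```python
import json


class Module(object):
    """Wraps one parsed JSON record; equality is decided by ModuleCode alone."""

    def __init__(self, record):
        self.record = record                    # the original dict from json.load
        self.ModuleCode = record["ModuleCode"]  # dict lookup, not attribute access

    def __eq__(self, other):
        return isinstance(other, Module) and self.ModuleCode == other.ModuleCode

    def __hash__(self):
        # Equal objects must hash equally, so hash the same field __eq__ compares.
        return hash(self.ModuleCode)


def remove_duplicates(records):
    """Return the first record seen for each ModuleCode, in original order."""
    seen = set()      # the set calls __hash__ and __eq__ for us
    unique = []
    for record in records:
        module = Module(record)
        if module in seen:
            continue  # same ModuleCode already kept
        seen.add(module)
        unique.append(record)
    return unique


# Hypothetical sample standing in for small.json; rows 2 and 3 are made up.
sample = json.loads("""[
  {"ModuleCode": "ME4261", "ModuleTitle": "Tool Engineering", "Semester": "2"},
  {"ModuleCode": "ME4261", "ModuleTitle": "Tool Engineering", "Semester": "1"},
  {"ModuleCode": "XX0000", "ModuleTitle": "Placeholder", "Semester": "1"}
]""")
deduped = remove_duplicates(sample)
print([m["ModuleCode"] for m in deduped])
```

Note that the set does the hashing itself, so there is no need to call hash() by hand or to index a dict with the hash value (which raises KeyError the first time a key is looked up). If the wrapper class is not needed for anything else, a plain dict keyed by item["ModuleCode"] would likely achieve the same result with less code.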