I have a list of JSON objects (around 30,000) and would like to remove duplicates from them. I consider them a duplicate as long as ModuleCode is the same. Below is an example of one object.

[{"AveragePoints": "4207", 
"ModuleTitle": "Tool Engineering", 
"Semester": "2", 
"ModuleCode": "ME4261", 
"StudentAcctType": "P", 
"AcadYear": "2013/2014"}]

I plan to do this by hashing, following the example given here. After some experimentation I'm still unsure how to correctly use the overloaded methods __eq__ and __hash__. Do I create a new class that contains the two methods?

Below is my attempt at a solution. It raises NameError: name 'obj' is not defined, which I suspect is caused by my incorrect use of class.

import json

json_data = open('small.json')
data = json.load(json_data)

class Module(obj):
    def __eq__(self, other):
        return self.ModuleCode == other.ModuleCode

    def __hash__(self):
        return hash(('ModuleCode', self.ModuleCode))

hashtable = {} #python's dict is implemented as a hashtable

for item in data:
    cur = Module(item)
    if hashtable[hash(cur)] == item.ModuleCode:
        print "duplicate" + item.ModuleCode
    else:
        hashtable[hash(cur)] = item.ModuleCode


json_data.close()
Yan Yi

1 Answer

The problem is that you are referring to obj, which doesn't exist, instead of object. Also, you don't define Module.__init__, so the ModuleCode attribute is never initialised. Here is one way you could do it:

class Module(object):

    def __init__(self, ModuleCode, **data):
        self.ModuleCode = ModuleCode
        self.data = data

    def __eq__(self, other):
        return self.ModuleCode == other.ModuleCode

    def __hash__(self):
        return hash(('ModuleCode', self.ModuleCode))

Then when you create the instance:

cur = Module(**item)

(If the syntax is unfamiliar, see e.g. What does ** (double star) and * (star) do for parameters?)
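
As a rough sketch of what the ** unpacking does here: the key named ModuleCode is matched to the named parameter, and every other key is collected into data. The record is the one from the question, and the class is repeated only so the snippet runs on its own.

```python
class Module(object):

    def __init__(self, ModuleCode, **data):
        self.ModuleCode = ModuleCode  # the named parameter captures this key
        self.data = data              # all remaining keys are collected here

    def __eq__(self, other):
        return self.ModuleCode == other.ModuleCode

    def __hash__(self):
        return hash(('ModuleCode', self.ModuleCode))


# The example record from the question
item = {"AveragePoints": "4207",
        "ModuleTitle": "Tool Engineering",
        "Semester": "2",
        "ModuleCode": "ME4261",
        "StudentAcctType": "P",
        "AcadYear": "2013/2014"}

cur = Module(**item)
print(cur.ModuleCode)    # ME4261
print(sorted(cur.data))  # the five other keys; 'ModuleCode' is not among them
```

A record without a ModuleCode key would raise a TypeError at this call, because the named parameter has no default.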


Also, note that you can use a set rather than a dict for removing duplicates; storing the ModuleCode as the value is duplicating information (as that's the whole point of implementing __hash__ and __eq__):

unique = set()

for item in data:
    cur = Module(**item)
    if cur in unique:
        print "duplicate" + cur.ModuleCode
    else:
        unique.add(cur)
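
If the goal is to end up with the de-duplicated records rather than just report duplicates, the same loop can collect the first record seen for each ModuleCode. This is only a sketch: the inline JSON string is a hypothetical stand-in for small.json, the XX0000 record is made up for illustration, and the names seen and deduplicated are my own.

```python
import json


class Module(object):

    def __init__(self, ModuleCode, **data):
        self.ModuleCode = ModuleCode
        self.data = data

    def __eq__(self, other):
        return self.ModuleCode == other.ModuleCode

    def __hash__(self):
        return hash(('ModuleCode', self.ModuleCode))


# Hypothetical sample standing in for the contents of small.json
raw = '''[
  {"ModuleCode": "ME4261", "ModuleTitle": "Tool Engineering", "Semester": "2"},
  {"ModuleCode": "ME4261", "ModuleTitle": "Tool Engineering", "Semester": "1"},
  {"ModuleCode": "XX0000", "ModuleTitle": "Placeholder Module", "Semester": "1"}
]'''
data = json.loads(raw)

seen = set()        # Module instances, compared by ModuleCode only
deduplicated = []   # first record encountered for each ModuleCode

for item in data:
    cur = Module(**item)
    if cur not in seen:
        seen.add(cur)
        deduplicated.append(item)

print([item["ModuleCode"] for item in deduplicated])  # ['ME4261', 'XX0000']
```

Membership tests on a set use __hash__ first and then __eq__, so each lookup takes roughly constant time and the loop stays cheap for 30,000 records.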
jonrsharpe
  • Man, never thought about naming arguments of interest like this to pull them out. nice! I've always done `data.pop("ModuleCode")` or something similar. – GP89 Dec 17 '14 at 16:57
  • @GP89 that's a more fault-tolerant way to do it (and allows you to customise the feedback more), at the cost of a few extra lines of code. – jonrsharpe Dec 17 '14 at 17:11
  • @jonrsharpe thanks. May I know why we have to use the keyword object? I thought python is a dynamically typed language, so the interpreter will determine the type of the thing when it is passed in. That was why I initially used an arbitrary word "obj" to represent one object to be given. – Yan Yi Dec 17 '14 at 17:30
  • @YanYi the "argument" to the class name is **not** the same as the object that is passed when you create an instance; it's the superclass to inherit from. Python is dynamic, but isn't (quite) magic, you have to specify what will be passed and how it should be dealt with. Have a look at e.g. https://docs.python.org/2/tutorial/classes.html – jonrsharpe Dec 17 '14 at 17:33
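
The two points from the comments can be sketched together, as an illustration rather than the answer's own code. The name in parentheses after the class name is the superclass, while whatever is passed at instantiation is declared in __init__. This variant takes the whole record and uses the data.pop("ModuleCode") approach GP89 mentions; the parameter name record and the error message are my own choices.

```python
class Module(object):  # `object` is the superclass, not a constructor argument

    def __init__(self, record):  # `record` is what gets passed to Module(...)
        data = dict(record)  # copy, so the caller's dict is left untouched
        try:
            self.ModuleCode = data.pop("ModuleCode")
        except KeyError:
            # Custom feedback for records that lack the key
            raise ValueError("record has no ModuleCode: %r" % (record,))
        self.data = data

    def __eq__(self, other):
        return self.ModuleCode == other.ModuleCode

    def __hash__(self):
        return hash(('ModuleCode', self.ModuleCode))


cur = Module({"ModuleCode": "ME4261", "Semester": "2"})
print(cur.ModuleCode)              # ME4261
print(issubclass(Module, object))  # True
```

With this constructor the call site becomes Module(item) rather than Module(**item).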