I have a JSON array of objects in which some entries are repeated at random. I want to remove the repeats based on the "word" key and keep only the first occurrence, as in this example:

{ "word" : "Apple", "meaning" : "First meaning" },
{ "word" : "Ball", "meaning" : " \u090f\u0909\u091f\u093e" },
{ "word" : "Cat", "meaning" : " \u090f\u0909\u091f\u093e" },
{ "word" : "Apple", "meaning" : "Repeated, but has another meaning" },
{ "word" : "Doll", "meaning" : " \u090f\u0909\u091f\u093e" },

I'm a Python beginner and haven't been able to get beyond this attempt so far:

#!/usr/bin/env python
import json

source="/var/www/dictionary/repeated.json"
destination="/var/www/dictionary/corrected.json"

def remove_redundant():

    with open(source, "r") as src:      
        src_object = json.load(src)

        for i in xrange(len(src_object)):

            escape = 1

            for j in xrange(len(src_object)):

                if src_object[j]["word"] == src_object[i]["word"]:

                    # leave the first occurrence
                    if escape == 1:
                        escape = 2
                        continue
                    else:
                        src_object.pop(j)

    # open(destination, "w+").write(json.dumps(src_object, sort_keys=True, indent=4, separators=(',', ': ')))

    src.close()

remove_redundant()

The error I keep getting is `IndexError: list index out of range`, because the length of the list keeps changing while I loop over it. Thanks for any help.

Tom

2 Answers

For reference, here is an example using pop():

a = [{ "word" : "Apple", "meaning" : "First meaning" },
     { "word" : "Ball", "meaning" : " \u090f\u0909\u091f\u093e" },
     { "word" : "Cat", "meaning" : " \u090f\u0909\u091f\u093e" },
     { "word" : "Apple", "meaning" : "Repeated, but has another meaning" },
     { "word" : "Doll", "meaning" : " \u090f\u0909\u091f\u093e" },]

b = list()
keys = set()

while a:
    x = a.pop(0)
    if x['word'] not in keys:
        keys.add(x['word'])
        b.append(x)
a = b
del b
del keys

a now contains:

[{'meaning': 'First meaning', 'word': 'Apple'},
 {'meaning': ' \\u090f\\u0909\\u091f\\u093e', 'word': 'Ball'},
 {'meaning': ' \\u090f\\u0909\\u091f\\u093e', 'word': 'Cat'},
 {'meaning': ' \\u090f\\u0909\\u091f\\u093e', 'word': 'Doll'}]
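
The same idea can also be written without emptying the source list, since `a.pop(0)` has to shift every remaining element on each call. This is only a minimal sketch, assuming every entry is a dict with a "word" key; the function name `dedupe_by_word` and the shortened sample data are made up for illustration:

```python
def dedupe_by_word(entries):
    """Return a new list keeping only the first entry seen for each "word"."""
    seen = set()      # words encountered so far
    unique = []       # first occurrences, in their original order
    for entry in entries:
        word = entry["word"]
        if word not in seen:
            seen.add(word)
            unique.append(entry)
    return unique


a = [
    {"word": "Apple", "meaning": "First meaning"},
    {"word": "Ball", "meaning": "second"},
    {"word": "Apple", "meaning": "Repeated, but has another meaning"},
]
a = dedupe_by_word(a)
```

Because a new list is built instead of removing items from the one being looped over, the indices never go stale, which is what caused the IndexError in the question.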
Kimvais
  • Seems good, but both this and the other solution result in an unnecessary `u` prefixed to each string, like this: [{u'meaning': u' \First result', u'word': u'Apple'}, {u'meaning': u' \u090f\u0909\u091f\u093e', u'word': u'Ball'}, {u'meaning': u'\u0905\u0930\u092c \u0926\u0947\u0936\u092e\u093e \u0932 – Tom Apr 29 '14 at 08:58
  • Nope, that's not unnecessary - it just means that it's unicode. – Kimvais Apr 29 '14 at 11:37

You can simply do this, where `data` is the list of entries loaded from the JSON:

from collections import OrderedDict
d = OrderedDict()
for item in data:
    if item["word"] not in d:
        d[item["word"]] = item

print d.values()

Output

[{'meaning': 'First meaning', 'word': 'Apple'},
 {'meaning': ' \\u090f\\u0909\\u091f\\u093e', 'word': 'Ball'},
 {'meaning': ' \\u090f\\u0909\\u091f\\u093e', 'word': 'Cat'},
 {'meaning': ' \\u090f\\u0909\\u091f\\u093e', 'word': 'Doll'}]
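
To tie this back to the files in the question, the approach could be wrapped roughly as follows. This is a sketch rather than a drop-in fix: it assumes Python 3, assumes the source file holds a JSON array of objects, and passes the two paths as parameters (my own choice) instead of using module-level globals.

```python
import json
from collections import OrderedDict


def remove_redundant(source, destination):
    """Copy the JSON list in `source` to `destination`, dropping repeated words."""
    with open(source, "r", encoding="utf-8") as src:
        data = json.load(src)

    unique = OrderedDict()
    for item in data:
        # setdefault stores the item only if its word has not been seen yet
        unique.setdefault(item["word"], item)

    with open(destination, "w", encoding="utf-8") as dst:
        # ensure_ascii=False writes non-ASCII text as-is instead of \uXXXX escapes
        json.dump(list(unique.values()), dst, ensure_ascii=False,
                  indent=4, sort_keys=True)
```

Calling `remove_redundant("/var/www/dictionary/repeated.json", "/var/www/dictionary/corrected.json")` would then read the first file and write the de-duplicated list to the second.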
thefourtheye
  • Seems good, but both this and the other solution result in an unnecessary `u` prefixed to each string, like this: [{u'meaning': u' \First result', u'word': u'Apple'}, {u'meaning': u' \u090f\u0909\u091f\u093e', u'word': u'Ball'}, {u'meaning': u'\u0905\u0930\u092c \u0926\u0947\u0936\u092e\u093e \u0932.. I guess it has something to do with encoding? – Tom Apr 29 '14 at 08:59
  • @Sushil It means that the string is a unicode string; you don't have to worry about that. If it really bothers you, you can use the `str` function to get rid of it, though in Python 2 that only works for ASCII-only text :) – thefourtheye Apr 29 '14 at 09:01
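
To illustrate the point about the `u` prefix: it belongs to the way Python 2 prints a unicode string, not to the data itself, so it does not show up in what `json.dumps` produces. A small sketch using one entry from the question (under Python 3 no prefix is printed in the first place):

```python
import json

# The JSON text contains \uXXXX escapes; json.loads decodes them to real characters.
items = json.loads('[{"word": "Ball", "meaning": " \\u090f\\u0909\\u091f\\u093e"}]')

# Serialising back to JSON yields plain JSON text with no u prefix.
escaped = json.dumps(items)                       # non-ASCII kept as \uXXXX escapes
readable = json.dumps(items, ensure_ascii=False)  # non-ASCII written as-is
```

Both strings describe the same data; `ensure_ascii=False` only changes whether the Devanagari text is written out directly or as escapes.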