
I have a JSON file that contains some duplicates, and I am looking for a way to remove them. Here are the beginnings of two example entries:

"date": "May 16, 2012 Wednesday", "body": "THE future of one of Scotland's most important listed buildings .... World Monuments Fund. o See a picture gallery of Mavisbank House at scotsman.com/scotland ", "title": "Rescue deal to bring Adam mansion back from brink"

"date": "May 16, 2012 Wednesday", "body": "The future of one of Scotland's most important listed buildings .... World Monuments Fund.", "title": "Rescue deal to bring Adam mansion back from brink"

I have cut the middle of each text because it is long and matches exactly in both entries. The texts are almost identical: they differ only at the beginning (THE vs. The) and at the end, where the first entry has an extra sentence (o See a picture gallery of Mavisbank House at scotsman.com/scotland). I would like a way to (I) find the duplicates and (II) remove all but one copy of each (note that an entry can have more than one duplicate). I have just started programming in Python and am not sure how to approach this problem. Any help is really appreciated!

Kind regards!

Economist_Ayahuasca
  • 6
    "As we can see the text matches almost 100%" - but what **exactly** constitutes a duplicate? – Tim Mar 16 '16 at 13:07
  • 1
    If these lines can be _exactly the same_, you could use `set` to eliminate duplicates. _Partly_ equal strings are still considered different. You can try to calculate Hamming distance to see _how 'much' different_ the strings are and then decide whether to delete them or not. – ForceBru Mar 16 '16 at 13:07

1 Answer


I think it would be better to first convert your JSON string into a model object.
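A minimal sketch of that conversion, using only the standard library. It assumes the file holds a JSON array of objects with the keys `date`, `body` and `title`, as in the question's excerpts; the `Article` class and `load_articles` function are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical model object mirroring the keys shown in the question."""
    date: str
    body: str
    title: str


def load_articles(json_text):
    """Parse a JSON array of objects into a list of Article instances."""
    return [
        Article(date=rec["date"], body=rec["body"], title=rec["title"])
        for rec in json.loads(json_text)
    ]


# Assumed input shape: a top-level JSON array of objects.
sample = (
    '[{"date": "May 16, 2012 Wednesday", '
    '"body": "THE future of one of Scotland\'s most important listed buildings", '
    '"title": "Rescue deal to bring Adam mansion back from brink"}]'
)
articles = load_articles(sample)
print(articles[0].title)
```

For a file on disk, `json.load(open_file)` could replace `json.loads(json_text)`. If the real file has a different top-level structure, the list comprehension would need adjusting.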

After that, you can iterate over the elements and remove the duplicates (at whatever level of similarity you choose). You can ignore case when comparing individual elements.
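One possible way to do this iteration when the bodies are only nearly equal, as in the question's example, is a similarity ratio from the standard library's `difflib.SequenceMatcher`. This is a swapped-in technique, not something the question or comments prescribe. The sketch assumes the records are dictionaries with a `body` key; the function names and the 0.85 threshold are arbitrary illustrations that would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    # autojunk=False avoids difflib's heuristic that ignores very frequent
    # characters in long strings, which can distort the ratio.
    return SequenceMatcher(None, a.lower(), b.lower(), autojunk=False).ratio()


def drop_near_duplicates(records, threshold=0.85):
    """Keep the first record of each group of near-identical bodies."""
    kept = []
    for rec in records:
        # Compare against every record kept so far (quadratic in the worst case).
        if not any(similarity(rec["body"], k["body"]) >= threshold for k in kept):
            kept.append(rec)
    return kept


records = [
    {"body": "THE future of one of Scotland's most important listed buildings "
             "has been secured. o See a picture gallery"},
    {"body": "The future of one of Scotland's most important listed buildings "
             "has been secured."},
    {"body": "Council approves new tram line through the city centre."},
]
unique = drop_near_duplicates(records)
print(len(unique))
```

Because every record is compared with every kept record, this gets slow on large files. Comparing only records that share the same date or title first is one way to reduce the work.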

Alternatively, you can convert each body/title element to a consistent case and add it to a set as you iterate, using the set to check for duplicates, as @ForceBru pointed out in the comments.
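The set-based check could look roughly like the sketch below. It assumes the records are dictionaries with `title` and `body` keys, and the helper names are hypothetical. It only catches entries that differ in case or whitespace; bodies with an extra trailing sentence, like the first example in the question, would still count as different.

```python
def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def drop_exact_duplicates(records):
    """Keep the first record for each normalized (title, body) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (normalize(rec["title"]), normalize(rec["body"]))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


records = [
    {"title": "Rescue deal", "body": "THE future of one of Scotland's buildings."},
    {"title": "Rescue deal", "body": "The future of  one of Scotland's buildings. "},
    {"title": "Other story", "body": "Something else entirely."},
]
unique = drop_exact_duplicates(records)
print(len(unique))
```

If the date and title together identify an article in your data, keying on those two fields instead of the body is another option, though that is an assumption about the data rather than something the question confirms.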

The following link should point you in the right direction for JSON-to-object conversion:

Is there a python json library can convert json to model objects, similar to google-gson?

Hope this helps.

Community