
I have the following JSON:

{
    "FileResults": [
      {
        "FileName": "gtg.0.wav",
        "FileUrl": null,
        "Results": [
          {
            "Status": "Success",
            "ChannelNumber": null,
            "SpeakerId": null,
            "Offset": 90200000,
            "Duration": 25600000,
            "NBest": [
              {
                "Confidence": 0.9415368,
                "Lexical": "",
                "ITN": "",
                "MaskedITN": "",
                "Display": ".",
                "Sentiment": null,
                "Words": [
                  {
                    "Word": "ask",
                    "Offset": 944400000,
                    "Duration": 3500000
                  },
                  {
                    "Word": "everybody",
                    "Offset": 94000000,
                    "Duration": 4400000
                  },
                  {
                    "Word": "to",
                    "Offset": 98400000,
                    "Duration": 1200000
                  },
                  {
                    "Word": "please",
                    "Offset": 99600000,
                    "Duration": 3000000
                  },
                  {
                    "Word": "take",
                    "Offset": 102600000,
                    "Duration": 2400000
                  },
                  {
                    "Word": "their",
                    "Offset": 105000000,
                    "Duration": 2400000
                  },
                  {
                    "Word": "seats",
                    "Offset": 107400000,
                    "Duration": 8200000
                  }
                ]
              }
            ]
          },
          {
            "Status": "Success",
            "ChannelNumber": null,
            "SpeakerId": null,
            "Offset": 90200000,
            "Duration": 25600000,
            "NBest": [
              {
                "Confidence": 0.9415368,
                "Lexical": "",
                "ITN": "",
                "MaskedITN": "",
                "Display": ".",
                "Sentiment": null,
                "Words": [
                  {
                    "Word": "ask",
                    "Offset": 90500000,
                    "Duration": 3500000
                  },
                  {
                    "Word": "everybody",
                    "Offset": 94000000,
                    "Duration": 4400000
                  },
                  {
                    "Word": "to",
                    "Offset": 98400000,
                    "Duration": 1200000
                  },
                  {
                    "Word": "please",
                    "Offset": 99600000,
                    "Duration": 3000000
                  },
                  {
                    "Word": "take",
                    "Offset": 102600000,
                    "Duration": 2400000
                  },
                  {
                    "Word": "their",
                    "Offset": 105000000,
                    "Duration": 2400000
                  },
                  {
                    "Word": "seats",
                    "Offset": 107400000,
                    "Duration": 8200000
                  }
                ]
              }
            ]
          },
          {
            "Status": "Success",
            "ChannelNumber": null,
            "SpeakerId": null,
            "Offset": 169400000,
            "Duration": 157500000,
            "NBest": [
              {
                "Confidence": 0.944001734,
                "Lexical": "",
                "ITN": "",
                "MaskedITN": "",
                "Display": "",
                "Sentiment": null,
                "Words": [
                  {
                    "Word": "welcome",
                    "Offset": 169700000,
                    "Duration": 4500000
                  },
                  {
                    "Word": "to",
                    "Offset": 174200000,
                    "Duration": 2600000
                  },
                  {
                    "Word": "the",
                    "Offset": 176800000,
                    "Duration": 8600000
                  },
                  {
                    "Word": "scheduled",
                    "Offset": 186500000,
                    "Duration": 7900000
                  },
                  {
                    "Word": "special",
                    "Offset": 194400000,
                    "Duration": 6000000
                  },
                  {
                    "Word": "budget",
                    "Offset": 200400000,
                    "Duration": 4400000
                  },
                  {
                    "Word": "hearings",
                    "Offset": 204800000,
                    "Duration": 6400000
                  },
                  {
                    "Word": "meeting",
                    "Offset": 211400000,
                    "Duration": 4800000
                  },
                  {
                    "Word": "of",
                    "Offset": 216200000,
                    "Duration": 1600000
                  },
                  {
                    "Word": "the",
                    "Offset": 217800000,
                    "Duration": 1300000
                  },
                  {
                    "Word": "los",
                    "Offset": 219100000,
                    "Duration": 2300000
                  },
                  {
                    "Word": "lm",
                    "Offset": 221400000,
                    "Duration": 3600000
                  },
                  {
                    "Word": "mk",
                    "Offset": 225000000,
                    "Duration": 5500000
                  },
                  {
                    "Word": "board",
                    "Offset": 231800000,
                    "Duration": 4600000
                  },
                  {
                    "Word": "of",
                    "Offset": 236400000,
                    "Duration": 1000000
                  },
                  {
                    "Word": "supervisors",
                    "Offset": 237400000,
                    "Duration": 9200000
                  },
                  {
                    "Word": "seems",
                    "Offset": 246600000,
                    "Duration": 3000000
                  },
                  {
                    "Word": "like",
                    "Offset": 249600000,
                    "Duration": 2400000
                  },
                  {
                    "Word": "we",
                    "Offset": 252000000,
                    "Duration": 1400000
                  },
                  {
                    "Word": "were",
                    "Offset": 253400000,
                    "Duration": 1600000
                  },
                  {
                    "Word": "just",
                    "Offset": 255000000,
                    "Duration": 3400000
                  },
                  {
                    "Word": "here",
                    "Offset": 258400000,
                    "Duration": 5500000
                  },
                  {
                    "Word": "but",
                    "Offset": 270200000,
                    "Duration": 4000000
                  },
                  {
                    "Word": "no",
                    "Offset": 274200000,
                    "Duration": 3000000
                  },
                  {
                    "Word": "it's",
                    "Offset": 277200000,
                    "Duration": 1600000
                  },
                  {
                    "Word": "wednesday",
                    "Offset": 278800000,
                    "Duration": 6700000
                  },
                  {
                    "Word": "may",
                    "Offset": 288600000,
                    "Duration": 3800000
                  },
                  {
                    "Word": "sixteenth",
                    "Offset": 292400000,
                    "Duration": 8800000
                  },
                  {
                    "Word": "full",
                    "Offset": 307200000,
                    "Duration": 4600000
                  },
                  {
                    "Word": "complement",
                    "Offset": 311800000,
                    "Duration": 6600000
                  },
                  {
                    "Word": "not",
                    "Offset": 318400000,
                    "Duration": 3000000
                  },
                  {
                    "Word": "quite",
                    "Offset": 321400000,
                    "Duration": 5300000
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
}

I would like to remove only the duplicate "Word" entries from this JSON.

For instance, "Word": "ask" appears twice; I would like to keep the first occurrence of "Word": "ask" and remove the second. Each entry looks like this:

{
"Word": "welcome",
"Offset": 169700000,
"Duration": 4500000
},

I have tried various deduplication techniques, but none of them helped.

Here is my sample code:

import json

with open('example1.json') as json_data:
    obj = json.load(json_data)
    #attr = lambda x: x['hdfs:batchprocessing'][0]['application']['app_id']+x['hdfs:batchprocessing'][0]['application']['app_id']
    el_set = set()
    el_list = []
    for el in obj:
        if str(el) not in el_set:
            el_set.add(str(el))
            el_list.append(el)

open("updated_structure.json", "w").write(
    json.dumps(el_list, sort_keys=True, indent=4, separators=(',', ': '))
)

The expected result is JSON without any duplicate values for "Word".

1 Answer


Here is a solution (`data` is the data structure from the post).

The code removes the entries with duplicate words from `data`:

import copy
import pprint

data = {
    "FileResults": [
        {
            "FileName": "gtg.0.wav",
            "FileUrl": None,
            "Results": [
                {
                    "Status": "Success",
                    "ChannelNumber": None,
                    "SpeakerId": None,
                    "Offset": 90200000,
                    "Duration": 25600000,
                    "NBest": [
                        {
                            "Confidence": 0.9415368,
                            "Lexical": "",
                            "ITN": "",
                            "MaskedITN": "",
                            "Display": ".",
                            "Sentiment": None,
                            "Words": [
                                {
                                    "Word": "ask",
                                    "Offset": 944400000,
                                    "Duration": 3500000
                                },
                                {
                                    "Word": "everybody",
                                    "Offset": 94000000,
                                    "Duration": 4400000
                                },
                                {
                                    "Word": "to",
                                    "Offset": 98400000,
                                    "Duration": 1200000
                                },
                                {
                                    "Word": "please",
                                    "Offset": 99600000,
                                    "Duration": 3000000
                                },
                                {
                                    "Word": "take",
                                    "Offset": 102600000,
                                    "Duration": 2400000
                                },
                                {
                                    "Word": "their",
                                    "Offset": 105000000,
                                    "Duration": 2400000
                                },
                                {
                                    "Word": "seats",
                                    "Offset": 107400000,
                                    "Duration": 8200000
                                }
                            ]
                        }
                    ]
                },
                {
                    "Status": "Success",
                    "ChannelNumber": None,
                    "SpeakerId": None,
                    "Offset": 90200000,
                    "Duration": 25600000,
                    "NBest": [
                        {
                            "Confidence": 0.9415368,
                            "Lexical": "",
                            "ITN": "",
                            "MaskedITN": "",
                            "Display": ".",
                            "Sentiment": None,
                            "Words": [
                                {
                                    "Word": "ask",
                                    "Offset": 90500000,
                                    "Duration": 3500000
                                },
                                {
                                    "Word": "everybody",
                                    "Offset": 94000000,
                                    "Duration": 4400000
                                },
                                {
                                    "Word": "to",
                                    "Offset": 98400000,
                                    "Duration": 1200000
                                },
                                {
                                    "Word": "please",
                                    "Offset": 99600000,
                                    "Duration": 3000000
                                },
                                {
                                    "Word": "take",
                                    "Offset": 102600000,
                                    "Duration": 2400000
                                },
                                {
                                    "Word": "their",
                                    "Offset": 105000000,
                                    "Duration": 2400000
                                },
                                {
                                    "Word": "seats",
                                    "Offset": 107400000,
                                    "Duration": 8200000
                                }
                            ]
                        }
                    ]
                },
                {
                    "Status": "Success",
                    "ChannelNumber": None,
                    "SpeakerId": None,
                    "Offset": 169400000,
                    "Duration": 157500000,
                    "NBest": [
                        {
                            "Confidence": 0.944001734,
                            "Lexical": "",
                            "ITN": "",
                            "MaskedITN": "",
                            "Display": "",
                            "Sentiment": None,
                            "Words": [
                                {
                                    "Word": "welcome",
                                    "Offset": 169700000,
                                    "Duration": 4500000
                                },
                                {
                                    "Word": "to",
                                    "Offset": 174200000,
                                    "Duration": 2600000
                                },
                                {
                                    "Word": "the",
                                    "Offset": 176800000,
                                    "Duration": 8600000
                                },
                                {
                                    "Word": "scheduled",
                                    "Offset": 186500000,
                                    "Duration": 7900000
                                },
                                {
                                    "Word": "special",
                                    "Offset": 194400000,
                                    "Duration": 6000000
                                },
                                {
                                    "Word": "budget",
                                    "Offset": 200400000,
                                    "Duration": 4400000
                                },
                                {
                                    "Word": "hearings",
                                    "Offset": 204800000,
                                    "Duration": 6400000
                                },
                                {
                                    "Word": "meeting",
                                    "Offset": 211400000,
                                    "Duration": 4800000
                                },
                                {
                                    "Word": "of",
                                    "Offset": 216200000,
                                    "Duration": 1600000
                                },
                                {
                                    "Word": "the",
                                    "Offset": 217800000,
                                    "Duration": 1300000
                                },
                                {
                                    "Word": "los",
                                    "Offset": 219100000,
                                    "Duration": 2300000
                                },
                                {
                                    "Word": "lm",
                                    "Offset": 221400000,
                                    "Duration": 3600000
                                },
                                {
                                    "Word": "mk",
                                    "Offset": 225000000,
                                    "Duration": 5500000
                                },
                                {
                                    "Word": "board",
                                    "Offset": 231800000,
                                    "Duration": 4600000
                                },
                                {
                                    "Word": "of",
                                    "Offset": 236400000,
                                    "Duration": 1000000
                                },
                                {
                                    "Word": "supervisors",
                                    "Offset": 237400000,
                                    "Duration": 9200000
                                },
                                {
                                    "Word": "seems",
                                    "Offset": 246600000,
                                    "Duration": 3000000
                                },
                                {
                                    "Word": "like",
                                    "Offset": 249600000,
                                    "Duration": 2400000
                                },
                                {
                                    "Word": "we",
                                    "Offset": 252000000,
                                    "Duration": 1400000
                                },
                                {
                                    "Word": "were",
                                    "Offset": 253400000,
                                    "Duration": 1600000
                                },
                                {
                                    "Word": "just",
                                    "Offset": 255000000,
                                    "Duration": 3400000
                                },
                                {
                                    "Word": "here",
                                    "Offset": 258400000,
                                    "Duration": 5500000
                                },
                                {
                                    "Word": "but",
                                    "Offset": 270200000,
                                    "Duration": 4000000
                                },
                                {
                                    "Word": "no",
                                    "Offset": 274200000,
                                    "Duration": 3000000
                                },
                                {
                                    "Word": "it's",
                                    "Offset": 277200000,
                                    "Duration": 1600000
                                },
                                {
                                    "Word": "wednesday",
                                    "Offset": 278800000,
                                    "Duration": 6700000
                                },
                                {
                                    "Word": "may",
                                    "Offset": 288600000,
                                    "Duration": 3800000
                                },
                                {
                                    "Word": "sixteenth",
                                    "Offset": 292400000,
                                    "Duration": 8800000
                                },
                                {
                                    "Word": "full",
                                    "Offset": 307200000,
                                    "Duration": 4600000
                                },
                                {
                                    "Word": "complement",
                                    "Offset": 311800000,
                                    "Duration": 6600000
                                },
                                {
                                    "Word": "not",
                                    "Offset": 318400000,
                                    "Duration": 3000000
                                },
                                {
                                    "Word": "quite",
                                    "Offset": 321400000,
                                    "Duration": 5300000
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
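
# Words seen so far. The set is shared across all results, so a word
# is kept only the first time it appears anywhere in the data.
words_set = set()
for entry in data['FileResults']:
    for result in entry['Results']:
        for nbsets_dict in result['NBest']:
            # Delete from a copy, not from the list being iterated.
            clone = copy.deepcopy(nbsets_dict['Words'])
            tmp = []  # indexes of the duplicate entries to remove
            for idx, words in enumerate(nbsets_dict['Words']):
                if words['Word'] in words_set:
                    print('About to remove entry: ' + words['Word'])
                    tmp.append(idx)
                else:
                    words_set.add(words['Word'])
            # Delete from the end so the remaining indexes stay valid.
            for idx in sorted(tmp, reverse=True):
                del clone[idx]
            nbsets_dict['Words'] = clone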

pprint.pprint(data)
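
The same idea can also be written without `copy.deepcopy` and index bookkeeping, by building a new list of the entries to keep. This is only a sketch, not part of the original answer: the function name `dedupe_words` and the shortened `sample` structure are made up for illustration, and it assumes the same `FileResults` / `Results` / `NBest` / `Words` layout as the post.

```python
def dedupe_words(data):
    """Keep only the first entry for each "Word" across the whole structure."""
    seen = set()  # words encountered so far, shared across all results
    for entry in data["FileResults"]:
        for result in entry["Results"]:
            for nbest in result["NBest"]:
                kept = []
                for item in nbest["Words"]:
                    if item["Word"] not in seen:
                        seen.add(item["Word"])
                        kept.append(item)
                # Replace the list in place of the original one.
                nbest["Words"] = kept
    return data


# Shortened, hypothetical sample with the same nesting as the post.
sample = {
    "FileResults": [
        {
            "Results": [
                {"NBest": [{"Words": [
                    {"Word": "ask", "Offset": 944400000, "Duration": 3500000},
                    {"Word": "to", "Offset": 98400000, "Duration": 1200000},
                ]}]},
                {"NBest": [{"Words": [
                    {"Word": "ask", "Offset": 90500000, "Duration": 3500000},
                    {"Word": "welcome", "Offset": 169700000, "Duration": 4500000},
                ]}]},
            ]
        }
    ]
}

dedupe_words(sample)
```

As in the code above, the `seen` set lives outside all the loops, so a word that appears in one result is also removed from every later result. Moving `seen = set()` inside the `nbest` loop would limit the deduplication to each `Words` list on its own.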
balderman
  • What are the imports for this solution? Thanks for the effort. – Plumb InFront Jun 29 '19 at 12:32
  • I'm getting an error at line 302, in `for entry in data['FileResults']:` `TypeError: string indices must be integers` – Plumb InFront Jun 29 '19 at 12:34
  • @PlumbInFront I have added a full version of the code. Give it a try. – balderman Jun 29 '19 at 12:41
  • That worked, thanks. If I want to remove the first occurrence rather than the second, which line should I change? – Plumb InFront Jun 29 '19 at 13:14
  • It's also removing other values. I only want to remove entries from "Words": [], not any other element. – Plumb InFront Jun 29 '19 at 13:19
  • It is important that you understand how it works. The idea is to store the words in a `set` named `words_set`, which contains only unique words. While we iterate over the words we ask "Have we seen this word before?" by checking whether the word is in `words_set`. If it is, we store the index of the word we want to remove in `tmp`. When we are done with a given list of words, we delete the duplicates, but we work on a copy of the words list named `clone`. After the cleanup we put `clone` back into the main data structure. – balderman Jun 29 '19 at 13:20
  • The element that is being removed is an entry in Words list that looks like `{ "Word": "word_value_here", "Offset": 277200000, "Duration": 1600000 },` – balderman Jun 29 '19 at 13:23
  • Thanks for the explanation. One last question: the JSON that gets printed is not valid, and I'm not sure why. – Plumb InFront Jun 29 '19 at 13:34
  • We don't work with JSON here. We work with a Python `dict` data structure. – balderman Jun 29 '19 at 13:37
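
The follow-up questions in the comments can be sketched in code. This is a minimal illustration under stated assumptions, not the answerer's code: `dedupe_words` and its `keep` parameter are hypothetical names, and the `raw` string is a shortened stand-in for the file from the question. It shows three things: parsing JSON text into a `dict` first (indexing a plain string with `['FileResults']` is what raises `TypeError: string indices must be integers`), keeping the last occurrence instead of the first, and serializing with `json.dumps` rather than `pprint`, since `pprint` prints Python syntax such as `None` and single quotes.

```python
import json


def dedupe_words(data, keep="first"):
    """Remove repeated "Word" entries, keeping the first or the last occurrence."""
    # Flatten to the list of NBest dicts so the traversal order is easy to reverse.
    nbest_items = [
        nbest
        for entry in data["FileResults"]
        for result in entry["Results"]
        for nbest in result["NBest"]
    ]
    reverse = keep == "last"
    if reverse:
        # Walking backwards makes the last occurrence the first one seen.
        nbest_items.reverse()
    seen = set()
    for nbest in nbest_items:
        words = nbest["Words"][::-1] if reverse else nbest["Words"]
        kept = []
        for item in words:
            if item["Word"] not in seen:
                seen.add(item["Word"])
                kept.append(item)
        # Restore the original order when the list was walked backwards.
        nbest["Words"] = kept[::-1] if reverse else kept
    return data


# Shortened, hypothetical stand-in for the JSON file in the question.
raw = """
{"FileResults": [{"FileName": "gtg.0.wav", "FileUrl": null, "Results": [
  {"NBest": [{"Words": [{"Word": "ask", "Offset": 944400000, "Duration": 3500000}]}]},
  {"NBest": [{"Words": [{"Word": "ask", "Offset": 90500000, "Duration": 3500000}]}]}
]}]}
"""

data = json.loads(raw)             # a dict, not a str
dedupe_words(data, keep="last")    # drops the first "ask", keeps the second
text = json.dumps(data, indent=4)  # valid JSON: null and double quotes
print(text)
```

When working with files, `json.load(f)` and `json.dump(data, f, indent=4)` play the same roles as `json.loads` and `json.dumps` here, for example with the `example1.json` and `updated_structure.json` names from the question.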