
I have a JSON file with thousands of lines. I won't bore you with the whole thing here, but this is a quick example of the dataset.

{
    "places": [
        {
            "place_name": "123 YOU N ME PRESCHOOL",
            "address": "809 W DETWEILLER DR STE A",
            "city": "PEORIA",
            "state": "IL",
            "zip": "61614",
            "geo_location": "40.89564657,-89.60566821",
            "data": {}
        },
        {
            "place_name": "123 YOU N ME PRESCHOOL",
            "address": "809 W DETWEILLER DR STE A",
            "city": "PEORIA",
            "state": "IL",
            "zip": "61615",
            "geo_location": "40.78878653,-89.605669034",
            "data": {}
        },
        {
            "place_name": "18144 GLEN TERRACE ST.",
            "address": "18144 GLEN TERRACE ST.",
            "city": "LANSING",
            "state": "IL",
            "zip": "60438",
            "geo_location": "41.565952019,-87.556316006",
            "data": {}
        }
    ]
}

As you can see, the first two places are almost identical, but they have different zips and geo locations, so they are not duplicates. I currently have a script that just looks for duplicated place_name values, but that returns places that aren't actually duplicates. Essentially, I would like my script to look at each row and, if one row is the same as another row, delete that row. Here is my code so far.

import collections
import pandas as pd

places = pd.read_json("test.json")  # each row of the "places" column is one place dict
place_names = [item['place_name'] for item in places['places']]
# names that appear more than once
print([item for item, count in collections.Counter(place_names).items() if count > 1])

2 Answers

2

As @user696969 pointed out, there are already useful answers in that post. The accepted answer by @jcollado works quite nicely, BUT your dictionaries contain a nested dictionary as the value of the `data` key. So running that code on your dataset without any changes will give you `TypeError: unhashable type: 'dict'`. To fix that, you can add the following check

new_d = d.copy()
for k, v in d.items(): 
    if type(v) is dict: # if a value is a dictionary, convert that value to a tuple
        new_d[k] = tuple(v.items())

So, the entire code becomes

seen = set()
new_l = []
for d in l:
    new_d = d.copy()
    for k, v in d.items():
        if type(v) is dict:              # make nested dicts hashable
            new_d[k] = tuple(v.items())

    t = tuple(new_d.items())             # hashable snapshot of the whole row
    if t not in seen:
        seen.add(t)
        new_l.append(d)                  # keep the original, unmodified dict

For example, let l be

l = [{
        "place_name": "123 YOU N ME PRESCHOOL",
        "address": "809 W DETWEILLER DR STE A",
        "city": "PEORIA",
        "state": "IL",
        "zip": "61614",
        "geo_location": "40.89564657,-89.60566821",
        "data": {}
    },
    {
        "place_name": "123 YOU N ME PRESCHOOL",
        "address": "809 W DETWEILLER DR STE A",
        "city": "PEORIA",
        "state": "IL",
        "zip": "61615",
        "geo_location": "40.78878653,-89.605669034",
        "data": {"hey": 1}
    },
    {
        "place_name": "18144 GLEN TERRACE ST.",
        "address": "18144 GLEN TERRACE ST.",
        "city": "LANSING",
        "state": "IL",
        "zip": "60438",
        "geo_location": "41.565952019,-87.556316006",
        "data": {"hey": 2}
    },
    {
        "place_name": "123 YOU N ME PRESCHOOL",
        "address": "809 W DETWEILLER DR STE A",
        "city": "PEORIA",
        "state": "IL",
        "zip": "61615",
        "geo_location": "40.78878653,-89.605669034",
        "data": {"hey": 1}
    },
]

After I run the code, new_l becomes

>>> new_l
    [{'place_name': '123 YOU N ME PRESCHOOL',
      'address': '809 W DETWEILLER DR STE A',
      'city': 'PEORIA',
      'state': 'IL',
      'zip': '61614',
      'geo_location': '40.89564657,-89.60566821',
      'data': {}},
     {'place_name': '123 YOU N ME PRESCHOOL',
      'address': '809 W DETWEILLER DR STE A',
      'city': 'PEORIA',
      'state': 'IL',
      'zip': '61615',
      'geo_location': '40.78878653,-89.605669034',
      'data': {'hey': 1}},
     {'place_name': '18144 GLEN TERRACE ST.',
      'address': '18144 GLEN TERRACE ST.',
      'city': 'LANSING',
      'state': 'IL',
      'zip': '60438',
      'geo_location': '41.565952019,-87.556316006',
      'data': {'hey': 2}}]

The dictionary with "data": {"hey": 1} now appears only once. If you know that "data" is the only key whose value can be a dictionary, you can drop the inner for loop and do

seen = set()
new_l = []
for d in l:
    d["data"] = tuple(d["data"].items())  # make the nested dict hashable
    t = tuple(d.items())
    d["data"] = dict(d["data"])           # back to a dictionary, even for skipped duplicates
    if t not in seen:
        seen.add(t)
        new_l.append(d)

That removes the inner loop and should improve performance overall.
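
For completeness, here is a rough sketch of how you might load your file and run the first version end to end. I'm assuming the file is test.json with the list under the "places" key, as in your example; the output file name deduped.json is just a placeholder:

import json

# json.load takes a file object; json.loads takes a string
with open("test.json") as f:
    data = json.load(f)

l = data["places"]                        # the list of place dictionaries

seen = set()
new_l = []
for d in l:
    new_d = d.copy()
    for k, v in d.items():
        if type(v) is dict:               # make nested dicts hashable
            new_d[k] = tuple(v.items())
    t = tuple(new_d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)

data["places"] = new_l                    # write the de-duplicated data back out
with open("deduped.json", "w") as f:
    json.dump(data, f, indent=4)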

– Camilo Martinez M.
  • This works great, but only if you put the JSON in a variable. I have 1000s of lines and need to read the JSON in with pandas or json.load or something, which causes a "JSON object must be str" error. – May 09 '21 at 15:11
  • ^ `f = open('test.json'); places = json.loads(f)` – May 09 '21 at 15:11
  • You can't load the JSON file as a dictionary? I mean, you could do `with open('test.json') as f: places = json.load(f)`, which will save the entire JSON content as a dictionary. From there, you can apply this algorithm. – Camilo Martinez M. May 09 '21 at 16:05
  • I get this error when doing that "TypeError: the JSON object must be str, bytes or bytearray, not TextIOWrapper" –  May 09 '21 at 16:14
  • As far as I know, that happens when you use `json.loads()` and not `json.load()`. – Camilo Martinez M. May 09 '21 at 17:16
1

I don't think there is a more efficient way, though. Usually, to remove duplicates from a list, you can use `list(set(list_variable))`, but that only works if `list_variable` contains only hashable values, and since dictionaries aren't hashable, you can't use that approach here. There are many approaches mentioned in this post. One alternative I can think of is to convert each place dictionary into a string (`str(place)`) and use those strings as keys of a new dictionary. Since a dictionary only stores unique keys, the new dictionary will contain only the unique places

unique_places = list({str(place): place for place in places["places"]}.values())
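
For example, assuming `places` is the dictionary loaded from the question's test.json via `json.load`, a minimal usage sketch would be the following (note that `str(place)` depends on key order, so two records with the same fields listed in a different order would not be treated as duplicates):

import json

with open("test.json") as f:
    places = json.load(f)

# duplicate dicts stringify identically, so the comprehension keeps one of each
unique_places = list({str(place): place for place in places["places"]}.values())
print(len(unique_places))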
– tax evader
  • that doesn't look at every element of the row, and will return duplicates –  May 09 '21 at 15:17
  • @Shultz The `str(place)` will essentially turn the `place` dictionary into a JSON string so every element will be looked at in order to do so. Could you provide a sample data in your test case that caused the duplicates? – tax evader May 09 '21 at 15:40
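
To illustrate the point in the comment above, a quick sketch with two of the sample records that differ only in zip and geo_location:

a = {"place_name": "123 YOU N ME PRESCHOOL", "zip": "61614",
     "geo_location": "40.89564657,-89.60566821", "data": {}}
b = {"place_name": "123 YOU N ME PRESCHOOL", "zip": "61615",
     "geo_location": "40.78878653,-89.605669034", "data": {}}

# str() includes every field, so rows differing in any field get different keys
print(str(a) == str(b))                  # False
print(len({str(p): p for p in [a, b]}))  # 2 -> both records are kept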