I have a JSON file with thousands of lines; I won't paste it all here, but this is a quick example of the dataset:
{
    "places": [
        {
            "place_name": "123 YOU N ME PRESCHOOL",
            "address": "809 W DETWEILLER DR STE A",
            "city": "PEORIA",
            "state": "IL",
            "zip": "61614",
            "geo_location": "40.89564657,-89.60566821",
            "data": {}
        },
        {
            "place_name": "123 YOU N ME PRESCHOOL",
            "address": "809 W DETWEILLER DR STE A",
            "city": "PEORIA",
            "state": "IL",
            "zip": "61615",
            "geo_location": "40.78878653,-89.605669034",
            "data": {}
        },
        {
            "place_name": "18144 GLEN TERRACE ST.",
            "address": "18144 GLEN TERRACE ST.",
            "city": "LANSING",
            "state": "IL",
            "zip": "60438",
            "geo_location": "41.565952019,-87.556316006",
            "data": {}
        }
    ]
}
As you can see, the first two places are almost identical, but they have different zips and geo locations, so they are not actually duplicates. My current script only checks whether place_name is duplicated, which flags places that aren't really duplicates. Essentially, I would like the script to compare each record against the others and, if one record matches another in every field, delete that record. Here is my code so far:
import collections

import pandas as pd

# Load the JSON file; the "places" key becomes a column of dicts.
places = pd.read_json("test.json")

# Collect every place_name and print the ones that appear more than once --
# this is what flags false duplicates like the two preschools above.
place_names = [item['place_name'] for item in places['places']]
print([item for item, count in collections.Counter(place_names).items() if count > 1])
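One way the whole-record comparison described above could be sketched: serialize each record with `json.dumps(..., sort_keys=True)` so that two records with identical fields produce the same string (and nested dicts like `data` remain comparable), then keep only the first occurrence of each string. This is a minimal sketch using inline sample data in place of `test.json` — the sample adds an exact copy of the first preschool record so there is a true duplicate to drop, while the near-duplicate with a different zip survives:

```python
import json

# Hypothetical stand-in for the file's contents: record A, an exact copy
# of A, and a near-duplicate B that differs only in zip and geo_location.
places = {
    "places": [
        {"place_name": "123 YOU N ME PRESCHOOL", "address": "809 W DETWEILLER DR STE A",
         "city": "PEORIA", "state": "IL", "zip": "61614",
         "geo_location": "40.89564657,-89.60566821", "data": {}},
        {"place_name": "123 YOU N ME PRESCHOOL", "address": "809 W DETWEILLER DR STE A",
         "city": "PEORIA", "state": "IL", "zip": "61614",
         "geo_location": "40.89564657,-89.60566821", "data": {}},
        {"place_name": "123 YOU N ME PRESCHOOL", "address": "809 W DETWEILLER DR STE A",
         "city": "PEORIA", "state": "IL", "zip": "61615",
         "geo_location": "40.78878653,-89.605669034", "data": {}},
    ]
}

seen = set()
deduped = []
for place in places["places"]:
    # sort_keys makes the serialization order-independent, so two records
    # with the same field values always yield the same key string.
    key = json.dumps(place, sort_keys=True)
    if key not in seen:
        seen.add(key)
        deduped.append(place)

print(len(deduped))  # 2 -- the exact copy is dropped, the different-zip record kept
```

With real data you would load `places` via `json.load(open("test.json"))` instead; a pandas `drop_duplicates` approach also exists, but it raises on unhashable dict columns like `data`, which is why this sketch compares serialized strings.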