-2

How can I remove duplicate entries from a JSON file using Python? My dataset is "user_lookup_data.json":

{'id': 297162425, 'id_str': '297162425', 'name': 'Arch'}
{'id': 297162425, 'id_str': '297162425', 'name': 'Arch'}
{'id': 1257964204650192897, 'id_str': '1257964204650192897'}
{'id': 934417886159896576, 'id_str': '934417886159896576'}
{'id': 1257964204650192897, 'id_str': '1257964204650192897'}
...
...
...

My code is:

i=0
tt = pd.read_json(("/content/trending_tweets.json"), lines=True)
trending_tweets_csv = convert_to_csv(tt,"trending_tweets.csv")
f = open(("/content/trending_tweets.json"), "r+")
data = f.read()
for x in data.split("\n"):
  strlist = "[" + x + "]"
  datalist = json.loads(strlist)
  for y in datalist:
    f = open('/content/user_lookup_data.json', 'a',encoding='utf-8')
    print(y["user"]["screen_name"])
    while i < len(pred_ada_test):
      print(pred_ada_test[i])
      y["user"]["bot/not"] = pred_ada_test[i]
      i=i+1
      break
    print(y["user"]) 
    screen_name = ('@' + y["user"]["screen_name"])
    file_name = screen_name + '_tweets.csv'
    file = pd.read_csv(file_name, sep='\t')
    print(file['tweet'])

I tried to do so but got an "UnsupportedOperation: not readable" error. It would be great if anyone could help me.

Thank you.

  • 1
    why not the user_lookup_data.json as a pandas df and do drop_duplicates? – Sreeram TP Jun 03 '21 at 12:36
  • 1
    tried doing so got ValueError: Expected object or value while reading the json file – Chirag atha Jun 03 '21 at 12:46
  • 1
    Please post the entire traceback. The error "UnsupportedOperation: not readable" indicates an issue with [file input/output](https://stackoverflow.com/questions/44901806/python-error-message-io-unsupportedoperation-not-readable), which is completely unrelated to eliminating duplicates. Consider to review the [mcve] help page as well. – MisterMiyagi Jun 03 '21 at 12:47
  • Heads up that your data is not JSON nor JSON lines. JSON requires `"` double quotes, not `'` single quotes. – MisterMiyagi Jun 03 '21 at 13:08
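
Putting the comments together: the lines use single quotes, so they are not valid JSON and `pd.read_json` raises the ValueError mentioned above, but each line can be parsed as a Python literal and then deduplicated with `drop_duplicates`. A minimal sketch of that approach (the file path comes from the question; deduplicating on `id` is an assumption):

import ast
import pandas as pd

# Parse each non-empty line as a Python dict literal
# (single-quoted lines are not valid JSON, so json.loads / pd.read_json fail)
with open("/content/user_lookup_data.json", encoding="utf-8") as f:
    records = [ast.literal_eval(line) for line in f if line.strip()]

# Drop rows that share the same "id"
df = pd.DataFrame(records).drop_duplicates(subset="id")
print(df)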

3 Answers

1

While not addressing the error you see, this will remove the duplicate entries from your user_lookup_data:

user_lookup_data = '''
{'id': 297162425, 'id_str': '297162425', 'name': 'Arch'}
{'id': 297162425, 'id_str': '297162425', 'name': 'Arch'}
{'id': 1257964204650192897, 'id_str': '1257964204650192897'}
{'id': 934417886159896576, 'id_str': '934417886159896576'}
{'id': 1257964204650192897, 'id_str': '1257964204650192897'}
'''

my_unique_user_lookup_data = set(row for row in user_lookup_data.split("\n") if row)

print("\n".join(my_unique_user_lookup_data))

This will print (the order may differ, since a set is unordered):

{'id': 934417886159896576, 'id_str': '934417886159896576'}
{'id': 1257964204650192897, 'id_str': '1257964204650192897'}
{'id': 297162425, 'id_str': '297162425', 'name': 'Arch'}
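
To apply the same idea to the file itself while keeping the original line order, `dict.fromkeys` can be used instead of a set (a sketch; the input path comes from the question, the output filename is made up):

with open("/content/user_lookup_data.json", encoding="utf-8") as f:
    # dict.fromkeys keeps only the first occurrence of each line, in order
    unique_lines = dict.fromkeys(line for line in f if line.strip())

with open("/content/user_lookup_data_deduped.json", "w", encoding="utf-8") as f:
    f.writelines(unique_lines)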
JonSG
  • 10,542
  • 2
  • 25
  • 36
0

"UnsupportedOperation: not readable" means you opened a file in a write-only mode and tried to read it.

f = open('/content/user_lookup_data.json', 'a',encoding='utf-8')

Here you opened it in 'a' (append) mode, which is write-only. Changing 'a' to 'a+' might solve your problem.

You can find explanations of all the modes in the Python docs.
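
As a quick illustration of the difference (using a hypothetical file name):

import io

try:
    with open("example.txt", "a", encoding="utf-8") as f:
        f.read()                    # append mode is write-only
except io.UnsupportedOperation as e:
    print(e)                        # "not readable"

with open("example.txt", "a+", encoding="utf-8") as f:
    f.seek(0)                       # 'a+' starts positioned at the end of the file
    print(f.read())                 # now readable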

Lex
  • 1
  • The file isn't actually used other than opening it, so this cannot cause the error. – MisterMiyagi Jun 03 '21 at 12:57
  • Yes, it would not cause an error if no read was attempted. According to the explanation in the first paragraph, "user_lookup_data.json" is the dataset, so reads might have been attempted outside the posted code. – Lex Jun 03 '21 at 13:07
-1

"Json and output"

import json

f = open("./data.json", "r")
data = json.load(f)
li = data["data"]          # expects a top-level {"data": [...]} structure
arr = []                   # ids seen so far

def fun(x):
    # keep a record only the first time its "id" is seen
    if x["id"] not in arr:
        arr.append(x["id"])
        return True
    else:
        return False

arr1 = list(filter(fun, li))
for i in arr1:
    print(i)

This may solve your problem

Naveenkumar M
  • 616
  • 3
  • 17
  • You might want to improve this answer to make it generally useful. The input data does not match the question data (which uses [JSON lines](https://jsonlines.org)). There are multiple copies and dead objects kept around, namely `f`, `data` and `arr`. Using a list for containment checks is slow for all but tiny data sets; the filtering has an O(n^2) runtime complexity. – MisterMiyagi Jun 03 '21 at 13:04
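
A variant of the answer's filter that addresses the complexity concern from the comment, swapping the list for a set so each membership check is O(1) (a sketch, keeping the answer's assumed {"data": [...]} input shape):

import json

with open("./data.json", "r") as f:
    li = json.load(f)["data"]

seen = set()

def fun(x):
    # keep a record only the first time its "id" appears
    if x["id"] in seen:
        return False
    seen.add(x["id"])
    return True

for record in filter(fun, li):
    print(record)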