
I have downloaded Twitter user objects.

This is an example of one object:

{
    "id": 6253282,
    "id_str": "6253282",
    "name": "Twitter API",
    "screen_name": "TwitterAPI",
    "location": "San Francisco, CA",
    "profile_location": null,
    "description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
    "url": "https:\/\/t.co\/8IkCzCDr19",
    "entities": {
        "url": {
            "urls": [{
                "url": "https:\/\/t.co\/8IkCzCDr19",
                "expanded_url": "https:\/\/developer.twitter.com",
                "display_url": "developer.twitter.com",
                "indices": [
                    0,
                    23
                ]
            }]
        },
        "description": {
            "urls": []
        }
    },
    "protected": false,
    "followers_count": 6133636,
    "friends_count": 12,
    "listed_count": 12936,
    "created_at": "Wed May 23 06:01:13 +0000 2007",
    "favourites_count": 31,
    "utc_offset": null,
    "time_zone": null,
    "geo_enabled": null,
    "verified": true,
    "statuses_count": 3656,
    "lang": null,
    "contributors_enabled": null,
    "is_translator": null,
    "is_translation_enabled": null,
    "profile_background_color": null,
    "profile_background_image_url": null,
    "profile_background_image_url_https": null,
    "profile_background_tile": null,
    "profile_image_url": null,
    "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
    "profile_banner_url": null,
    "profile_link_color": null,
    "profile_sidebar_border_color": null,
    "profile_sidebar_fill_color": null,
    "profile_text_color": null,
    "profile_use_background_image": null,
    "has_extended_profile": null,
    "default_profile": false,
    "default_profile_image": false,
    "following": null,
    "follow_request_sent": null,
    "notifications": null,
    "translator_type": null
}

Somehow, though, the data contains many duplicates; perhaps the input file had duplicated values.

This is the pattern of the downloaded Twitter file, which I named rawjson: { user-object }{ user-object }{ user-object }

So I ended up with a 16 GB file of users containing repeated entries, and I need to delete the duplicated users.

This is what I have done so far:

def twitterToListJsonMethodTwo(self, rawjson, twitterToListJson):
    # Delete the old output file if it exists
    if os.path.exists(twitterToListJson):
        try:
            os.remove(twitterToListJson)
        except OSError:
            pass
    counter = 1
    objc = 1
    with open(rawjson, encoding='utf8') as fin, open(twitterToListJson, 'w', encoding='utf8') as fout:
        for line in fin:
            # a line of length 3 is just "}{\n", i.e. the boundary between two objects
            if line.find('}{') != -1 and len(line) == 3:
                objc = objc + 1
                fout.write(line.replace('}{', '},\n{'))
            else:
                fout.write(line)
            counter = counter + 1
            # print(counter)
        print("Process Complete: Twitter object to Total lines: ", counter)

        self.twitterToListJsonMethodOne(twitterToListJson)

and the output sample file looks like this now:

[
    {user-object},
    {user-object},
    {user-object} 
]

Each user-object is now a dict, but I cannot find a way to remove the duplicates; all of the tutorials/solutions I found only deal with small objects and small lists. I am not very good with Python, but I need an efficient solution, as the file is too big and memory could be a problem.

Each user-object, like the one shown above, has a unique id and screen_name.

  • It would be much easier to dedupe the data *before* you write it out to disk, but you haven't shared any of that code so it's impossible to point out exactly where you'd do that. The approach I'd take would be to put everything into a dict that's keyed by `id` (so that entries with the same `id` will overwrite each other). If you need to dedupe the file itself because you're streaming data and constantly writing to the file, use a database instead of a flat file. – Samwise Dec 20 '21 at 16:22
  • what is/are the unique field[s] of the entry? – balderman Dec 20 '21 at 16:23
  • @balderman "id": 000000 are unique values in Tweets – Adnan Ali Dec 20 '21 at 16:24
  • And you have a 16 GB file on the disk with many many entries like this? – balderman Dec 20 '21 at 16:25
  • I am not seeing the duplicates in the example user object? – dawg Dec 20 '21 at 16:25
  • @Samwise I made the list of dict. as each json obj is dict now. i add code and details in the question – Adnan Ali Dec 20 '21 at 16:25
  • @balderman yes. let me add more information and code in the Question. – Adnan Ali Dec 20 '21 at 16:26
  • @dawg because the example user object is just one object. i have thousands of the objects and many are duplicates. not posting here as Question will be too long. and hard to follow. – Adnan Ali Dec 20 '21 at 16:35
  • Where does `rawjson` come from? Again: it is much easier to fix this at the point where the duplication was introduced than to fix it after the fact. 16GB of data is a lot to load into memory. – Samwise Dec 20 '21 at 16:37
  • @Samwise pattern of downloaded Twitter File. I named it as `rawjson` – Adnan Ali Dec 20 '21 at 16:45
  • @Samwise otherwise i have to download users again. around 200K users. Obviously, if i dont find any solution, i have to do that then. – Adnan Ali Dec 20 '21 at 16:47
  • where did you download it from, though? If this is data from a Twitter API, there's probably a way to make the query that doesn't produce so many duplicates. – Samwise Dec 20 '21 at 16:49
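
As a reference for the approach suggested in the comments above, here is a minimal sketch of deduplicating at download time with a dict keyed by id. The fetch_users name and the file path are purely illustrative (the download code was not shared), and note that this keeps every user in memory, which is exactly the constraint raised in the question.

import json

def download_unique_users(fetch_users, out_path):
    # Entries that share an id overwrite each other, so only one copy per id survives.
    users_by_id = {}
    for user in fetch_users():            # hypothetical generator yielding user dicts
        users_by_id[user['id']] = user
    with open(out_path, 'w', encoding='utf8') as fout:
        json.dump(list(users_by_id.values()), fout)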

5 Answers


To process huge JSON datasets, especially long lists of objects, it's better to use a streaming JSON parser such as json-stream (https://github.com/daggaz/json-stream) to read the user objects one by one, then add each one to your results only if that user was not encountered before.

Example:

import json_stream

unique_users = []
seen_users = set()
with open('input.json') as f:
    js = json_stream.load(f)
    for us in js:
        user = dict(us.items())
        if user['id'] not in seen_users:
            unique_users.append(user)
            seen_users.add(user['id'])

The reason for user = dict(us.items()) is that if we go looking for the id in the object via the stream, we can't backtrack to get the whole object any more. So we need to "render" out every user object and then check the id.

vaizki

You could modify a merge sort and delete the duplicates during the merge, in O(n log n).
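
For completeness, here is a rough sketch of how a sort-based dedup could look. This is not the answerer's code: it assumes the users have first been flattened to one JSON object per line, and the function name, chunk size and temp-file layout are all illustrative. The idea is to sort fixed-size chunks by id, spill them to temporary run files, merge the runs with heapq.merge, and keep only the first record of each id.

import heapq
import json
import os
import tempfile

def dedupe_by_external_sort(users_ndjson, out_path, chunk_size=100_000):
    # Phase 1: split the input into id-sorted runs that fit in memory.
    runs = []
    with open(users_ndjson, encoding='utf8') as fin:
        while True:
            chunk = []
            for _ in range(chunk_size):
                line = fin.readline()
                if not line:
                    break
                chunk.append((json.loads(line)['id'], line))
            if not chunk:
                break
            chunk.sort(key=lambda pair: pair[0])
            run = tempfile.NamedTemporaryFile('w', delete=False, encoding='utf8')
            for uid, line in chunk:
                run.write(f"{uid}\t{line.strip()}\n")
            run.close()
            runs.append(run.name)

    # Phase 2: merge the sorted runs and drop records whose id was just emitted.
    def records(path):
        with open(path, encoding='utf8') as f:
            for row in f:
                uid, _, payload = row.partition('\t')
                yield int(uid), payload

    last_id = None
    with open(out_path, 'w', encoding='utf8') as fout:
        for uid, payload in heapq.merge(*(records(r) for r in runs), key=lambda rec: rec[0]):
            if uid != last_id:
                fout.write(payload)
                last_id = uid

    for r in runs:
        os.remove(r)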

B.Quinn

Use ijson like it is used here.
Create a set that will hold the item ids.
If the id is in the set, drop the item; otherwise, collect the item.
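
A minimal sketch of that approach, assuming the converted file is a single JSON array of user objects (the file names are illustrative):

import json
import ijson

seen_ids = set()
with open('twitterToListJson.json', 'rb') as fin, \
        open('unique_users.json', 'w', encoding='utf8') as fout:
    fout.write('[\n')
    first = True
    for user in ijson.items(fin, 'item'):    # streams one array element at a time
        if user['id'] in seen_ids:
            continue                         # id already seen, drop the item
        seen_ids.add(user['id'])
        if not first:
            fout.write(',\n')
        json.dump(user, fout, default=str)   # ijson can yield Decimal for numbers
        first = False
    fout.write('\n]\n')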

balderman

Convert each dictionary into a tuple with the items() dict method, turning the list of dictionaries into a list of tuples. Then you can run set() on it to get rid of duplicates, because tuples are hashable (remember to wrap each items() view in tuple()). Sample code would be:

data = (tuple(d.items()) for d in twitter_data)

This should solve the issue of duplicate dictionaries if the dictionaries are identical in every key-value pair.
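
A minimal completion of that idea (only a sketch: it works when every value in the dicts is itself hashable, so the nested entities field from the example object would need special handling first, and twitter_data is assumed to be the in-memory list of user dicts):

# Sort the items so that key order does not affect equality, then hash the tuples.
data = (tuple(sorted(d.items())) for d in twitter_data)
unique_users = [dict(t) for t in set(data)]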

Rifat Rakib

I did not find any useful and memory-efficient solution, so I downloaded the data again.

One possible solution would have been (step by step):

1- Make the input data unique (the file I used for downloading the data); see the sketch after these steps.

2- Then read the JSON file and copy its elements to another file one by one, deleting processed values from the input file to avoid duplication.

3- But that would not be memory efficient, and it is too much work compared to downloading the data again.
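
A minimal sketch of step 1, assuming the download input file holds one user id or screen name per line (the file names are illustrative):

# Dedupe the download input file, keeping the first occurrence of each value.
seen = set()
with open('input_ids.txt', encoding='utf8') as fin, \
        open('input_ids_unique.txt', 'w', encoding='utf8') as fout:
    for line in fin:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            fout.write(key + '\n')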

If someone runs into this problem in the future, you are better off downloading the data again.

@vaizki's answer is good and may be useful for someone, but I could not install the library: pip did not find it, and conda does not work well here (I am in China; maybe my university network or the VPN is the problem).

Adnan Ali