
I have downloaded Twitter user objects.

This is an example of one object:

{
    "id": 6253282,
    "id_str": "6253282",
    "name": "Twitter API",
    "screen_name": "TwitterAPI",
    "location": "San Francisco, CA",
    "profile_location": null,
    "description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
    "url": "https:\/\/t.co\/8IkCzCDr19",
    "entities": {
        "url": {
            "urls": [{
                "url": "https:\/\/t.co\/8IkCzCDr19",
                "expanded_url": "https:\/\/developer.twitter.com",
                "display_url": "developer.twitter.com",
                "indices": [
                    0,
                    23
                ]
            }]
        },
        "description": {
            "urls": []
        }
    },
    "protected": false,
    "followers_count": 6133636,
    "friends_count": 12,
    "listed_count": 12936,
    "created_at": "Wed May 23 06:01:13 +0000 2007",
    "favourites_count": 31,
    "utc_offset": null,
    "time_zone": null,
    "geo_enabled": null,
    "verified": true,
    "statuses_count": 3656,
    "lang": null,
    "contributors_enabled": null,
    "is_translator": null,
    "is_translation_enabled": null,
    "profile_background_color": null,
    "profile_background_image_url": null,
    "profile_background_image_url_https": null,
    "profile_background_tile": null,
    "profile_image_url": null,
    "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
    "profile_banner_url": null,
    "profile_link_color": null,
    "profile_sidebar_border_color": null,
    "profile_sidebar_fill_color": null,
    "profile_text_color": null,
    "profile_use_background_image": null,
    "has_extended_profile": null,
    "default_profile": false,
    "default_profile_image": false,
    "following": null,
    "follow_request_sent": null,
    "notifications": null,
    "translator_type": null
}

Somehow, though, the data contains many duplicates; perhaps the input file had duplicated values.

This is the pattern of the downloaded Twitter file, which I named rawjson: { user-object }{ user-object }{ user-object }

So I ended up with a 16 GB file of users containing repeated entries, and I need to delete the duplicated users.

This is what I have done so far:

def twitterToListJsonMethodTwo(self, rawjson, twitterToListJson):
    # Delete the old output file if it exists
    if os.path.exists(twitterToListJson):
        try:
            os.remove(twitterToListJson)
        except OSError:
            pass
    counter = 1
    objc = 1
    with open(rawjson, encoding='utf8') as fin, open(twitterToListJson, 'w', encoding='utf8') as fout:
        for line in fin:
            # a line of length 3 is just "}{\n", i.e. the boundary between two objects
            if line.find('}{') != -1 and len(line) == 3:
                objc = objc + 1
                fout.write(line.replace('}{', '},\n{'))
            else:
                fout.write(line)
            counter = counter + 1
            # print(counter)
        print("Process Complete: Twitter object to Total lines: ", counter)

        self.twitterToListJsonMethodOne(twitterToListJson)

and the output sample file looks like this now:

[
    {user-object},
    {user-object},
    {user-object} 
]

Each user-object is now a dict, but I cannot find a way to remove the duplicates; all of the tutorials/solutions I found only deal with small objects and small lists. I am not very good with Python, but I need an efficient solution, as the file is too big and memory could be a problem.

Each user-object, like the one shown above, has a unique id and screen_name.

  • It would be much easier to dedupe the data *before* you write it out to disk, but you haven't shared any of that code so it's impossible to point out exactly where you'd do that. The approach I'd take would be to put everything into a dict that's keyed by `id` (so that entries with the same `id` will overwrite each other). If you need to dedupe the file itself because you're streaming data and constantly writing to the file, use a database instead of a flat file. – Samwise Dec 20 '21 at 16:22
  • what is/are the unique field[s] of the entry? – balderman Dec 20 '21 at 16:23
  • @balderman "id": 000000 are unique values in Tweets – Adnan Ali Dec 20 '21 at 16:24
  • And you have a 16 GB file on the disk with many many entries like this? – balderman Dec 20 '21 at 16:25
  • I am not seeing the duplicates in the example user object? – dawg Dec 20 '21 at 16:25
  • @Samwise I made the list of dict. as each json obj is dict now. i add code and details in the question – Adnan Ali Dec 20 '21 at 16:25
  • @balderman yes. let me add more information and code in the Question. – Adnan Ali Dec 20 '21 at 16:26
  • @dawg because the example user object is just one object. i have thousands of the objects and many are duplicates. not posting here as Question will be too long. and hard to follow. – Adnan Ali Dec 20 '21 at 16:35
  • Where does `rawjson` come from? Again: it is much easier to fix this at the point where the duplication was introduced than to fix it after the fact. 16GB of data is a lot to load into memory. – Samwise Dec 20 '21 at 16:37
  • @Samwise pattern of downloaded Twitter File. I named it as `rawjson` – Adnan Ali Dec 20 '21 at 16:45
  • @Samwise otherwise i have to download users again. around 200K users. Obviously, if i dont find any solution, i have to do that then. – Adnan Ali Dec 20 '21 at 16:47
  • where did you download it from, though? If this is data from a Twitter API, there's probably a way to make the query that doesn't produce so many duplicates. – Samwise Dec 20 '21 at 16:49
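
As a reference for the approach suggested in the comments above, here is a minimal sketch of deduplicating at download time with a dict keyed by id. The fetch_users name and the file path are purely illustrative (the download code was not shared), and note that this keeps every user in memory, which is exactly the constraint raised in the question.

import json

def download_unique_users(fetch_users, out_path):
    # Entries that share an id overwrite each other, so only one copy per id survives.
    users_by_id = {}
    for user in fetch_users():            # hypothetical generator yielding user dicts
        users_by_id[user['id']] = user
    with open(out_path, 'w', encoding='utf8') as fout:
        json.dump(list(users_by_id.values()), fout)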

5 Answers


To process huge JSON datasets, especially long lists of objects, it's better to use a streaming JSON parser such as json-stream (https://github.com/daggaz/json-stream) to read the user objects one by one, then add each one to your results only if that user was not encountered before.

Example:

import json_stream

unique_users = []
seen_users = set()
with open('input.json') as f:
    js = json_stream.load(f)
    for us in js:
        user = dict(us.items())
        if user['id'] not in seen_users:
            unique_users.append(user)
            seen_users.add(user['id'])

The reason for user = dict(us.items()) is that if we go looking for the id in the object via the stream, we can't backtrack to get the whole object any more. So we need to "render" out every user object and then check the id.

vaizki

You could modify a merge sort and delete the duplicates during the merge, in O(n log n).
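
For completeness, here is a rough sketch of how a sort-based dedup could look. This is not the answerer's code: it assumes the users have first been flattened to one JSON object per line, and the function name, chunk size and temp-file layout are all illustrative. The idea is to sort fixed-size chunks by id, spill them to temporary run files, merge the runs with heapq.merge, and keep only the first record of each id.

import heapq
import json
import os
import tempfile

def dedupe_by_external_sort(users_ndjson, out_path, chunk_size=100_000):
    # Phase 1: split the input into id-sorted runs that fit in memory.
    runs = []
    with open(users_ndjson, encoding='utf8') as fin:
        while True:
            chunk = []
            for _ in range(chunk_size):
                line = fin.readline()
                if not line:
                    break
                chunk.append((json.loads(line)['id'], line))
            if not chunk:
                break
            chunk.sort(key=lambda pair: pair[0])
            run = tempfile.NamedTemporaryFile('w', delete=False, encoding='utf8')
            for uid, line in chunk:
                run.write(f"{uid}\t{line.strip()}\n")
            run.close()
            runs.append(run.name)

    # Phase 2: merge the sorted runs and drop records whose id was just emitted.
    def records(path):
        with open(path, encoding='utf8') as f:
            for row in f:
                uid, _, payload = row.partition('\t')
                yield int(uid), payload

    last_id = None
    with open(out_path, 'w', encoding='utf8') as fout:
        for uid, payload in heapq.merge(*(records(r) for r in runs), key=lambda rec: rec[0]):
            if uid != last_id:
                fout.write(payload)
                last_id = uid

    for r in runs:
        os.remove(r)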

B.Quinn

Use ijson like it is used here.
Create a set that will hold the item ids.
If the id is in the set, drop the item; otherwise, collect the item.
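
A minimal sketch of that approach, assuming the converted file is a single JSON array of user objects (the file names are illustrative):

import json
import ijson

seen_ids = set()
with open('twitterToListJson.json', 'rb') as fin, \
        open('unique_users.json', 'w', encoding='utf8') as fout:
    fout.write('[\n')
    first = True
    for user in ijson.items(fin, 'item'):    # streams one array element at a time
        if user['id'] in seen_ids:
            continue                         # id already seen, drop the item
        seen_ids.add(user['id'])
        if not first:
            fout.write(',\n')
        json.dump(user, fout, default=str)   # ijson can yield Decimal for numbers
        first = False
    fout.write('\n]\n')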

balderman

Convert each dictionary into a tuple with the items() dict method, turning the list of dictionaries into a list of tuples. Then you can run set() on it to get rid of duplicates, because tuples are hashable (remember to wrap each items() view in tuple()). Sample code would be:

data = (tuple(d.items()) for d in twitter_data)

This should solve the issue of duplicate dictionaries if the dictionaries are identical in every key-value pair.
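
A minimal completion of that idea (only a sketch: it works when every value in the dicts is itself hashable, so the nested entities field from the example object would need special handling first, and twitter_data is assumed to be the in-memory list of user dicts):

# Sort the items so that key order does not affect equality, then hash the tuples.
data = (tuple(sorted(d.items())) for d in twitter_data)
unique_users = [dict(t) for t in set(data)]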

Rifat Rakib

I did not find any useful and memory-efficient solution, so I downloaded the data again.

One possible solution would have been (step by step):

1- Make the input data unique (the file I used for downloading the data); see the sketch after these steps.

2- Then read the JSON file and copy its elements to another file one by one, deleting processed values from the input file to avoid duplication.

3- But that would not be memory efficient, and it is too much work compared to downloading the data again.
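
A minimal sketch of step 1, assuming the download input file holds one user id or screen name per line (the file names are illustrative):

# Dedupe the download input file, keeping the first occurrence of each value.
seen = set()
with open('input_ids.txt', encoding='utf8') as fin, \
        open('input_ids_unique.txt', 'w', encoding='utf8') as fout:
    for line in fin:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            fout.write(key + '\n')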

If someone runs into this problem in the future, you are better off downloading the data again.

@vaizki's answer is good and may be useful for someone, but I could not install the library: pip did not find it, and conda does not work well here (I am in China; maybe my university network or the VPN is the problem).

Adnan Ali