Python - How to deduplicate a list of tuples, by only keeping the most recent tuples.

Question

I have a dataset where each record contains the date the user tweeted, their screenname, their follower count, and their friend count. A user can appear multiple times in the dataset, at different times and with different follower/friend counts. I would like to get a unique list of users together with their most recent follower/friend counts. I do not want to de-duplicate on screenname alone; I want each user's most recent values.

This is what my data currently looks like with duplicate values

In [14]: data
Out[14]: 
[(datetime.datetime(2014, 11, 21, 1, 16, 2), u'AlexMatosE', 773, 560),
 (datetime.datetime(2014, 11, 21, 1, 17, 6), u'hedofthebloom', 670, 618),
 (datetime.datetime(2014, 11, 21, 1, 18, 8), u'hedofthebloom', 681, 615),
 (datetime.datetime(2014, 11, 21, 1, 19, 1), u'jape2116', 263, 540),
 (datetime.datetime(2014, 11, 21, 1, 19, 3), u'_AlexMatosE', 790, 561),
 (datetime.datetime(2014, 11, 21, 1, 19, 5), u'Buffmuff69', 292, 270),
 (datetime.datetime(2014, 11, 21, 1, 20, 1), u'steveamodu', 140, 369),
 (datetime.datetime(2014, 11, 21, 1, 20, 9), u'jape2116', 263, 540),
 (datetime.datetime(2014, 11, 21, 1, 21, 3), u'chighway', 363, 767),
 (datetime.datetime(2014, 11, 21, 1, 22, 9), u'jape2116', 299, 2000)]

This is how I can get the unique users in the data

In [15]: users = set(sorted([line[1] for line in data]))

Now I need to figure out how to get the MOST RECENT set of values for each unique user in the dataset. I'm not sure if a for-loop is the best way to go here or if something else would be better.

In [18]: most_recent_user_data = [] 
   ....: for line in data:
   ....:     if line[1] in users:
   ....:         ...
   ....:         ...
   ....:         ...
   ....:         most_recent_user_data.append((line[1], line[2], line[3]))

Ultimately, I want to end up with each unique user once, along with their MOST RECENT followers/friends values:

In [19]: most_recent_user_data
Out[19]: 
[(u'hedofthebloom', 681, 615),
 (u'_AlexMatosE', 790, 561),
 (u'Buffmuff69', 292, 270),
 (u'steveamodu', 140, 369),
 (u'chighway', 363, 767),
 (u'jape2116', 299, 2000)]

Have you tried grouping by user, sorting by timestamp, and getting the most recent one? — chapelo, Dec 19 '14 at 02:33

score 1 · Answer 1 · answered Dec 19 '14 at 06:40

You can use the groupby function from the itertools module:

import datetime
import itertools

data = [(datetime.datetime(2014, 11, 21, 1, 16, 2), u'AlexMatosE', 773, 560),
        (datetime.datetime(2014, 11, 21, 1, 17, 6), u'hedofthebloom', 670, 618),
        (datetime.datetime(2014, 11, 21, 1, 18, 8), u'hedofthebloom', 681, 615),
        (datetime.datetime(2014, 11, 21, 1, 19, 1), u'jape2116', 263, 540),
        (datetime.datetime(2014, 11, 21, 1, 19, 3), u'_AlexMatosE', 790, 561),
        (datetime.datetime(2014, 11, 21, 1, 19, 5), u'Buffmuff69', 292, 270),
        (datetime.datetime(2014, 11, 21, 1, 20, 1), u'steveamodu', 140, 369),
        (datetime.datetime(2014, 11, 21, 1, 20, 9), u'jape2116', 263, 540),
        (datetime.datetime(2014, 11, 21, 1, 21, 3), u'chighway', 363, 767),
        (datetime.datetime(2014, 11, 21, 1, 22, 9), u'jape2116', 299, 2000)]

# sort records by name and datetime, newest first within each name
data = sorted(data, key=lambda x: (x[1], x[0]), reverse=True)

# group by username and take the first (most recent) record of each group
most_recent_user_data = [next(v)[1:] for k, v in itertools.groupby(data, key=lambda x: x[1])]

result:

[('steveamodu', 140, 369),
 ('jape2116', 299, 2000), 
 ('hedofthebloom', 681, 615),
 ('chighway', 363, 767), 
 ('_AlexMatosE', 790, 561),
 ('Buffmuff69', 292, 270), 
 ('AlexMatosE', 773, 560)]
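
The sort matters because groupby only merges consecutive items that share a key. A minimal sketch of that behaviour, using a short made-up list of names rather than the full records:

```python
import itertools

# Illustrative names only; the same user appears twice, not adjacent
names = ['jape2116', 'hedofthebloom', 'jape2116']

# Without sorting, the repeated user ends up in two separate groups
unsorted_groups = [key for key, _ in itertools.groupby(names)]

# After sorting, equal names are adjacent and form a single group
sorted_groups = [key for key, _ in itertools.groupby(sorted(names))]

print(unsorted_groups)
print(sorted_groups)
```

If the sort is skipped, a user whose records are interleaved with other users would show up more than once in the result.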

score 0 · Answer 2 · answered Dec 19 '14 at 02:42

One way would be to use dictionaries and use usernames as keys. For each key, you would have a list of user data, which you could sort as you want. The following is one way of doing this:

from collections import defaultdict

# group the records by username
dataDict = defaultdict(list)

for v in data:
    dataDict[v[1]].append(v)

# sort each user's records newest first (tuples compare by their first element, the datetime)
for u, v in dataDict.items():
    dataDict[u] = sorted(v, reverse=True)

# print the first (i.e. most recent) values for each user
for u, v in dataDict.items():
    print((u, v[0][-2], v[0][-1]))

The result is:

(u'chighway', 363, 767)
(u'AlexMatosE', 773, 560)
(u'hedofthebloom', 681, 615)
(u'steveamodu', 140, 369)
(u'Buffmuff69', 292, 270)
(u'_AlexMatosE', 790, 561)
(u'jape2116', 299, 2000)
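
If only the newest record per user is needed, the per-user sort can probably be replaced by max(). The following is a sketch under that assumption, run on a subset of the question's data; latest_per_user is a hypothetical helper name, not something from the answer above.

```python
import datetime
from collections import defaultdict
from operator import itemgetter

# A subset of the question's records: (timestamp, screenname, followers, friends)
data = [
    (datetime.datetime(2014, 11, 21, 1, 17, 6), 'hedofthebloom', 670, 618),
    (datetime.datetime(2014, 11, 21, 1, 18, 8), 'hedofthebloom', 681, 615),
    (datetime.datetime(2014, 11, 21, 1, 19, 1), 'jape2116', 263, 540),
    (datetime.datetime(2014, 11, 21, 1, 22, 9), 'jape2116', 299, 2000),
]

def latest_per_user(records):
    """Hypothetical helper: group by screenname, then keep the newest record."""
    by_user = defaultdict(list)
    for record in records:
        by_user[record[1]].append(record)
    # max() keyed on the timestamp finds the newest record without sorting the group
    return [max(group, key=itemgetter(0))[1:] for group in by_user.values()]

latest = latest_per_user(data)
print(latest)
```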

score 0 · Answer 3 · answered Dec 19 '14 at 02:59

Use a dictionary to store the latest record for each user:

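from pprint import pprint

latests = {}
for d in data:
    # setdefault stores d the first time a user is seen; afterwards, replace only if d is newer
    if d[0] > latests.setdefault(d[1], d)[0]:
        latests[d[1]] = d

# (screenname, followers, friends) for each user
results = [d[1:] for d in latests.values()]
pprint(results)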

answered Dec 19 '14 at 02:59

Brendan Abel
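
A single pass like this does not appear to depend on the input order. As a rough check, here is a sketch that spells out the same idea with an explicit membership test, run on a few of the question's records deliberately placed out of chronological order:

```python
import datetime

# Records from the question, intentionally NOT in chronological order
data = [
    (datetime.datetime(2014, 11, 21, 1, 22, 9), 'jape2116', 299, 2000),
    (datetime.datetime(2014, 11, 21, 1, 19, 1), 'jape2116', 263, 540),
    (datetime.datetime(2014, 11, 21, 1, 18, 8), 'hedofthebloom', 681, 615),
    (datetime.datetime(2014, 11, 21, 1, 17, 6), 'hedofthebloom', 670, 618),
]

latest = {}
for timestamp, name, followers, friends in data:
    # Store the record if the user is new, or if it is strictly newer than the stored one
    if name not in latest or timestamp > latest[name][0]:
        latest[name] = (timestamp, followers, friends)

# Drop the timestamp to get (screenname, followers, friends)
results = [(name, followers, friends)
           for name, (_, followers, friends) in latest.items()]
print(results)
```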


score 0 · Answer 4 · edited May 23 '17 at 12:22

An alternative way to get the desired result:

from operator import itemgetter

# sort the data using time as the key
data.sort(key=itemgetter(0), reverse=True)

# remove duplicated users from the data
def uniq(seq):
    seen = set()
    seen_add = seen.add
    return [(x[1], x[2], x[3]) for x in seq if not (x[1] in seen or seen_add(x[1]))]

uniq(data)

which gives:

[('jape2116', 299, 2000),
 ('chighway', 363, 767),
 ('steveamodu', 140, 369),
 ('Buffmuff69', 292, 270),
 ('_AlexMatosE', 790, 561),
 ('hedofthebloom', 681, 615),
 ('AlexMatosE', 773, 560)]

I'm using the method mentioned in this thread.
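
The one-liner in uniq works because set.add returns None, so `x[1] in seen or seen_add(x[1])` is falsy only the first time a name appears. A more explicit sketch of the same first-occurrence filter is shown below; first_seen is a hypothetical helper, and small integers stand in for the timestamps.

```python
def first_seen(records, key_index=1):
    """Hypothetical helper: yield each record whose key has not appeared before."""
    seen = set()
    for record in records:
        key = record[key_index]
        if key not in seen:
            seen.add(key)
            yield record

# Newest first; the integers are placeholders for real datetime values
records = [
    (3, 'jape2116', 299, 2000),
    (2, 'hedofthebloom', 681, 615),
    (1, 'jape2116', 263, 540),
]

unique = [record[1:] for record in first_seen(records)]
print(unique)
```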

score 0 · Answer 5 · answered Dec 19 '14 at 21:32

Iterate over the dataset in reverse time order and add each user to a dictionary (or append to a list) only the first time they show up. This relies on data already being in chronological order, as it is in the question:

users = {}
for d in reversed(data):
    if d[1] not in users:
        users[d[1]] = tuple(d[2:])

# {'_AlexMatosE': (790, 561), 'steveamodu': (140, 369), 'jape2116': (299, 2000), 'chighway': (363, 767), 'AlexMatosE': (773, 560), 'hedofthebloom': (681, 615), 'Buffmuff69': (292, 270)}
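
When the input is not guaranteed to be in chronological order, one possible variation is to sort ascending by timestamp and let later records overwrite earlier ones in the dictionary. A minimal sketch, assuming the same tuple layout as in the question:

```python
import datetime
from operator import itemgetter

# Records from the question, intentionally out of chronological order
data = [
    (datetime.datetime(2014, 11, 21, 1, 22, 9), 'jape2116', 299, 2000),
    (datetime.datetime(2014, 11, 21, 1, 17, 6), 'hedofthebloom', 670, 618),
    (datetime.datetime(2014, 11, 21, 1, 19, 1), 'jape2116', 263, 540),
    (datetime.datetime(2014, 11, 21, 1, 18, 8), 'hedofthebloom', 681, 615),
]

# Oldest first, so the newest record for each user is written last and wins
latest = {
    name: (followers, friends)
    for _, name, followers, friends in sorted(data, key=itemgetter(0))
}
print(latest)
```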
