Remove duplicate in datalist Python 2.7/Django

Question

For example, I have a list called attendances that contains multiple records like:

[ <Attendance>: 11804 : 2018-07-18 12:22:55, <Attendance>: 11804 : 2018-07-18 12:23:04, <Attendance>: 2 : 2018-07-25 16:17:18, <Attendance>: 2 : 2018-07-25 16:17:20, <Attendance>: 2 : 2018-07-25 16:17:23, <Attendance>: 2 : 2018-07-25 16:27:52]

When I need to print it, I simply do:

for data in attendances:
    print 'User ID   : {}'.format(data.user_id)
    print 'Timestamp : {}'.format(data.timestamp)

The result will be:

User ID   : 11804
Timestamp : 2018-07-18 12:22:55
User ID   : 11804
Timestamp : 2018-07-18 12:23:04
User ID   : 2
Timestamp : 2018-07-25 16:17:18
User ID   : 2
Timestamp : 2018-07-25 16:17:20
User ID   : 2
Timestamp : 2018-07-25 16:17:23
User ID   : 2
Timestamp : 2018-07-25 16:27:52

But that is not what I need, since it prints all the data. I need to show only the first record for every User ID,

like this:

User ID   : 11804
Timestamp : 2018-07-18 12:22:55
User ID   : 2
Timestamp : 2018-07-25 16:17:18

Does anyone have an idea what I should do?

The first one is always the *earliest*? Can you share the (relevant parts of the) `Attendance` model? — Willem Van Onsem, Jul 25 '18 at 08:45
Create a set outside of your loop, inside the loop check if the user id already exists in the set, if it does -> skip, otherwise proceed and add the user id to your set. If you need to sort your list first (be it user id or date), refer to this answer: https://stackoverflow.com/a/403426/4349415 — Mike Scotty, Jul 25 '18 at 08:47

Willem Van Onsem · Accepted Answer · 2018-07-25T08:58:45.957

With a query

You can make a query such that you obtain a QuerySet containing dictionaries. In that case every dictionary contains a 'user_id' key, and a 'first_timestamp' key, like:

from django.db.models import Min

# One dictionary per user, holding that user's earliest timestamp.
attendances = Attendance.objects.values('user_id').annotate(
    first_timestamp=Min('timestamp')
).order_by('user_id')

You can then enumerate the result, and print it like:

for row in attendances:
    print 'User ID   : {}'.format(row['user_id'])
    print 'Timestamp : {}'.format(row['first_timestamp'])

With a `set` that maintains the already seen users

If it is not possible to write such a query (for example, because you are given a list), we can sort by timestamp first, and then maintain a set of already seen user ids:

from operator import attrgetter

sorted_attendances = sorted(attendances, key=attrgetter('timestamp'))
seen_users = set()

for attendance in sorted_attendances:
    if attendance.user_id not in seen_users:
        seen_users.add(attendance.user_id)
        print 'User ID   : {}'.format(attendance.user_id)
        print 'Timestamp : {}'.format(attendance.timestamp)

This approach is typically more expensive, however, since the database has to transfer every row rather than one row per user, and Python then has to process all of them.
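If the list cannot be sorted up front, the earliest timestamp per user can also be tracked in a dictionary during a single pass. The following is only a minimal sketch: the `Attendance` namedtuple is a hypothetical stand-in for the real model (assumed to expose `user_id` and `timestamp`), `first_per_user` is an illustrative helper name, and the sample rows are a subset of the question's data, deliberately shuffled.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for the Django model; only the two fields
# used in the question are assumed to exist.
Attendance = namedtuple('Attendance', ['user_id', 'timestamp'])


def first_per_user(attendances):
    """Return {user_id: earliest timestamp} in one pass, without sorting."""
    first_seen = {}
    for attendance in attendances:
        current = first_seen.get(attendance.user_id)
        # Keep the timestamp if the user is new or this record is earlier.
        if current is None or attendance.timestamp < current:
            first_seen[attendance.user_id] = attendance.timestamp
    return first_seen


# Sample rows taken from the question, intentionally out of order.
attendances = [
    Attendance(11804, datetime(2018, 7, 18, 12, 23, 4)),
    Attendance(11804, datetime(2018, 7, 18, 12, 22, 55)),
    Attendance(2, datetime(2018, 7, 25, 16, 17, 20)),
    Attendance(2, datetime(2018, 7, 25, 16, 17, 18)),
]

result = first_per_user(attendances)
for user_id, timestamp in sorted(result.items()):
    print('User ID   : {}'.format(user_id))
    print('Timestamp : {}'.format(timestamp))
```

The single-argument `print(...)` form is used so the sketch runs unchanged on Python 2.7 and Python 3. Note that the users are printed in order of user id here, not in order of first appearance.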

You said `This approach is typically more expensive however`. Any idea how to make it more efficient? — Dicky Raambo, Jul 25 '18 at 09:43
No, in terms of complexity this is optimal. The problem is rather that, due to the nature of Python, it easily runs 100 to 10,000 times slower than a statically typed language like Haskell/C++. As a result it is advisable to do such processing not in Python but, for example, in the database, or through some interface to C++ algorithms (like with `numpy` or `pandas`; I think it might be doable in Pandas). — Willem Van Onsem, Jul 25 '18 at 09:46
