
I have a list of dates and the goal is to count the occurrences of each date while maintaining the order in which they appear in the original list. Consider the following example:

The list only_dates looks like this:

[datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]

I am trying to use groupby:

import itertools
day_wise_counts = [(k, len(list(g))) for k, g in itertools.groupby(only_dates)]
print(str(day_wise_counts))

This prints

[(datetime.date(2017, 3, 10), 1), (datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 1), (datetime.date(2017, 3, 11), 1)]

I understand this is happening because itertools.groupby only groups consecutive equal elements, so duplicate dates that are not adjacent in my actual list end up in separate groups.
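
To illustrate (a minimal toy example rather than my real data), groupby only merges runs of adjacent equal items:

import itertools

# non-adjacent duplicates stay in separate groups
print([(k, len(list(g))) for k, g in itertools.groupby([1, 2, 1])])
# [(1, 1), (2, 1), (1, 1)]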

I was expecting the output to be:

[(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]

I am not necessarily looking for a list of tuples. A dictionary output will also suffice as long as the original order of dates is maintained. (OrderedDict maybe).

How can I achieve this?

Update: Multiple approaches have been suggested, and they all work well. But I should have mentioned that I'll be doing this operation for a large amount of data, so it would be great if your solution is optimal in terms of running time. Please edit your answer/comment accordingly if you can.

Update 2: The data can be as large as 1 million rows.

Ravindra S
  • If you're using python-2.x you could check out this question on how to create an ordered Counter: https://stackoverflow.com/questions/35446015/creating-an-ordered-counter. Unfortunately that doesn't work in python-3.x anymore (except for 3.6, where `dict` keeps its order by default). – MSeifert Jun 12 '17 at 14:08
  • What sizes (and roughly what percentage of duplicates) are we talking about when you say "I'll be doing this operation for a large amount of data"? – MSeifert Jun 12 '17 at 14:10
  • Possible duplicate of [how to get count dict of items but maintain the order in which they appear?](https://stackoverflow.com/questions/23747564/how-to-get-count-dict-of-items-but-maintain-the-order-in-which-they-appear) – Chris_Rands Jun 12 '17 at 14:11
  • @MSeifert added an update to the question. – Ravindra S Jun 12 '17 at 14:13
  • @Chris_Rands it does not address the performance requirement. – Ravindra S Jun 12 '17 at 14:18
  • @PaleBlueDot It's still a dupe, but the answers here are better. If anything, that question could be linked to this one; I'll leave the moderators to decide. – Chris_Rands Jun 12 '17 at 15:02

3 Answers


Indeed, you could use an OrderedDict:

from collections import OrderedDict
import datetime

inp = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10),
       datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]

odct = OrderedDict()
for item in inp:
    try:
        odct[item] += 1
    except KeyError:
        odct[item] = 1

print(odct)

which prints:

OrderedDict([(datetime.date(2017, 3, 9), 1),
             (datetime.date(2017, 3, 10), 2),
             (datetime.date(2017, 3, 11), 1)])
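
A variant of the same loop using dict.get instead of try/except (just a sketch; the counting is identical and the OrderedDict still records first-seen order):

odct = OrderedDict()
for item in inp:
    # current count (0 if the date hasn't been seen yet) plus one
    odct[item] = odct.get(item, 0) + 1

The try/except version tends to be faster when most items are duplicates, because the KeyError is only raised on the first occurrence of each date, so it is the one used in the timings below.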

You also asked for timings, so here they are:

from collections import OrderedDict, Counter
import datetime
import random

# Functions

def ordereddict(inp):
    odct = OrderedDict()
    for item in inp:
        try:
            odct[item] += 1
        except KeyError:
            odct[item] = 1
    return odct


def dawg(inp):
    cnts=Counter(inp)
    seen=set()
    return [(e, cnts[e]) for e in inp if not (e in seen or seen.add(e))]


def chris1(inp):
    return [(item, inp.count(item)) for item in list(OrderedDict.fromkeys(inp))]


def chris2(inp):
    c = Counter(inp)
    return [(item,c[item]) for item in list(OrderedDict.fromkeys(inp))]


# Taken from answer: https://stackoverflow.com/a/23747652/5393381
class OrderedCounter(Counter, OrderedDict):  
    'Counter that remembers the order elements are first encountered'

    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))

    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)


# Timing setup
timings = {ordereddict: [], dawg: [], chris1: [], chris2: [], OrderedCounter: []}
sizes = [2**i for i in range(1, 20)]

# Timing
for size in sizes:
    func_input = [datetime.date(2017, random.randint(1, 12), random.randint(1, 28)) for _ in range(size)]
    for func in timings:
        res = %timeit -o func(func_input)   # if you use IPython, otherwise use the "timeit" module
        timings[func].append(res)

and plotted:

%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(1)
ax = plt.subplot(111)

for func in timings:
    ax.plot([2**i for i in range(1, 20)], 
            [time.best for time in timings[func]], 
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()

[Benchmark plot: log-log timings of the five approaches against list size]

I timed it on Python 3.5. The Counter-based approaches will likely be a bit slower on python-2.x (Counter was optimized for python-3.x). Also, the chris2 and dawg approaches overlap each other (because there is almost no time difference between them).

So except for the first approach of @Chris_Rands and the OrderedCounter, the approaches perform very similarly and mostly depend on the number of duplicates in your list.

It's mostly a factor of 1.5-2 difference. I couldn't find any real time difference for 1 million items between the 3 "fast" approaches.

MSeifert
  • Nice benchmark! How about the `OrderedCounter`? https://stackoverflow.com/questions/23747564/how-to-get-count-dict-of-items-but-maintain-the-order-in-which-they-appear/23747652#23747652 – Chris_Rands Jun 12 '17 at 14:50
  • @Chris_Rands I updated the answer. But it seems slower. – MSeifert Jun 12 '17 at 14:58
  • Try `sorted([(k,v) for k,v in Counter(dates).items()], key=lambda t: dates.index(t[0]))` I think that may be the fastest... – dawg Jun 12 '17 at 15:05
  • @dawg I could measure it but sorting by the `index` introduces a tight correlation to the order of the input (that's a parameter that all other approaches avoided). In the best case it could be amazingly fast but in the worst case it's slow (`O(n**2)`). For example: `l = [datetime.date(2017, random.randint(1, 12), random.randint(1, 28)) for _ in range(2**19)]` then `%timeit dawg2(l)` gives `157 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)` but `l2 = sorted(l)` and `%timeit dawg2(l2)` gives `13.9 s ± 143 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`. – MSeifert Jun 12 '17 at 15:17
  • So it could be 2-3 times faster but it could also be 50-100 times slower (compared to the other "fast" approaches). – MSeifert Jun 12 '17 at 15:21
  • Fair enough. Thanks. – dawg Jun 12 '17 at 15:24

You could use list.count() with a list comprehension, iterating over a list derived from an OrderedDict of the unique dates in their original order:

import datetime
from collections import OrderedDict

lst = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]

[(item,lst.count(item)) for item in list(OrderedDict.fromkeys(lst))]
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]

Or similarly using a collections.Counter instead of list.count:

from collections import Counter

c = Counter(lst)

[(item,c[item]) for item in list(OrderedDict.fromkeys(lst))]
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]

Or use an OrderedCounter.
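
As a side note, on Python 3.7+ (where dict, and therefore Counter, preserves insertion order) the Counter built above already holds the dates in first-seen order, so no extra OrderedDict pass is needed:

# assumes Python 3.7+: Counter keys come out in first-seen order
list(c.items())
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]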

EDIT: see the excellent benchmark by @MSeifert.

Chris_Rands

You can use a Counter to do the counting, then de-duplicate the original list to preserve the order while attaching each count.

Given:

>>> dates=[datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]  

You can do:

>>> from collections import Counter
>>> cnts = Counter(dates)
>>> seen = set()
>>> [(e, cnts[e]) for e in dates if not (e in seen or seen.add(e))]
[(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]

Update

You can also sort a Counter back into the order of the original list by using a key function to get the index of the first entry of date(X) in that list:

sorted([(k,v) for k,v in Counter(dates).items()], key=lambda t: dates.index(t[0])) 

(The speed of this is correlated to how ordered or unordered your list is...)
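
If the repeated dates.index lookups become a bottleneck (each call rescans the list from the start, which can approach quadratic time when there are many distinct dates), a possible variation is to precompute each date's first position in one pass and sort on that. This is only a sketch, with an illustrative function name:

from collections import Counter

def dawg3(dates):
    # one pass to remember where each date first appears
    first_pos = {}
    for i, d in enumerate(dates):
        first_pos.setdefault(d, i)
    # sort the counted items by that precomputed position (no repeated list scans)
    return sorted(Counter(dates).items(), key=lambda kv: first_pos[kv[0]])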


Did someone say timeit!

Here are some timings with a larger example (400,000 dates):

from __future__ import print_function
import datetime
from collections import Counter
from collections import OrderedDict

def dawg1(dates):
    seen=set()
    cnts=Counter(dates)
    return [(e, cnts[e]) for e in dates if not (e in seen or seen.add(e))]

def od_(dates):    
    odct = OrderedDict()
    for item in dates:
        try:
            odct[item] += 1
        except KeyError:
            odct[item] = 1
    return odct

def lc_(lst):
    return [(item,lst.count(item)) for item in list(OrderedDict.fromkeys(lst))]    

def dawg2(dates):
    return sorted([(k,v) for k,v in Counter(dates).items()], key=lambda t: dates.index(t[0]))    

if __name__=='__main__':
    import timeit  
    dates=[datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]*100000
    for f in (dawg1, od_, lc_, dawg2):
        print("   {:^10s}{:.4f} secs {}".format(f.__name__, timeit.timeit("f(dates)", setup="from __main__ import f, dates", number=100),f(dates))) 

Prints (on Python 2.7):

 dawg1   10.7253 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
  od_    21.8186 secs OrderedDict([(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)])
  lc_    17.0879 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
 dawg2   8.6058 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]

PyPy:

 dawg1   7.1483 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
  od_    4.7551 secs OrderedDict([(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)])
  lc_    27.8438 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
 dawg2   4.7673 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]

Python 3.6:

 dawg1   3.4944 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
  od_    4.6541 secs OrderedDict([(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)])
  lc_    2.7440 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
 dawg2   2.1330 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]

Best.

dawg
  • Looks good. Please see my update to the question regarding performance. +1 – Ravindra S Jun 12 '17 at 14:14
  • Upvoted because you've put a lot into this solution, but I don't think using `sorted` is good because it may not necessarily preserve the original order. Also your benchmark is useful because it explores different Python versions, but MSeifert explores more parameter space. – Chris_Rands Jun 12 '17 at 15:08
  • @Chris_Rands: Thanks. The `sorted` version is using the index of the original list so how would it not preserve original order? It will be just as predictable as using any other method to preserve order. – dawg Jun 12 '17 at 15:12
  • @dawg Apologies I missed your custom sort key! – Chris_Rands Jun 12 '17 at 15:14