1

What is the fastest way to "trim" a dictionary based on the key? My understanding is that dictionaries preserve insertion order since Python 3.7.

I have a dictionary that maps keys (type datetime) to values (type float). The dictionary is in sorted (chronological) order.

time_series_dict = {
    "2019-02-27 14:00:00": 95,
    "2019-02-27 15:00:00": 98,
    "2019-02-27 16:25:00": 80,
    .............
    "2019-03-01 12:15:00": 85
}

I would like to trim the dictionary, removing everything outside of start_date and end_date. The dictionary can have thousands of entries. Is there a faster method than this:

for k in list(time_series_dict.keys()):
    if not start_date <= k <= end_date:
        del time_series_dict[k]
Borisw37
  • Maybe use a different data structure, like a binary search tree? Java has `TreeMap`, but AFAIK there's [nothing like that](https://stackoverflow.com/q/17857496/1639625) in the Python standard library. – tobias_k Feb 27 '19 at 14:52

2 Answers

3

If your dictionaries have thousands of keys, and you are removing keys only from the start and end of the ordered sequence of timestamps, consider using binary search to find the cut-off points in a list copy of the keys. Python includes the bisect module for this:

from bisect import bisect_left, bisect_right

def trim_time_series_dict(tsd, start_date, end_date):
    ts = list(tsd)
    before = bisect_left(ts, start_date)   # index of first key >= start_date
    after = bisect_right(ts, end_date)     # index of first key > end_date
    for i in range(before):                # delete keys < start_date
        del tsd[ts[i]]
    for i in range(after, len(ts)):        # delete keys > end_date
        del tsd[ts[i]]
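
For illustration, a quick usage sketch of the function above (the start and end timestamps below are made-up example values in the same string format as the keys, so lexicographic and chronological order coincide):

time_series_dict = {
    "2019-02-27 14:00:00": 95,
    "2019-02-27 15:00:00": 98,
    "2019-02-27 16:25:00": 80,
    "2019-03-01 12:15:00": 85,
}

# keep only keys with "2019-02-27 15:00:00" <= key <= "2019-03-01 00:00:00"
trim_time_series_dict(time_series_dict, "2019-02-27 15:00:00", "2019-03-01 00:00:00")
print(time_series_dict)
# {'2019-02-27 15:00:00': 98, '2019-02-27 16:25:00': 80}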

I've run some time trials to see if this is going to make a difference against your typical datasets; as expected, it pays off when the number of keys removed is significantly lower than the length of the input dictionary.

Time trial setup (imports, building the test data dictionary and start and end dates, defining the test functions):

>>> import random
>>> from bisect import bisect_left, bisect_right
>>> from datetime import datetime, timedelta
>>> from itertools import islice
>>> from timeit import Timer
>>> def randomised_ordered_timestamps():
...     date = datetime.now().replace(second=0, microsecond=0)
...     while True:
...         date += timedelta(minutes=random.randint(15, 360))
...         yield date.strftime('%Y-%m-%d %H:%M:%S')
...
>>> test_data = {ts: random.randint(50, 500) for ts in islice(randomised_ordered_timestamps(), 10000)}
>>> start_date = next(islice(test_data, 25, None))                 # trim 25 from the start
>>> end_date = next(islice(test_data, len(test_data) - 25, None))  # trim 25 from the end
>>> def iteration(t, start_date, end_date):
...     time_series_dict = t.copy()  # avoid mutating test data
...     for k in list(time_series_dict.keys()):
...         if not start_date <= k <= end_date:
...             del time_series_dict[k]
...
>>> def bisection(t, start_date, end_date):
...     tsd = t.copy()  # avoid mutating test data
...     ts = list(tsd)
...     before = bisect_left(ts, start_date)   # index of first key >= start_date
...     after = bisect_right(ts, end_date)     # index of first key > end_date
...     for i in range(before):                # delete keys < start_date
...         del tsd[ts[i]]
...     for i in range(after, len(ts)):        # delete keys > end_date
...         del tsd[ts[i]]
...

Trial outcome:

>>> count, total = Timer("t.copy()", "from __main__ import test_data as t").autorange()
>>> baseline = total / count
>>> for test in (iteration, bisection):
...     timer = Timer("test(t, s, e)", "from __main__ import test, test_data as t, start_date as s, end_date as e")
...     count, total = timer.autorange()
...     print(f"{test.__name__:>10}: {((total / count) - baseline) * 1000000:6.2f} microseconds")
...
 iteration: 671.33 microseconds
 bisection:  80.92 microseconds

(The test subtracts the base-line cost of making a dict copy first).

However, there may well be more efficient data structures for these kinds of operations. I checked out the sortedcontainers project as it includes a SortedDict() type that supports bisection on the keys directly. Unfortunately, while it performs better than your iteration approach, I can't make it perform better here than bisecting on a copy of the keys list:

>>> from sortedcontainers import SortedDict
>>> test_data_sorteddict = SortedDict(test_data)
>>> def sorteddict(t, start_date, end_date):
...     tsd = t.copy()
...     # SortedDict supports slicing on the key view
...     keys = tsd.keys()
...     del keys[:tsd.bisect_left(start_date)]   # drop keys < start_date
...     del keys[tsd.bisect_right(end_date):]    # drop keys > end_date
...
>>> count, total = Timer("t.copy()", "from __main__ import test_data_sorteddict as t").autorange()
>>> baseline = total / count
>>> timer = Timer("test(t, s, e)", "from __main__ import sorteddict as test, test_data_sorteddict as t, start_date as s, end_date as e")
>>> count, total = timer.autorange()
>>> print(f"sorteddict: {((total / count) - baseline) * 1000000:6.2f} microseconds")
sorteddict: 249.46 microseconds

I may be using the project wrong, however. Deleting keys from SortedDict objects is O(N log N), so I suspect that's where this falls down. Creating a new SortedDict() object from the other 9950 key-value pairs is slower still (over 2 milliseconds, not something you want to compare against the other approaches).
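
For reference, a rebuild along those lines looks roughly like this (a sketch, not necessarily the exact code behind the 2 millisecond figure above; it uses SortedDict.irange() to yield only the keys that fall inside the range):

def sorteddict_rebuild(t, start_date, end_date):
    # build a fresh SortedDict from the in-range keys only,
    # instead of deleting the out-of-range keys from a copy
    return SortedDict((k, t[k]) for k in t.irange(start_date, end_date))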

However, if you were to use the SortedDict.irange() method, you could simply ignore the out-of-range entries rather than delete them, and iterate over just the subset of dictionary keys you need:

for ts in timeseries.irange(start_date, end_date):
    # iterates over all start_date <= timestamp <= end_date keys, in order

eliminating the need to delete anything. The irange() implementation uses bisection under the hood.
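
Putting that together, a minimal sketch (assuming the data is loaded once into a SortedDict called timeseries, keyed by the same timestamp strings):

from sortedcontainers import SortedDict

timeseries = SortedDict(time_series_dict)  # built once; stays sorted on insertion

# collect the values between start_date and end_date without deleting anything
window = [timeseries[ts] for ts in timeseries.irange(start_date, end_date)]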

Martijn Pieters
  • With `list(tsd)` won't you have the same O(n) as with just iterating and checking all the keys? Granted: You have to do _much_ fewer actual comparisons. – tobias_k Feb 27 '19 at 15:24
  • @tobias_k: yes, `list(tsd)` scales linearly with the size of the dictionary, but it is still worth avoiding the added cost of O(n) comparisons. It's the constant costs that make the difference here. – Martijn Pieters Feb 27 '19 at 15:28
  • mine was about 32% slower ;) – Patrick Artner Feb 27 '19 at 15:38
-1
import timeit

print(timeit.timeit(setup="""import datetime
time_series_dict = {}
for i in range(10000):
    t = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i == 100:
        start_time = t
    if i == 900:
        end_time = t
        """,
stmt="""
tmp = time_series_dict.copy()
for k in list(tmp.keys()):
    if not start_time <= k <= end_time:
        del tmp[k]

""",
number=10000
))
print(timeit.timeit(setup="""import datetime
time_series_dict = {}
for i in range(10000):
    t = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i == 100:
        start_time = t
    if i == 900:
        end_time = t
""",
stmt="""
tmp = time_series_dict.copy()
result = {}
for k in list(tmp.keys()):
    if start_time <= k <= end_time:
        result[k] = tmp[k]
""",
number=10000
))
print(timeit.timeit(setup="""
import datetime
from bisect import bisect_left, bisect_right

time_series_dict = {}
for i in range(10000):
    t = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i == 100:
        start_time = t
    if i == 900:
        end_time = t

""",
stmt="""
tmp = time_series_dict.copy()
def trim_time_series_dict(tsd, start_date, end_date):
    ts = list(tsd)
    before = bisect_left(ts, start_date)   # index of first key >= start_date
    after = bisect_right(ts, end_date)     # index of first key > end_date
    for i in range(before):                # delete keys < start_date
        del tsd[ts[i]]
    for i in range(after, len(ts)):        # delete keys > end_date
        del tsd[ts[i]]

trim_time_series_dict(tmp, start_time, end_time)
""",
number=10000
))

Test result:

12.558672609
9.662761111
7.990544049
Leo_Liu_MJ
  • Please use the `timeit` module to do time trials; it takes care of repeating, picks the best clock to measure how long the operations take, and it also disables the GC to minimise, where possible, processes that can interfere with taking measurements. – Martijn Pieters Feb 27 '19 at 15:34
  • By not repeating your tests, you've made them highly susceptible to other processes on your computer taking CPU time away just at the wrong moment. You can't trust your timings at all now. I highly doubt that creating a *new* dictionary is 25% faster, for example. It is rather trivial to lose 470 microseconds somewhere else on your system, from time to time. – Martijn Pieters Feb 27 '19 at 15:35
  • “Teach others is the best way to learn”. Thanks for your help, I have updated the code using `timeit` and the result is amazing. Did I do something wrong? – Leo_Liu_MJ Feb 27 '19 at 16:05
  • Yes, you manipulate `tmp` (your copy) directly still, then assign the empty `result` dict to `tmp`. `tmp[k] = tmp[k]` is 'fast' because no new space has to be allocated. You want to assign with `result[k] = tmp[k]` in the loop. – Martijn Pieters Feb 27 '19 at 16:28
  • The bisection test differs in that you added an extra function call into the mix, where the first two time trials do not make such function calls to achieve their work. – Martijn Pieters Feb 27 '19 at 16:30