3

please help me find a solution that doesn't call for tons of loops. I have a list of timestamps, e.g.

["2014-04-11 08:00:00.000000",
 "2014-04-11 09:35:00.000000",
 "2014-04-11 09:35:00.000000",
 "2014-04-11 09:40:00.000000",
 "2014-04-11 11:00:00.000000",
 ...]

I would like to 'merge' timestamps in the list such that those within a common window (e.g. 10 minutes) of eachother become just one entry. So the above example list would become

["2014-04-11 08:00:00.000000",
 "2014-04-11 09:35:00.000000",
 "2014-04-11 11:00:00.000000",
 ...]

Notice also the three timestamps that merged did so to the "9:35" value rather than the "9:40". I would like merging timestamps to go to the most frequent entry. If there's a tie, merge on the earlier/most-frequent timestamp.

And I'm also trying to keep track of how many timestamps get merged. So a list keeping the count would be [1,3,1,...] for the above example.

BoltzmannBrain
  • 5,082
  • 11
  • 46
  • 79
  • Do you have something that partially works? – Simeon Visser Jan 21 '15 at 00:21
  • Also, what defines a common window? Can one create consecutive 10-minute windows and see if the timestamps fit in there or do the time windows have to be derived from the data? – Simeon Visser Jan 21 '15 at 00:33
  • @SimeonVisser the window for a given timestamp is defined as the 10-minute interval around it, where there are 5min before and 5min after the timestamp. – BoltzmannBrain Jan 21 '15 at 00:40
  • The partially working code is long and inefficient and confusing... it loops through the enumeration of the list of timestamps, during which it i. creates a list of timestamps that makeup the 10-min window ii. loops through the original list to count how many times each entry appears in the list from (i) iii. loops again through the original list, checking the difference between timestamps. If the difference is =< 5minutes, then the earlier timestamp is added to a new list (and in parallel another list keeps track of how many timestamps get merged into one entry in the new list). – BoltzmannBrain Jan 21 '15 at 00:43

4 Answers4

2

This can be solved as follows:

import datetime

data = ["2014-04-11 08:00:00.000000", "2014-04-11 09:35:00.000000", "2014-04-11 09:35:00.000000", "2014-04-11 09:40:00.000000", "2014-04-11 11:00:00.000000"]

delta = datetime.timedelta(minutes=10)
result = []
bucket = []
current = None
for item in data:
    datetime_obj = datetime.datetime.strptime(item, '%Y-%m-%d %H:%S:%M.%f')
    if current is None:
        current = datetime_obj
        bucket = [current]
        continue
    if (datetime_obj - current) <= delta:
        bucket.append(datetime_obj)
    else:
        result.append(bucket)
        current = datetime_obj
        bucket = [current]

if bucket:
    result.append(bucket)

for bucket in result:
    print(bucket)

Example:

>>> for bucket in result:
...     print(bucket)
...
[datetime.datetime(2014, 4, 11, 8, 0)]
[datetime.datetime(2014, 4, 11, 9, 0, 35), datetime.datetime(2014, 4, 11, 9, 0, 40)]
[datetime.datetime(2014, 4, 11, 11, 0)]

This result data structure can be used to compute the desired values: each of the timestamps that identifies the window and the number of timestamps available ("consumed") to create that window.

Simeon Visser
  • 118,920
  • 18
  • 185
  • 180
  • This did it, and is quite clean, thank you! I made a couple adjustments: (i) sort the timestamps with `datetime_obj.sort()`, and (ii) merge the buckets with `merged.append(max(bucket, key=bucket.count))` – BoltzmannBrain Jan 22 '15 at 08:11
1

Assuming the timestamps are sorted, what about...:

import datetime

def merged_ts(timestamps):
    merged_strings = []
    counts = []
    for ts in timestamps:
        dt = datetime.datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f')
        if not merged_strings:  # first-time switch
            merged_string.append(ts)
            counts.append(1)
            oldt = dt
            continue
        dif = dt - oldt
        if dif.total_seconds < 300:  # 5 minutes
            counts[-1] += 1
            continue
        merged_string.append(ts)
        counts.append(1)
        oldt = dt
    return merged_strings, counts

Added: OP specified that the timestamps are not originally sorted (but may be sorted for the purpose), and if the timestamps were T, T+4minutes, T+8minutes, T+12minutes, etc, they'd have to merge into a single slot (w/appropriate count). Little change needed for this version, then...:

import datetime

def merged_ts(timestamps):
    merged_strings = []
    counts = []
    for ts in sorted(timestamps):
        dt = datetime.datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f')
        if not merged_strings:  # first-time switch
            merged_string.append(ts)
            counts.append(1)
            oldt = dt
            continue
        dif = dt - oldt
        oldt = dt
        if dif.total_seconds < 300:  # 5 minutes
            counts[-1] += 1
        else:
            merged_string.append(ts)
            counts.append(1)
    return merged_strings, counts

I just added a sorted call, and moved the oldt = dt up to where it happens on every leg of the loop (except on the first-time switch) -- so that each new incoming ts will be checked vs the "closest" (newest) datestamp in the current bucket, rather than vs the "first" (oldest) one as before. (Only as a matter of style, I changed the conditional at the end to if/else rather than use a continue there, as the two legs of the conditional are now well-balanced).

First-time switches are goofy, but removing this one (without repeating the strptime takes slightly subtler code, such as:

if not timestamps: return [], []
it = iter(sorted(
    (ts,datetime.datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f'))
    for ts in timestamps))
first = next(it)
merged_strings = [first[1]]
oldt = first[0]
counts = [1]
for ts, st in it:
    dif = dt - oldt
    oldt = dt
    if dif.total_seconds < 300:  # 5 minutes
        counts[-1] += 1
    else:
        merged_string.append(ts)
        counts.append(1)
return merged_strings, counts

The version with the first-time switch seems preferable to me, in this case, purely on stylistic grounds.

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
  • The timestrings are not necessarily sorted, but I think this can be solved at the start with `timestamps.sort()`, or another method from [here](http://stackoverflow.com/questions/5166842/sort-dates-in-python-array). To be more clear on the behavior, timestamps of [T, T+4, T+8, T+12] should merge to become [T] where the count = 4. But for timestamps of [T, T+4, T+10, T+12], the result would be [T, T+10] with counts = [2, 2]. I hope that helps :) – BoltzmannBrain Jan 21 '15 at 14:53
  • OK, slightly different problem, editing to solve it. – Alex Martelli Jan 21 '15 at 15:03
0

If you haven't looked at it yet, converting this into a pandas DataFrame would prove efficient in this case.

One way to do it IMO is to make a dataframe with these timestamps as the index and the count as a column. Then, by looping over it once, you can drop those rows that are in the same common window (using datetime.timedelta or numpy.timedelta64) and updating the value of the column count for that row.

Having more information would help in providing a more detailed answer. For example, is your list sorted, and if not does it have to keep the same order it had before merging? (from your example it seems it is already sorted)

oxtay
  • 3,990
  • 6
  • 30
  • 43
0

You could use groupby with a special key that "switches" between groups. First prepare the data:

from itertools import groupby
from datetime import datetime

l = ["2014-04-11 08:00:00.000000",
 "2014-04-11 09:35:00.000000",
 "2014-04-11 09:35:00.000000",
 "2014-04-11 09:40:00.000000",
 "2014-04-11 11:00:00.000000"]
l = map(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S.%f"), l)

Now you can do:

class grouper():
    def __call__(self, d):
        if not hasattr(self, 'prev'):    # first element: init switch
            self.switch = 1
        elif (d - self.prev).total_seconds() > 10*60:    # 10min
            self.switch *= -1
        self.prev = d                    # save current value
        return self.switch


def most_common(group):
    lst = list(group)
    return lst[0]   # choose the first element in the group

>>> [most_common(g) for k, g in groupby(l, key = grouper())]
[datetime.datetime(2014, 4, 11, 8, 0),
 datetime.datetime(2014, 4, 11, 9, 35),
 datetime.datetime(2014, 4, 11, 11, 0)]

You can adjust the most_common function to fit your criteria.

elyase
  • 39,479
  • 12
  • 112
  • 119