Clustering / Grouping a list based on time (python)

Question

I have a list of lists that I want to group into separate lists based on clusters of time.

I can easily sort it based on the time, but I have not determined an easy way to group it together. I am fine with it being datetime / time format or text, either one works for me. I need to process the other data based on the cluster. This is a sample dataset that I might be working with.

[['asdf', '2012-01-01 00:00:12', '1234'],
 ['asdf', '2012-01-01 00:00:31', '1235'],
 ['asdf', '2012-01-01 00:00:57', '2345'],
 ['asdf', '2012-01-01 00:01:19', '2346'],
 ['asdf', '2012-01-01 00:01:25', '2345'],
 ['asdf', '2012-01-01 09:04:14', '3465'],
 ['asdf', '2012-01-01 09:04:34', '1613'],
 ['asdf', '2012-01-01 09:04:51', '8636'],
 ['asdf', '2012-01-01 09:05:15', '5847'],
 ['asdf', '2012-01-01 09:05:29', '3672'],
 ['asdf', '2012-01-01 09:05:30', '2367'],
 ['asdf', '2012-01-01 09:05:43', '9544'],
 ['asdf', '2012-01-01 14:48:15', '2572'],
 ['asdf', '2012-01-01 14:48:34', '7483'],
 ['asdf', '2012-01-01 14:48:56', '5782']]

The results should look something like this. A nested list of lists for each group.

[[['asdf', '2012-01-01 00:00:12', '1234'],
  ['asdf', '2012-01-01 00:00:31', '1235'],
  ['asdf', '2012-01-01 00:00:57', '2345'],
  ['asdf', '2012-01-01 00:01:19', '2346'],
  ['asdf', '2012-01-01 00:01:25', '2345']],
 [['asdf', '2012-01-01 09:04:14', '3465'],
  ['asdf', '2012-01-01 09:04:34', '1613'],
  ['asdf', '2012-01-01 09:04:51', '8636'],
  ['asdf', '2012-01-01 09:05:15', '5847'],
  ['asdf', '2012-01-01 09:05:29', '3672'],
  ['asdf', '2012-01-01 09:05:30', '2367'],
  ['asdf', '2012-01-01 09:05:43', '9544']],
 [['asdf', '2012-01-01 14:48:15', '2572'],
  ['asdf', '2012-01-01 14:48:34', '7483'],
  ['asdf', '2012-01-01 14:48:56', '5782']]]

The clusters are of no set size, and no set times. They can occur randomly throughout the day, and will need to cluster based on a large gap in the time.

The first group happens right after midnight and has 5 entries, the next one is centered around 09:05 and has 7 entries. The final one happens about 14:48 and only has 3 entries. I could also have two groups at either end of the hour as well, so I can not just group by the hour.

I have already sorted and grouped the data by the first field in the list; I just need to break it down into smaller chunks to process. I am willing to change the date to whatever format is necessary to get the grouping done, as it is a key part of the analysis I am doing on the data.

I would prefer to keep the solution within the basic python libraries, but if there is no solution I can attempt to get other packages.

I have already looked at solutions here, here, here, here, and many others but none of them address the random nature of these times.

Splitting the list at any gap greater than X time would be a great solution, so I can change X to 5 or 10 minutes, whatever is deemed appropriate. Dropping any group that has length less than 3 would also be a bonus, but can easily be done at the end.

My only real idea right now is to loop through the list compare the current time with the new time and split the list that way, but it seems like a very inefficient way of solving this problem when there are millions of records to sort and group.

Any help would be greatly appreciated. If any of this doesn't make sense I will do my best to clarify.

It appears you're asking for a grouping of buckets per hour. This style would return the results you're expecting. It's also the exact method displayed in the first SO link you posted: http://stackoverflow.com/questions/2344639/python-group-results-by-time-intervals. — VooDooNOFX, Nov 21 '13 at 04:13
Could you please explain `I could also have two groups at either end of the hour as well, so I can not just group by the hour.`? — thefourtheye, Nov 21 '13 at 04:43
@thefourtheye: [OP might mean something like that](https://gist.github.com/zed/76e94b0b2b55d3be536b) (it is inefficient but it should produce the intended result). — jfs, Nov 21 '13 at 05:08
Your example has the data sorted by time - is your real data sorted by time? — wwii, Nov 21 '13 at 05:36
@VooDooNOFX & thefourtheye No, I am not trying to group by hour because I could have two separate groups within one hour. Meaning 00:00:10,20,30 and 00:59:00,10,20 due to the large gap they need to be separate groups. I could also have 00:59:50, 01:00:00, and 01:00:10 which would be one group. — eseglem, Nov 21 '13 at 11:50
@wwii My data does come in chronological order but I have done a data.sort() on the date time field in order to guarantee that it is sorted by this point in time. — eseglem, Nov 21 '13 at 11:51
Your script cannot guess what you mean by an arbitrarily large or small grouping timeframe. You must specify the maximum amount by which two times can differ and still be counted as part of the same "grouping". Some answers use an hour for this, but perhaps you mean a minute, tens of minutes, or some other arbitrary interval? — VooDooNOFX, Nov 21 '13 at 11:53

DSM · Accepted Answer · 2013-11-21T04:56:39.037

10

If we split at time differences beyond some limit, then something like

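import datetime

# turn strings into datetimes (note: this modifies the rows of data in place)
date_format = "%Y-%m-%d %H:%M:%S"
for row in data:
    row[1] = datetime.datetime.strptime(row[1], date_format)

# a new group starts wherever consecutive rows are at least split_dt apart
split_dt = datetime.timedelta(minutes=5)
dts = (d1[1]-d0[1] for d0, d1 in zip(data, data[1:]))
# split_at holds the index of the first row of each new group
split_at = [i for i, dt in enumerate(dts, 1) if dt >= split_dt]
groups = [data[i:j] for i, j in zip([0]+split_at, split_at+[None])]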

might work. (Beware of fencepost errors, though.. I make them too easily!)
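As a rough sketch of the same slicing idea wrapped in a function, the version below parses the timestamps into a separate list so the caller's rows keep their original strings, and it filters out small groups at the end. The names `split_on_gaps`, `max_gap` and `min_size` are made up for this illustration, and the input is assumed to be sorted by time already.

```python
import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

def split_on_gaps(rows, max_gap, min_size=1):
    """Split time-sorted rows wherever consecutive timestamps differ by >= max_gap."""
    # Parse into a separate list so the caller's rows are left untouched.
    times = [datetime.datetime.strptime(row[1], DATE_FORMAT) for row in rows]
    # Index of the first row of every new group.
    split_at = [i for i, (t0, t1) in enumerate(zip(times, times[1:]), 1)
                if t1 - t0 >= max_gap]
    groups = [rows[i:j] for i, j in zip([0] + split_at, split_at + [None])]
    # Filtering afterwards keeps the construction and the size check separate.
    return [g for g in groups if len(g) >= min_size]

rows = [
    ['asdf', '2012-01-01 00:00:12', '1234'],
    ['asdf', '2012-01-01 00:00:31', '1235'],
    ['asdf', '2012-01-01 00:00:57', '2345'],
    ['asdf', '2012-01-01 09:04:14', '3465'],
    ['asdf', '2012-01-01 09:04:34', '1613'],
    ['asdf', '2012-01-01 14:48:15', '2572'],
]
groups = split_on_gaps(rows, datetime.timedelta(minutes=5))
print([len(g) for g in groups])  # [3, 2, 1]
```

Passing `min_size=3` would keep only the first group of this shortened sample.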

edited Nov 21 '13 at 04:56

answered Nov 21 '13 at 04:51


This appears to do exactly what I was looking for. I can even add "if len(data[i:j]) > 2" in the final list comprehension to eliminate groups that are too small. I can't upvote the answer, but I have accepted it. – eseglem Nov 21 '13 at 14:39
@DSM - I put your script in a function, [http://dpaste.com/1477229/](http://dpaste.com/1477229/), and it is modifying the global list. I can't figure out why this is happening. I'd like to understand this, any ideas? – wwii Nov 21 '13 at 17:00
looks like the "if len(data[i:j]) > 2" doesn't work after all, but "groups = [g for g in groups if len(g) > 2]" cleans it up – eseglem Nov 21 '13 at 17:03
@wwii: `data = data[:]` in your code only makes a shallow copy. @eseglem: `[data[i:j] for i, j in zip([0]+split_at, split_at+[None]) if len(data[i:j]) > 5]` should work, despite the ugly duplication. I think it's cleaner to do what you did, though, and separate the construction of the clusters from the filtering on size. – DSM Nov 21 '13 at 17:08
@wwii isn't it because you are passing a reference to DATA as data, and then modifying data which in turn just points back to the global? ... not the greatest wording but I believe data is a reference to DATA not a copy of it – eseglem Nov 21 '13 at 17:09
@DSM it appears to just be returning an empty list if I do that. The syntax might just be off a little. The other way works and, as you said, it seems to be a bit cleaner. Thanks for all the help. – eseglem Nov 21 '13 at 17:14
@DSM - Thnx, I knew that ```data = [[a, datetime.datetime.strptime(b, date_format), c] for a,b,c in data]``` – wwii Nov 21 '13 at 17:42
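The shallow-copy behaviour discussed in these comments can be reproduced in a few lines. This is only an illustration with a one-row list; the variable names are invented for the example.

```python
import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
original = [['asdf', '2012-01-01 00:00:12', '1234']]

# A slice copies only the outer list; the inner row lists are shared.
shallow = original[:]
print(shallow is original)        # False
print(shallow[0] is original[0])  # True

# Building brand-new rows, as in the last comment, leaves the original alone.
rebuilt = [[a, datetime.datetime.strptime(b, DATE_FORMAT), c]
           for a, b, c in original]
print(type(original[0][1]).__name__)  # str

# Assigning through the shallow copy changes the shared row, and therefore
# the original list as well.
shallow[0][1] = datetime.datetime.strptime(shallow[0][1], DATE_FORMAT)
print(type(original[0][1]).__name__)  # datetime
```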

score 3 · Answer 2 · answered Nov 21 '13 at 06:07

I'm not going to solve your problem, but I'll try to make you feel better about what you already know ;-)

Forget all the details of your problem and think about a list of plain integers instead. Say you want to break it into groups via gaps of at least 5. Here's the list:

[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, ...]

Oops! Every element is in its own group then, and there's simply no way to know that without comparing every adjacent pair of elements. Think about it. So:

> My only real idea right now is to loop through the list compare the current time with the new time and split the list that way, but it seems like a very inefficient way of solving this problem when there are millions of records to sort and group.

In the example above, that's the best that can be done! It takes time linear in the number of elements, which is rarely considered "very inefficient".
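For the integer version of the problem, the straightforward linear pass might look like the sketch below. The function name `split_by_gap` is hypothetical, and a new group is started when the difference to the previous element is at least `gap`.

```python
def split_by_gap(values, gap):
    """Group sorted values, starting a new group at every difference >= gap."""
    groups = []
    for value in values:
        # Compare with the last element of the current group.
        if groups and value - groups[-1][-1] < gap:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups

print(split_by_gap([10, 20, 30, 40], 5))      # [[10], [20], [30], [40]]
print(split_by_gap([1, 2, 3, 50, 51, 200], 5))  # [[1, 2, 3], [50, 51], [200]]
```

Each element is looked at once, so the work grows linearly with the length of the list.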

Now in some cases it's certainly possible to do better. Let's change the list above to:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...]

Again with gap 5, there's only one group in total. Can that be discovered using less than a number of compares proportional to the length of the list? Maybe, using variants of binary search, it would be possible to discover that using a number of compares proportional to the logarithm of the length of the list. But details are everything here and they're tricky. So tricky that I dread adapting them to your messier problem.

And, in the end, unless you have very large groups, I expect it would actually be slower than doing an obvious thing! DSM's answer uses efficient and more-or-less straightforward Python idiom; a complex algorithm that needs to keep track of many little details generally runs slower (even if it has far better theoretical O() behavior) unless applied to very favorable cases.

So be happy with a straightforward loop you understand at a glance :-)
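To make the "fewer compares" idea concrete, here is one possible divide-and-conquer sketch. It is not taken from this answer and is only one of several ways the idea could be approached. It relies on the list being sorted: if the last and first elements of a range differ by less than the gap, no adjacent pair inside that range can differ by the gap either, so the whole range can be skipped. The names `split_points` and `search` are hypothetical.

```python
def split_points(values, gap):
    """Return indices i where values[i] - values[i-1] >= gap (values sorted)."""
    found = []

    def search(lo, hi):
        if hi <= lo:
            return
        # Sorted input: a small total spread rules out any large adjacent gap.
        if values[hi] - values[lo] < gap:
            return
        if hi - lo == 1:
            found.append(hi)  # adjacent pair with a difference >= gap
            return
        mid = (lo + hi) // 2
        search(lo, mid)
        search(mid, hi)

    search(0, len(values) - 1)
    return found

print(split_points([1, 2, 3, 50, 51, 52, 200], 5))  # [3, 6]
print(split_points(list(range(1, 13)), 5))          # []
```

This only saves work when each cluster spans less than the gap in total. On a long evenly spaced run such as 1, 2, 3, ... it still performs a number of comparisons proportional to the length of the list, which fits the caution above that the obvious loop is usually the better choice.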

wwii · Answer 3 · 2013-11-21T06:18:25.563

> ... loop through the list compare the current time with the new time and split the list that way

Seems like that's the way to do it. Using itertools.groupby() (J. F. Sebastian's comment) might scale better, but this seems to compete using the 15 rows provided.

import datetime

def grp(data, dHours, dMinutes, dSeconds):
    # start a new group whenever the gap to the previous row exceeds delta

    delta = datetime.timedelta(hours = dHours, minutes = dMinutes, seconds = dSeconds)
    final = list()
    tmp = list()
    date_format = "%Y-%m-%d %H:%M:%S"

    # assumes data is non-empty and already sorted by time
    tmp.append(data[0])
    previous = datetime.datetime.strptime(data[0][1], date_format)

    for row in data[1:]:
        dt = datetime.datetime.strptime(row[1], date_format)
        if dt - previous > delta:
            #if len(tmp) > 2:
            final.append(tmp)
            tmp = list()
        tmp.append(row)
        previous = dt

    # the last group is still pending when the loop ends
    final.append(tmp)
    return final
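With millions of records it may be convenient to produce the groups lazily. The following is a sketch of the same loop written as a generator, under the assumption that the rows arrive sorted by time; `iter_groups` and `max_gap` are names chosen for this example only.

```python
import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

def iter_groups(rows, max_gap):
    """Yield lists of consecutive rows whose timestamps are within max_gap."""
    group = []
    previous = None
    for row in rows:
        current = datetime.datetime.strptime(row[1], DATE_FORMAT)
        # A gap larger than max_gap closes the current group.
        if previous is not None and current - previous > max_gap:
            yield group
            group = []
        group.append(row)
        previous = current
    if group:  # emit the final group; nothing is yielded for empty input
        yield group

rows = [
    ['asdf', '2012-01-01 00:00:12', '1234'],
    ['asdf', '2012-01-01 00:00:31', '1235'],
    ['asdf', '2012-01-01 09:04:14', '3465'],
]
for group in iter_groups(rows, datetime.timedelta(minutes=5)):
    print(len(group), group[0][1])
```

Because it accepts any iterable, the rows could also come straight from a file reader instead of a list.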

score 1 · Answer 4 · answered Nov 21 '13 at 04:28

Not the most elegant perhaps, but something like this should work:

In [1]: from itertools import groupby

In [2]: d = [['asdf',1],
   ...:      ['asdf',2],
   ...:      ['asdf',5],
   ...:      ['asdf',6],
   ...:      ['asdf',7],
   ...:      ['asdf',20]]

In [3]: t = [x[1] for x in d]

In [4]: diff = [0] + [t[i+1] - t[i] for i in range(len(t)-1)]

In [5]: i = 0

In [6]: key = []

In [7]: for x in diff:
   ...:     if x > 2:
   ...:         i += 1
   ...:     key.append(i)
   ...:

In [8]: [tuple(row for row, _ in g) for k, g in groupby(zip(d, key), lambda x: x[1])]
Out[8]:
[(['asdf', 1], ['asdf', 2]),
 (['asdf', 5], ['asdf', 6], ['asdf', 7]),
 (['asdf', 20],)]

Of course you will have to parse the date strings to get a sensible time difference.
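A sketch of how the same groupby idea could be applied to the date strings: the timestamps are parsed first, and a running count of "large gap" flags serves as the group key. The helper name `group_by_gap` is hypothetical, and `itertools.accumulate` is used here in place of the explicit counter loop.

```python
import datetime
from itertools import accumulate, groupby

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

def group_by_gap(rows, max_gap):
    """Group time-sorted rows; a gap larger than max_gap starts a new group."""
    times = [datetime.datetime.strptime(row[1], DATE_FORMAT) for row in rows]
    # 1 where the gap to the previous row is too large, 0 otherwise.
    jumps = [0] + [int(t1 - t0 > max_gap) for t0, t1 in zip(times, times[1:])]
    # The running total of the flags is the group number of each row.
    keys = accumulate(jumps)
    return [[row for row, _ in pairs]
            for _, pairs in groupby(zip(rows, keys), key=lambda pair: pair[1])]

rows = [
    ['asdf', '2012-01-01 00:00:12', '1234'],
    ['asdf', '2012-01-01 00:00:31', '1235'],
    ['asdf', '2012-01-01 00:00:57', '2345'],
    ['asdf', '2012-01-01 09:04:14', '3465'],
    ['asdf', '2012-01-01 09:04:34', '1613'],
    ['asdf', '2012-01-01 14:48:15', '2572'],
]
print([len(g) for g in group_by_gap(rows, datetime.timedelta(minutes=5))])  # [3, 2, 1]
```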

score 1 · Answer 5 · answered Dec 28 '19 at 15:33

Here's another way to do this that I recently learned, using defaultdict. You can adapt this easily for further grouping by minutes, seconds, etc!

from collections import defaultdict

mylist = [['asdf', '2012-01-01 00:00:12', '1234'],
 ['asdf', '2012-01-01 00:00:31', '1235'],
 ['asdf', '2012-01-01 00:00:57', '2345'],
 ['asdf', '2012-01-01 00:01:19', '2346'],
 ['asdf', '2012-01-01 00:01:25', '2345'],
 ['asdf', '2012-01-01 09:04:14', '3465'],
 ['asdf', '2012-01-01 09:04:34', '1613'],
 ['asdf', '2012-01-01 09:04:51', '8636'],
 ['asdf', '2012-01-01 09:05:15', '5847'],
 ['asdf', '2012-01-01 09:05:29', '3672'],
 ['asdf', '2012-01-01 09:05:30', '2367'],
 ['asdf', '2012-01-01 09:05:43', '9544'],
 ['asdf', '2012-01-01 14:48:15', '2572'],
 ['asdf', '2012-01-01 14:48:34', '7483'],
 ['asdf', '2012-01-01 14:48:56', '5782']]

record_dict = defaultdict(list)

for item in mylist:
    # key each record by the hour field of its timestamp, e.g. '09'
    time_part = item[1].split(" ")[1]
    hour = time_part.split(":")[0]
    record_dict[hour].append(item)

res_list = list(record_dict.values())

print(res_list)

Output:

[

[['asdf', '2012-01-01 00:00:12', '1234'], ['asdf', '2012-01-01 00:00:31', '1235'], 
['asdf', '2012-01-01 00:00:57', '2345'], ['asdf', '2012-01-01 00:01:19', '2346'], 
['asdf', '2012-01-01 00:01:25', '2345']], 

[['asdf', '2012-01-01 09:04:14', '3465'], ['asdf', '2012-01-01 09:04:34', '1613'], 
['asdf', '2012-01-01 09:04:51', '8636'], ['asdf', '2012-01-01 09:05:15', '5847'], 
['asdf', '2012-01-01 09:05:29', '3672'], ['asdf', '2012-01-01 09:05:30', '2367'], 
['asdf', '2012-01-01 09:05:43', '9544']], 

[['asdf', '2012-01-01 14:48:15', '2572'], ['asdf', '2012-01-01 14:48:34', '7483'], 
['asdf', '2012-01-01 14:48:56', '5782']],

]
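One thing to keep in mind is that fixed buckets and gap-based splitting can disagree when a cluster straddles a bucket boundary, which is the case the asker describes in the question comments (00:59:50, 01:00:00, 01:00:10). The sketch below compares the two on those three times; the 5-minute gap and the variable names are assumptions made for the illustration.

```python
import datetime
from collections import defaultdict

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
rows = [
    ['asdf', '2012-01-01 00:59:50', '1'],
    ['asdf', '2012-01-01 01:00:00', '2'],
    ['asdf', '2012-01-01 01:00:10', '3'],
]

# Bucket by the hour field, as in the defaultdict approach.
by_hour = defaultdict(list)
for row in rows:
    hour = row[1].split(" ")[1].split(":")[0]
    by_hour[hour].append(row)

# Split on gaps larger than an assumed 5 minutes instead.
max_gap = datetime.timedelta(minutes=5)
by_gap = []
previous = None
for row in rows:
    current = datetime.datetime.strptime(row[1], DATE_FORMAT)
    if previous is None or current - previous > max_gap:
        by_gap.append([])
    by_gap[-1].append(row)
    previous = current

print(len(by_hour), len(by_gap))  # 2 1
```

The hour buckets separate the first row from the other two, while the gap-based split keeps all three together.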
