I have a list of lists that I want to group into separate lists based on clusters of time.
I can easily sort it based on the time, but I have not determined an easy way to group it together. I am fine with it being datetime / time format or text, either one works for me. I need to process the other data based on the cluster. This is a sample dataset that I might be working with.
[['asdf', '2012-01-01 00:00:12', '1234'],
['asdf', '2012-01-01 00:00:31', '1235'],
['asdf', '2012-01-01 00:00:57', '2345'],
['asdf', '2012-01-01 00:01:19', '2346'],
['asdf', '2012-01-01 00:01:25', '2345'],
['asdf', '2012-01-01 09:04:14', '3465'],
['asdf', '2012-01-01 09:04:34', '1613'],
['asdf', '2012-01-01 09:04:51', '8636'],
['asdf', '2012-01-01 09:05:15', '5847'],
['asdf', '2012-01-01 09:05:29', '3672'],
['asdf', '2012-01-01 09:05:30', '2367'],
['asdf', '2012-01-01 09:05:43', '9544'],
['asdf', '2012-01-01 14:48:15', '2572'],
['asdf', '2012-01-01 14:48:34', '7483'],
['asdf', '2012-01-01 14:48:56', '5782']]
The results should look something like this: a nested list containing one list of rows for each group.
[[['asdf', '2012-01-01 00:00:12', '1234'],
['asdf', '2012-01-01 00:00:31', '1235'],
['asdf', '2012-01-01 00:00:57', '2345'],
['asdf', '2012-01-01 00:01:19', '2346'],
['asdf', '2012-01-01 00:01:25', '2345']],
[['asdf', '2012-01-01 09:04:14', '3465'],
['asdf', '2012-01-01 09:04:34', '1613'],
['asdf', '2012-01-01 09:04:51', '8636'],
['asdf', '2012-01-01 09:05:15', '5847'],
['asdf', '2012-01-01 09:05:29', '3672'],
['asdf', '2012-01-01 09:05:30', '2367'],
['asdf', '2012-01-01 09:05:43', '9544']],
[['asdf', '2012-01-01 14:48:15', '2572'],
['asdf', '2012-01-01 14:48:34', '7483'],
['asdf', '2012-01-01 14:48:56', '5782']]]
The clusters have no set size and no set times. They can occur randomly throughout the day, and need to be split wherever there is a large gap in time.
The first group happens right after midnight and has 5 entries; the next one is centered around 09:05 and has 7 entries. The final one happens at about 14:48 and has only 3 entries. I could also have two groups at either end of the same hour, so I cannot just group by the hour.
I have already sorted and grouped the data by the first field in the list; I just need to break it down into smaller chunks to process. I am willing to change the date to whatever format is necessary to get the grouping done, as it is a key part of the analysis I am doing on the data.
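For example, converting the text to datetime objects and measuring the gap between consecutive rows is straightforward. This is only a minimal sketch, assuming every timestamp follows the YYYY-MM-DD HH:MM:SS pattern shown in the sample; the names FMT, rows, and gap_seconds are just illustrative.

```python
from datetime import datetime

# Assumed timestamp layout, matching the sample data in the question.
FMT = "%Y-%m-%d %H:%M:%S"

# Three rows taken from the sample, already sorted by time.
rows = [
    ['asdf', '2012-01-01 00:01:19', '2346'],
    ['asdf', '2012-01-01 00:01:25', '2345'],
    ['asdf', '2012-01-01 09:04:14', '3465'],
]

# Parse the second field of each row into a datetime object.
times = [datetime.strptime(row[1], FMT) for row in rows]

# Subtracting consecutive datetimes gives timedelta objects.
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
gap_seconds = [gap.total_seconds() for gap in gaps]

print(gap_seconds)
```

The first gap is a few seconds and the second is several hours, which is the kind of difference I want to split on.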
I would prefer to keep the solution within the Python standard library, but if there is no solution I can attempt to get other packages.
I have already looked at solutions here, here, here, here, and many others but none of them address the random nature of these times.
Splitting the list at any gap greater than X time would be a great solution, so I can change X to 5 or 10 minutes, whatever is deemed appropriate. Dropping any group that has length less than 3 would also be a bonus, but can easily be done at the end.
My only real idea right now is to loop through the list, compare each time with the previous one, and split the list wherever the gap is too large, but it seems like a very inefficient way of solving this problem when there are millions of records to sort and group.
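Roughly, the loop idea would look like the sketch below. The names split_on_gaps, max_gap, and min_size are hypothetical, and it assumes the rows are already sorted by time with the timestamp in position 1 and in the format shown in the sample.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, matching the sample data in the question.
FMT = "%Y-%m-%d %H:%M:%S"


def split_on_gaps(rows, max_gap=timedelta(minutes=5), min_size=1):
    """Yield lists of consecutive rows, starting a new list whenever the
    time between neighbouring rows exceeds max_gap.

    Rows must already be sorted by time. Clusters with fewer than
    min_size rows are dropped.
    """
    cluster = []
    previous = None
    for row in rows:
        current = datetime.strptime(row[1], FMT)
        # A gap larger than max_gap closes the current cluster.
        if previous is not None and current - previous > max_gap:
            if len(cluster) >= min_size:
                yield cluster
            cluster = []
        cluster.append(row)
        previous = current
    # Emit the final cluster, if it is large enough.
    if cluster and len(cluster) >= min_size:
        yield cluster


# A shortened version of the sample data: clusters of 3, 2, and 3 rows.
sample = [
    ['asdf', '2012-01-01 00:00:12', '1234'],
    ['asdf', '2012-01-01 00:00:31', '1235'],
    ['asdf', '2012-01-01 00:00:57', '2345'],
    ['asdf', '2012-01-01 09:04:14', '3465'],
    ['asdf', '2012-01-01 09:04:34', '1613'],
    ['asdf', '2012-01-01 14:48:15', '2572'],
    ['asdf', '2012-01-01 14:48:34', '7483'],
    ['asdf', '2012-01-01 14:48:56', '5782'],
]

all_clusters = list(split_on_gaps(sample, max_gap=timedelta(minutes=5)))
big_clusters = list(split_on_gaps(sample, max_gap=timedelta(minutes=5), min_size=3))
```

This makes a single pass over the sorted rows, so the work should grow linearly with the number of records, but I have not measured whether that is fast enough for millions of rows.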
Any help would be greatly appreciated. If any of this doesn't make sense I will do my best to clarify.