I have two pieces of code (doing the same job) that take in an array of datetimes and produce clusters of datetimes whose consecutive entries are one hour apart.
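For illustration, this is the intended behaviour on a tiny, made-up input: a run of hourly timestamps forms one cluster, and a gap larger than one hour starts a new one.

import datetime

# Hourly timestamps with a 3-hour gap between 10:00 and 13:00.
times = [datetime.datetime(2016, 10, 1, 8, 0),
         datetime.datetime(2016, 10, 1, 9, 0),
         datetime.datetime(2016, 10, 1, 10, 0),
         datetime.datetime(2016, 10, 1, 13, 0),
         datetime.datetime(2016, 10, 1, 14, 0)]

# Expected clusters: [08:00, 09:00, 10:00] and [13:00, 14:00].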
The first piece is:
import itertools
from itertools import groupby
from operator import itemgetter

def findClustersOfRuns(data):
    runClusters = []
    # Pair each timestamp with its successor and group consecutive
    # pairs by their gap (in hours); each group becomes one cluster.
    for k, g in groupby(itertools.izip(data[0:-1], data[1:]),
                        lambda (i, x): (i - x).total_seconds() / 3600):
        runClusters.append(map(itemgetter(1), g))
    return runClusters
The second piece is:
import itertools

def findClustersOfRuns(data):
    if len(data) <= 1:
        return []
    current_group = [data[0]]
    delta = 3600  # maximum allowed gap between cluster members, in seconds
    results = []
    for current, next in itertools.izip(data, data[1:]):
        if abs((next - current).total_seconds()) > delta:
            # Here, `current` is the last item of the previous subsequence
            # and `next` is the first item of the next subsequence.
            if len(current_group) >= 2:
                results.append(current_group)
            current_group = [next]
            continue
        current_group.append(next)
    # Flush the trailing cluster, which the loop above never appends.
    if len(current_group) >= 2:
        results.append(current_group)
    return results
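Since both pieces define the same name, one way to run them side by side is to bind them to distinct names first (clusters_groupby and clusters_loop below are hypothetical renames of the first and second piece, respectively) and inspect what each returns on a toy input:

import datetime

# Assumes the first piece has been renamed clusters_groupby and the
# second clusters_loop; both definitions are exactly as given above.
toy = [datetime.datetime(2016, 10, 1, h, 0) for h in (8, 9, 10, 13, 14)]

print(clusters_groupby(toy))  # clusters reported by the groupby version
print(clusters_loop(toy))     # clusters reported by the explicit loop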
The first piece takes about 5 minutes to execute, while the second takes a few seconds. I am trying to understand why.
The data over which I am running the code has this shape:

data.shape
(13989L,)
The contents of the data are:

data
array([datetime.datetime(2016, 10, 1, 8, 0),
       datetime.datetime(2016, 10, 1, 9, 0),
       datetime.datetime(2016, 10, 1, 10, 0), ...,
       datetime.datetime(2019, 1, 3, 9, 0),
       datetime.datetime(2019, 1, 3, 10, 0),
       datetime.datetime(2019, 1, 3, 11, 0)], dtype=object)
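For reference, the timing comparison can be reproduced on a synthetic array of roughly the same shape (the gap placement below is an assumption, made up only to resemble the real data):

import datetime
import timeit
import numpy as np

# ~14,000 hourly timestamps with a 4-hour gap every 500 entries
# (a made-up pattern, only meant to mimic the real data's shape).
base = datetime.datetime(2016, 10, 1, 8, 0)
data = np.array([base + datetime.timedelta(hours=i + 3 * (i // 500))
                 for i in range(13989)], dtype=object)

# Time whichever implementation is currently bound to findClustersOfRuns.
print(timeit.timeit(lambda: findClustersOfRuns(data), number=1))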
How do I improve the first piece of code so that it runs as fast as the second?