
I have two pieces of code (doing the same job) that take in an array of datetime objects and produce clusters of datetimes that are one hour apart.

First piece is:

import itertools
from operator import itemgetter

def findClustersOfRuns(data):
    runClusters = []
    for k, g in itertools.groupby(itertools.izip(data[0:-1], data[1:]),
                                  lambda (i, x): (i - x).total_seconds() / 3600):
        runClusters.append(map(itemgetter(1), g))
    return runClusters

Second piece is:

import itertools

def findClustersOfRuns(data):
    if len(data) <= 1:
        return []

    current_group = [data[0]]
    delta = 3600
    results = []

    for current, next in itertools.izip(data, data[1:]):
        if abs((next - current).total_seconds()) > delta:
            # Here, `current` is the last item of the previous subsequence
            # and `next` is the first item of the next subsequence.
            if len(current_group) >= 2:
                results.append(current_group)

            current_group = [next]
            continue

        current_group.append(next)

    return results

The first piece takes about 5 minutes to execute, while the second takes a few seconds. I am trying to understand why.

The array I am running the code over has this shape:

data.shape
(13989L,)

The data contents are:

data
array([datetime.datetime(2016, 10, 1, 8, 0),
       datetime.datetime(2016, 10, 1, 9, 0),
       datetime.datetime(2016, 10, 1, 10, 0), ...,
       datetime.datetime(2019, 1, 3, 9, 0),
       datetime.datetime(2019, 1, 3, 10, 0),
       datetime.datetime(2019, 1, 3, 11, 0)], dtype=object)

How do I improve the first piece of code to make it run as fast as the second?

  • Why downvote, can you please explain? – Zanam Oct 07 '16 at 10:53
  • I believe it is due to the usage of lambda – theBugger Oct 07 '16 at 10:55
  • Sorry, I don't know the reason for the massive time difference, but FWIW, using literal tuples in function args is deprecated, and not permitted in Python 3, so your `lambda (i, x): ...` won't work in Py3. Also, you shouldn't use `next` as a variable name as that shadows the built-in `next` function. It won't hurt here, but it is confusing to people reading your code. – PM 2Ring Oct 07 '16 at 11:02
  • One of the first things you should do when optimizing a piece of code is profile it to determine where it spends most of its time. See [_How can you profile a Python script?_](http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) – martineau Oct 07 '16 at 12:44

1 Answer


Based on the size, it looks like you are working with a huge list of elements. Your second piece of code has just one for loop, whereas your first approach iterates over that huge list several times, hidden inside map() and groupby(). Those extra passes add a lot to the running time, and they are also slower than a plain for loop.
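As a rough way to see this yourself, here is a minimal, self-contained timing sketch (Python 3 syntax). The hourly dataset is made up for illustration, and the function names are mine, not yours, so the absolute numbers will not reproduce your 5-minutes-versus-seconds gap; it only compares the shape of the two approaches:

import datetime
import itertools
import timeit
from operator import itemgetter

# made-up dataset: ~14000 datetimes spaced exactly one hour apart
data = [datetime.datetime(2016, 10, 1) + datetime.timedelta(hours=i)
        for i in range(14000)]

def pipeline_version(data):
    # groupby()/map() pipeline, same shape as the first snippet
    clusters = []
    for _, g in itertools.groupby(zip(data[:-1], data[1:]),
                                  lambda pair: (pair[0] - pair[1]).total_seconds() / 3600):
        clusters.append(list(map(itemgetter(1), g)))
    return clusters

def loop_version(data):
    # single explicit pass, same shape as the second snippet
    clusters, current_group = [], [data[0]]
    for current, nxt in zip(data, data[1:]):
        if abs((nxt - current).total_seconds()) > 3600:
            if len(current_group) >= 2:
                clusters.append(current_group)
            current_group = [nxt]
        else:
            current_group.append(nxt)
    return clusters

print(timeit.timeit(lambda: pipeline_version(data), number=10))
print(timeit.timeit(lambda: loop_version(data), number=10))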

I made a comparison for another post that you might find useful: Comparing list comprehensions and explicit loops.

Also, the use of a lambda function adds extra time.

However, you may further improve the execution time by binding results.append to a separate variable, say my_func, and calling my_func(current_group) instead; that avoids repeating the attribute lookup on every iteration.
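For example, here is a minimal sketch of that micro-optimization applied to your second function, keeping its structure. The name my_func is only illustrative, and I have used zip and nxt so the sketch also runs on Python 3:

import datetime

def findClustersOfRuns(data):
    if len(data) <= 1:
        return []

    current_group = [data[0]]
    delta = 3600
    results = []
    my_func = results.append            # look up the bound method once, outside the loop

    for current, nxt in zip(data, data[1:]):
        if abs((nxt - current).total_seconds()) > delta:
            if len(current_group) >= 2:
                my_func(current_group)  # reuse the cached method instead of results.append
            current_group = [nxt]
            continue
        current_group.append(nxt)

    return results

# tiny usage example with made-up hourly datetimes
sample = [datetime.datetime(2016, 10, 1, h) for h in (8, 9, 10, 14, 15)]
print(findClustersOfRuns(sample))       # prints only the 8:00-10:00 cluster; like the
                                        # original snippet, the trailing group is only
                                        # flushed when the next gap is seen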

