I have a function that looks like this:

def long_running_with_more_values(start, stop):
    headers = get_headers.delay(start, stop)
    insert_to_db.delay(headers)

This function batch-processes data that is requested from the network in parallel. Both get_headers and insert_to_db are dispatched to the message queue and eventually processed by Celery workers, so they do not block execution.

It has to process every number between start and stop, but can split this up into sections (ranges).

I've found that get_headers performs best when the range is about 20000, where range = (stop - start).

I want to know how to split an arbitrary range into groups of 20000 and run each group through the function, so that the function is called multiple times with different start and stop values that together still cover the whole original range.

So for start and stop values of 1 and 100000 respectively, I'd expect get_headers to be called 5 times with the following ranges:

[1,20000][20001,40000][40001,60000][60001,80000][80001,100000]
Jharwood
  • If you're just looking for a way to split a list into segments of `n` elements, [see this question](http://stackoverflow.com/questions/1624883/alternative-way-to-split-a-list-into-groups-of-n). – kojiro Mar 10 '13 at 00:56
  • No, I'm not. I'm looking to split a task that processes items, specified by a range between two IDs, into subtasks that are more efficient. – Jharwood Mar 10 '13 at 00:59
  • Consider `range(start, stop, 20000)` to get your partition boundaries. – kojiro Mar 10 '13 at 01:10
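
As a rough illustration of the range() suggestion above, the sketch below computes the partition boundaries and turns them into (start, stop) pairs. It assumes both bounds are inclusive, as in the expected output in the question. The helper name split_range and its size parameter are hypothetical, not part of the original code.

```python
def split_range(start, stop, size=20000):
    """Return inclusive (lo, hi) pairs covering start..stop in chunks of at most `size`.

    Assumes both bounds are inclusive, matching the ranges listed in the question.
    """
    return [
        (lo, min(lo + size - 1, stop))  # clamp the last chunk to stop
        for lo in range(start, stop + 1, size)  # range() yields each chunk's first ID
    ]


for lo, hi in split_range(1, 100000):
    print(lo, hi)
```

Each pair could then be passed to get_headers.delay(lo, hi). Because the last chunk is clamped with min(), no pair reaches past stop.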

1 Answer

def long_running_with_more_values(start, stop):
    while start < stop:
        if stop - start < 20000:
            headers = get_headers.delay(start, stop)
            break
        else:
            headers = get_headers.delay(start, start + 20000)
            start += 20000
    insert_to_db.delay(headers)

Note that headers only holds the return value of the last call to get_headers.delay(). You may need to accumulate the results instead, for example with headers += get_headers.delay(start, stop), but I can't really tell without knowing what get_headers.delay() returns.
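
One possible way around the "only the last call" issue is to hand each chunk's headers to the insert step inside the loop. The sketch below is untested against the real tasks: get_headers and insert_to_db are plain-function stand-ins for the Celery tasks, the db list is a hypothetical substitute for the database, and the bounds are assumed to be inclusive as in the question.

```python
CHUNK = 20000  # chunk size reported as optimal in the question


def get_headers(start, stop):
    # Hypothetical stand-in for get_headers.delay(); returns the IDs it "fetched".
    return list(range(start, stop + 1))


def insert_to_db(headers, db):
    # Hypothetical stand-in for insert_to_db.delay(); db is just a list here.
    db.extend(headers)


def long_running_with_more_values(start, stop, db):
    calls = []  # record of the (start, end) pairs that were dispatched
    while start <= stop:
        end = min(start + CHUNK - 1, stop)  # never step past stop
        headers = get_headers(start, end)
        insert_to_db(headers, db)  # insert every chunk, not only the last one
        calls.append((start, end))
        start = end + 1  # next chunk begins right after this one
    return calls
```

With real Celery tasks, .delay() returns an AsyncResult rather than the headers themselves, so the two steps would probably need to be linked, for instance with chain(get_headers.s(start, end), insert_to_db.s()). Whether that fits depends on how the tasks are defined.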

Ionut Hulub
  • The problem with this, from what I can see, is that you'll end up with headers outside the range you specify, which isn't allowed according to the spec. – Jharwood Mar 11 '13 at 09:06
  • I don't see how that could happen. Can you please provide an example? – Ionut Hulub Mar 11 '13 at 18:52