I am trying to calculate the origin and offset of variable size arrays and store them in a dictionary. Here is the likely non-pythonic way that I am achieving this. I am not sure if I should be looking to use map, a lambda function, or list comprehensions to make the code more pythonic.

Essentially, I need to cut an array into chunks based on its total size and store the xstart, ystart, x_number_of_rows_to_read, y_number_of_columns_to_read for each chunk in a dictionary. The total size is variable. I cannot load the entire array into memory and use numpy indexing, or I definitely would. The origin and offset are used to read each chunk into numpy.

from collections import defaultdict

# Python 2: "/" on two ints is integer division.
intervalx = xsize / xsegment  # Get the size of the chunks
intervaly = ysize / ysegment  # Get the size of the chunks

# Setup to segment the image, storing the start values and key into a dictionary.
xstart = 0
ystart = 0
key = 0

d = defaultdict(list)

for y in xrange(0, ysize, intervaly):
    if y + (intervaly * 2) < ysize:
        numberofrows = intervaly
    else:
        numberofrows = ysize - y

    for x in xrange(0, xsize, intervalx):
        if x + (intervalx * 2) < xsize:
            numberofcolumns = intervalx
        else:
            numberofcolumns = xsize - x
        l = [x, y, numberofcolumns, numberofrows]
        d[key].append(l)
        key += 1
return d

I realize that xrange is not ideal for a port to 3.

Jzl5325
  • xrange is fine -- 2to3 handles that one without any problems. – mgilson Jul 18 '12 at 20:43
  • Have you considered `h5py`? It allows you to use [`numpy` syntax to work with arrays](http://h5py.alfven.org/docs-2.0/intro/quick.html#getting-your-data-back) without loading all elements into memory. – jfs Jul 18 '12 at 21:01
  • I have considered both h5py and using numpy.memmap, but do not believe I can apply them. Specifically, the array is an image, not raw array, and I am using GDAL to read the image as a numpy array. I would need to strip off the header, then process the array, then reapply the header. Would direct disk access be possible / better? – Jzl5325 Jul 18 '12 at 21:15

4 Answers


This code looks fine except for your use of defaultdict. A list seems like a much better data structure because:

  • Your keys are sequential
  • You are storing a list whose only element is another list in your dict.

One thing you could do:

  • use the ternary operator (I'm not sure if this would be an improvement, but it would be fewer lines of code)

Here's a modified version of your code with my few suggestions.

intervalx = xsize / xsegment #Get the size of the chunks
intervaly = ysize / ysegment #Get the size of the chunks

#Setup to segment the image storing the start values and key into a dictionary.
xstart = 0
ystart = 0

output = []

for y in xrange(0, ysize, intervaly):
    numberofrows = intervaly if y + (intervaly * 2) < ysize else ysize - y
    for x in xrange(0, xsize, intervalx):
        numberofcolumns = intervalx if x + (intervalx * 2) < xsize else xsize - x
        lst = [x, y, numberofcolumns, numberofrows]
        output.append(lst)

        #If it doesn't make any difference to your program, the above 2 lines could read:
        #tple = (x, y, numberofcolumns, numberofrows)
        #output.append(tple)

        #This will be slightly more efficient 
        #(tuple creation is faster than list creation)
        #and less memory hungry.  In other words, if it doesn't need to be a list due
        #to other constraints (e.g. you append to it later), you should make it a tuple.

Now, to get your data, you can do offset_list = output[5] instead of offset_list = d[5][0].
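
For reference, here is a minimal Python 3 sketch of the list-based approach, using the `namedtuple` raised in the comments on this answer. The names `Window` and `make_windows` are hypothetical, and the chunk size is computed with `min()` rather than the `* 2` comparison, so treat it as one possible variant and not as the answer's exact code.

```python
from collections import namedtuple

# Hypothetical record type; field order mirrors [x, y, numberofcolumns, numberofrows].
Window = namedtuple("Window", ["x", "y", "numberofcolumns", "numberofrows"])


def make_windows(xsize, ysize, xsegment, ysegment):
    """Return read windows in row-major order (y in the outer loop)."""
    # Integer division; max(1, ...) avoids a zero step when there are
    # more segments than pixels along an axis.
    intervalx = max(1, xsize // xsegment)
    intervaly = max(1, ysize // ysegment)

    output = []
    for y in range(0, ysize, intervaly):
        numberofrows = min(intervaly, ysize - y)  # clip the last row of chunks
        for x in range(0, xsize, intervalx):
            numberofcolumns = min(intervalx, xsize - x)  # clip the last column of chunks
            output.append(Window(x, y, numberofcolumns, numberofrows))
    return output


windows = make_windows(xsize=10, ysize=7, xsegment=3, ysegment=2)
print(len(windows))
print(windows[5])
```

With this layout, `windows[5].numberofrows` plays the role of `d[5][0][3]` in the dictionary version.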

mgilson
  • Thanks, I had not considered using a list, but it does make more sense than using a dictionary as I do not need to track position by key. – Jzl5325 Jul 18 '12 at 22:03
  • a tuple or even a namedtuple instead of a sublist seems like a better fit here. – jfs Jul 20 '12 at 09:52
  • @J.F.Sebastian -- Why do you say that? The OP is using sequential numbers starting from 0. Why is a namedtuple better for that? Building this as a tuple initially would be difficult. Of course, converting it to a tuple after the fact is trivial, but I'm not really sure what the point is in doing that ... – mgilson Jul 20 '12 at 12:16
  • if you drop `[]` in the `lst = [...]` line you get a tuple. Nothing difficult – jfs Jul 20 '12 at 12:20
  • @J.F.Sebastian -- OH, that list. Yes, you're right, tuples are better for that one. (I thought you were talking about the list `output`, but I suppose that is why you used the term 'sublist' :D.) I'll edit. – mgilson Jul 20 '12 at 12:21
  • @Jzl5325 -- See the comments I've added due to a great comment by J.F.Sebastian. If you take his suggestion, it should make this code slightly more efficient with almost no work on your part. – mgilson Jul 20 '12 at 12:27
  • @mgilson Thanks! I did not know that a tuple would be a better choice and / or faster. I will make the change. Unfortunately, these lines are fast already, but I will take any speedup to offset the IO issues I have using GDAL. – Jzl5325 Jul 20 '12 at 14:29
  • I wanted to get a better handle on this as I assumed the only difference was mutability, and came across this post. Quite good. http://stackoverflow.com/a/1708538/839375 – Jzl5325 Jul 20 '12 at 14:42
  • @Jzl5325: the reason to use tuples is not speed but the semantics that they communicate: [Tuples have structure, lists have order.](http://stackoverflow.com/a/626871/4279) – jfs Jul 21 '12 at 03:15

Although it doesn't change your algorithm, a more pythonic way to write your if/else statements is:

numberofrows = intervaly if y + intervaly * 2 < ysize else ysize - y

instead of this:

if y + (intervaly * 2) < ysize:
    numberofrows = intervaly
else:
    numberofrows = ysize - y

(and similarly for the other if/else statement).

kamek
  • Why's that more Pythonic? It's much harder to parse. Ternary conditionals should be used sparingly. – Henry Gomersall Jul 18 '12 at 20:56
  • I checked out the ternary posting on wikipedia and am not seeing the improvement in either readability or speed. What is the purpose of ternary conditionals in a language like python? – Jzl5325 Jul 18 '12 at 22:05
  • Since what is 'pythonic' is subjective outside of PEP8, this is just what I was taught. I personally find it just as readable, and in some cases more readable, especially where similar constructs occur multiple times in the same block. Anyway, to each his/her own. – kamek Jul 19 '12 at 07:45
  • @Jzl5325 The ternary conditional was only added in 2.5, so it clearly wasn't core. I think its value is when you have a simple boolean variable: `extinguisher = water if paper_fire else co2`. – Henry Gomersall Jul 19 '12 at 09:30
  • @HenryGomersall That does make sense and is more readable. Thanks. – Jzl5325 Jul 19 '12 at 14:50

Have you considered using np.memmap to load the pieces dynamically instead? You would then just need to determine the offsets that you need on the fly, rather than chunking the array and storing the offsets.

http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
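
As a rough illustration, the sketch below reads one block from a raw binary file through `np.memmap`. It assumes a headerless file of known dtype and shape, which is not the situation described in the question (an image with a header, read through GDAL); the file name, shape, and dtype are made up for the demo. A fixed-length header could in principle be skipped with the `offset` argument.

```python
import os
import tempfile

import numpy as np

# Demo data: a headerless raw file (an assumption; real images usually carry a header).
rows, cols = 6, 8
path = os.path.join(tempfile.mkdtemp(), "demo.raw")
np.arange(rows * cols, dtype=np.uint16).reshape(rows, cols).tofile(path)

# Map the file instead of loading it; offset=0 because there is no header here.
image = np.memmap(path, dtype=np.uint16, mode="r", offset=0, shape=(rows, cols))

# Only the sliced region is copied into memory:
# 2 rows starting at row 2, 3 columns starting at column 4.
y, x, numberofrows, numberofcolumns = 2, 4, 2, 3
block = np.array(image[y:y + numberofrows, x:x + numberofcolumns])
print(block)
```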

JoshAdel

This is a long one-liner:

d = [(x,y,min(x+xinterval,xsize)-x,min(y+yinterval,ysize)-y) for x in 
xrange(0,xsize,xinterval) for y in xrange(0,ysize,yinterval)]
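
The same idea can be spelled out for Python 3 as a sketch: `min(start + interval, size) - start` clips the final chunk along each axis, so the tiles cover the array exactly once. The function name `tile` is hypothetical. Note that the comprehension iterates x in the outer loop, so the order differs from the row-major order of the question's loops, and that the clipping differs from the question's `* 2` comparison (for example, with ysize = 9 and intervaly = 3, that comparison assigns 6 rows to the chunk starting at y = 3).

```python
def tile(xsize, ysize, xinterval, yinterval):
    """Return (x, y, numberofcolumns, numberofrows) tuples covering the array."""
    return [
        (x, y, min(x + xinterval, xsize) - x, min(y + yinterval, ysize) - y)
        for x in range(0, xsize, xinterval)
        for y in range(0, ysize, yinterval)
    ]


d = tile(xsize=10, ysize=7, xinterval=3, yinterval=3)

# Sanity check: count how many tiles touch each cell; every count should be 1.
hits = [[0] * 10 for _ in range(7)]
for x, y, ncols, nrows in d:
    for row in range(y, y + nrows):
        for col in range(x, x + ncols):
            hits[row][col] += 1

print(all(count == 1 for line in hits for count in line))
```
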
Marco de Wit