
I need to apply two running filters on a large amount of data. I have read that creating variables on the fly is not a good idea, but I wonder if it still might be the best solution for me.

My question: Can I create arrays in a loop with the help of a counter (array1, array2, …) and then call them with the counter (something like `'array' + str(counter)` or `'array' + str(counter - 1)`)?

Why I want to do it: The data are 400x700 arrays at 15-minute time steps over a year (so I have about 35,000 400x700 arrays). Each time step is read into Python individually. I need to apply one running filter that checks whether the last four time steps are equal (element-wise); if they are, all four values are set to zero. The second filter uses the data after the first filter has run and checks whether the sum of the last twelve time steps exceeds a certain value. When both filters are done, I want to sum up the values, so that at the end of the year I have one 400x700 array with the filtered, accumulated values.

I do not have enough memory to read in all the data at once, so I thought I could create a loop where, for each time step, a new variable for the 400x700 array is created and the two filters are run. The older, already-filtered arrays I could then add to the yearly sum and delete, so that I never have more than 16 (4 + 12) time steps (arrays) in memory at any time.

I don't know if it's correct of me to ask such a question without any code to show, but I would really appreciate the help.

  • You should provide some code that generates a small sample of data of the right dimensions and datatype, either by hardcoding some values or by using the NumPy random functions. See [this question](http://stackoverflow.com/q/19268937/553404) for a good example. – YXD May 13 '14 at 10:53
  • It's not really clear what you are asking here. Are you looking for a way to make the above operation fast? If that is the case, then you should consider reading the data into memory in blocks of a month or so. For each new block you can "rewind" so that you get the overlap for your filters. – ebarr May 13 '14 at 10:57
  • What format is your data in? (I hope not 35,000 txt files.) Is it a single file with an ndarray of 400x700x35000 float64 in it? The pyfits package is really nice for working with big datasets like that, since it allows you to read in parts of your dataset, and there are other packages for other formats as well. Can you show the implementation of your filters, so we know whether your bottleneck is the filters or your I/O speed? – usethedeathstar May 13 '14 at 11:32
  • Maybe what @usethedeathstar describes with `pyfits` is more efficient, but your outline seems reasonable to me. Although you will only ever need to keep as many arrays in memory as the filter that looks back furthest needs, i.e., 12 in this case plus the summed array, not 16. – Midnighter May 13 '14 at 12:18

2 Answers


If your question is about the best data structure for keeping a certain number of arrays in memory, then in this case I would suggest a three-dimensional array. Its shape would be (400, 700, 12), since twelve is how many arrays you need to look back at. The advantage of this is that your memory use stays constant, since you load new arrays into the larger one. The disadvantage is that you need to shift all the arrays manually.
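A minimal sketch of that idea, with the two filters themselves left out; `read_step` here is just a stand-in stub for however you actually read one time step:

```python
import numpy as np

WINDOW = 12                          # how far back the second filter looks
N_STEPS = 35000                      # roughly one year of 15-minute steps


def read_step(step):
    # stand-in for the real per-step reader
    return np.random.rand(400, 700)


buf = np.zeros((400, 700, WINDOW))   # rolling window of the last 12 steps
yearly_sum = np.zeros((400, 700))

for step in range(N_STEPS):
    # the oldest slot is about to fall out of the window: accumulate it
    yearly_sum += buf[:, :, 0]
    # shift every array one slot towards the front ...
    buf[:, :, :-1] = buf[:, :, 1:]
    # ... and load the newest time step into the last slot
    buf[:, :, -1] = read_step(step)
    # ... apply the two filters to the slots of buf here ...

# flush the arrays still left in the window at the end of the year
yearly_sum += buf.sum(axis=2)
```

Each time step is accumulated exactly once: either when it drops out of the window or in the final flush.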

If you don't want to deal with the shifting yourself, I'd suggest using a `collections.deque` with a `maxlen` of 12.
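For example (again only a sketch, with the same `read_step` stub as above); appending to a full deque silently drops the oldest element, so no manual shifting is needed:

```python
from collections import deque

import numpy as np


def read_step(step):
    # stand-in for the real per-step reader
    return np.random.rand(400, 700)


window = deque(maxlen=12)            # holds at most the last 12 arrays
yearly_sum = np.zeros((400, 700))

for step in range(35000):
    if len(window) == window.maxlen:
        # the oldest array is about to be dropped: accumulate it first
        yearly_sum += window[0]
    window.append(read_step(step))
    # ... run the two filters over the arrays in window here ...

# flush whatever is still in the window at the end of the year
for arr in window:
    yearly_sum += arr
```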

– Midnighter

"Can I create arrays in a loop with the help of a counter (array1, array2…) and then call them with the counter (something like: ‘array’+str(counter) or ‘array’+str(counter-1)?"

This is a very common question that I think a lot of programmers face eventually; it has been asked and answered for Python on Stack Overflow several times.

The lesson to learn from this is not to use dynamic variable names, but instead to put the pieces of data you want to work with into an encompassing data structure.

The data structure could be, for example, a list, a dict, or a NumPy array. The `collections.deque` proposed by @Midnighter also seems to be a good candidate for such a running filter.
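For illustration, here is a minimal sketch of the dict variant; the shapes and values are just placeholders:

```python
import numpy as np

# instead of dynamic names array1, array2, ..., key the arrays by the counter
arrays = {}
for counter in range(5):
    arrays[counter] = np.zeros((400, 700))  # stand-in for the real data

# 'array' + str(counter) then becomes a plain dictionary lookup
current = arrays[3]
previous = arrays[3 - 1]
```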

– Community