
I have a Python class that I fill inside a for loop, and it is very slow when I loop through millions of data lines; there obviously must be a faster way. Perhaps I shouldn't be using a class at all, but I need some kind of structure so that I can sort the data.

Here is that class:

class Particle(object):
    def __init__(self, ID, nH, T, metallicity, oxygen, o6, o7, o8):
        self.ID = ID
        self.nH = nH
        self.T = T
        self.metallicity = metallicity
        self.oxygen = oxygen
        self.o6 = o6
        self.o7 = o7
        self.o8 = o8

and here is how I first filled it, after reading in all the individual arrays (ID, nH, T, etc.), using append, which is of course exceedingly slow:

partlist = []

for i in range(npart):
    partlist.append(Particle(int(ID[i]), nH[i], T[i], metallicity[i], oxygen[i], o6[i], o7[i], o8[i]))

This takes a couple hours for 30 million values, and obviously 'append' is not the right way to do it. I thought this was an improvement:

partlist = [Particle(int(ID[i]), nH[i], T[i], metallicity[i], oxygen[i], o6[i], o7[i], o8[i]) for i in range(npart)]

but this is taking probably just as long and hasn't finished after an hour.

I'm new to Python, and I know that looping over indexes is not "pythonic", but I am at a loss as to how to create and fill a Python structure in what should take only a few minutes.

Suggestions? Thanks in advance.

Ben O.
  • you need to read up on the difference between `range()` and `xrange()` to begin with, because 30 million things is 30 million things. –  Nov 21 '15 at 19:52
  • you can also avoid building `list`s by using generators instead. – sobolevn Nov 21 '15 at 19:59
  • and a list comprehension is just syntactic sugar for exactly what you are already doing. –  Nov 21 '15 at 20:01
  • Do you really need all the `Particle` objects in a list, or could you just use one at a time? A generator like `zip` (in Python 3, or `itertools.izip` in Python 2) might be a lot better for you if you only need one value at a time. – Blckknght Nov 21 '15 at 20:10
  • List comprehensions are not syntactic sugar. The loop has to repeatedly call the function `partlist.append`; the list comprehension uses the `LIST_APPEND` byte code to add to the list under construction. – chepner Nov 21 '15 at 20:21
  • marginally less time because it avoids the method lookup and dispatch, a few nanosecs, that is *insignificant* to the object creation time and space considerations, especially if this is swapping to disk. –  Nov 21 '15 at 20:24
  • Thanks all and Jarrod especially- the ability of numpy to create and manipulate arrays is most useful. I have it working in ~6 minutes, which is better than 4 hours! Obviously, I need to take a python class. I'll post my solution below. Feel free to critique. – Ben O. Nov 22 '15 at 05:01

2 Answers


Use the correct tool for the job:

You need to research more efficient data structures to begin with. Regular objects are not going to be the best solution for what you are trying to do if you need the entire dataset in memory at once.

Use xrange() instead of range():

range(30000000) creates a list of 30,000,000 numbers in memory; xrange() doesn't, because it evaluates lazily, the way a generator would.
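
For instance, a quick illustrative check (Python 2 only, since xrange() is gone in Python 3; note the first call really does allocate the whole list):

import sys

# The list object alone (just its internal array of pointers) is roughly
# 240 MB for 30 million entries; the int objects it points at cost more on top.
print(sys.getsizeof(range(30000000)))

# The xrange object is a constant few dozen bytes, regardless of length.
print(sys.getsizeof(xrange(30000000)))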


Use numpy to store and process the data in arrays efficiently.

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
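
For example, a structured dtype keeps all the fields in one contiguous block and makes the original goal of sorting trivial. A minimal sketch (the field names simply mirror some of the attributes from the question, and the size here is arbitrary):

import numpy as np

# One named, typed field per particle attribute (abbreviated to three fields)
parttype = np.dtype([('ID', np.int64), ('nH', np.float64), ('T', np.float64)])

parts = np.zeros(1000, dtype=parttype)

# Sorting by any named field replaces the need for a Particle class
parts_sorted = np.sort(parts, order='T')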

Processing:

Research stream processing and Map/Reduce approaches to processing the data. If you can avoid loading the entire data set into memory and process it as it is read you can avoid all the object creation and list building completely.
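
A minimal sketch of the streaming idea, assuming the particles arrive as whitespace-delimited rows in a text file (the filename and column layout here are hypothetical):

def particles(path):
    # Yield one record at a time instead of materializing 30 million objects
    with open(path) as f:
        for line in f:
            fields = line.split()
            yield (int(fields[0]),) + tuple(float(x) for x in fields[1:8])

total_oxygen = 0.0
for ID, nH, T, metallicity, oxygen, o6, o7, o8 in particles('particles.txt'):
    total_oxygen += oxygen  # process each particle as it is read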

Beyond that, 30,000,000 of something is 30,000,000 of something, and if you do not have enough RAM to hold it all in memory it is just going to swap to disk and grind away. But there is not enough information here to know whether you need the entire thing in a giant list to begin with.


Thanks for the answers. Jarrod's point about using numpy's structured arrays was the most helpful. Here's what I have, and it runs 40x faster now:

import numpy as np

# Structured dtype: one named, typed field per particle attribute
parttype = [('ID', int), ('nH', float), ('T', float), ('metallicity', float),
            ('oxygen', float), ('o6', float), ('o7', float), ('o8', float)]

# Preallocate the whole array up front, then fill one record per iteration
partlist = np.zeros((npart,), dtype=parttype)

for i in xrange(npart):
    partlist[i] = (int(ID[i]), nH[i], T[i], metallicity[i], oxygen[i], o6[i], o7[i], o8[i])

Still a for loop, but it runs reasonably fast on my data (6 minutes vs. 4 hours)!
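
For what it's worth, if ID, nH, and the rest are already numpy arrays, the remaining loop can likely be dropped entirely by assigning whole columns at once (a sketch under that assumption, not tested against the original data):

# Hypothetical fully vectorized fill: copy each input array into its field
partlist['ID'] = ID
partlist['nH'] = nH
partlist['T'] = T
partlist['metallicity'] = metallicity
partlist['oxygen'] = oxygen
partlist['o6'] = o6
partlist['o7'] = o7
partlist['o8'] = o8

# Structured arrays can then be sorted by any field, which was the original goal
partlist.sort(order='T')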

Ben O.