
I'm working on a project that involves using Python to read, process and write files that are sometimes as large as a few hundred megabytes. The program occasionally fails when I try to process some particularly large files. It does not say 'memory error', but I suspect that is the problem (in fact it gives no reason at all for failing).

I've been testing the code on smaller files and watching 'top' to see what memory usage is like; it typically reaches 60%. top says that I have 4050352k total memory, so about 3.8 GB.

Meanwhile I'm trying to track memory usage within python itself (see my question from yesterday) with the following little bit of code:

import sys
mem = 0
for variable in dir():
    variable_ = vars()[variable]
    try: 
        if str(type(variable_))[7:12] == 'numpy':
            numpy_ = True
        else:
            numpy_ = False
    except:
        numpy_ = False
    if numpy_:
        mem_ = variable_.nbytes
    else:
        mem_ = sys.getsizeof(variable_)
    mem += mem_
    print variable + ' type: ' + str(type(variable_)) + ' size: ' + str(mem_)
print 'Total: '+str(mem)

Before I run that block I set all the variables I don't need to None, close all files and figures, etc. After that block I use subprocess.call() to run a Fortran program that is required for the next stage of processing. Looking at top while the Fortran program is running shows that it is using ~100% of the CPU and ~5% of the memory, while Python is using 0% of the CPU and 53% of the memory. However, my little snippet of code tells me that all of the variables in Python add up to only 23 MB, which ought to be ~0.5%.

So what's happening? I wouldn't expect that little snippet to give me a spot-on memory figure, but it ought to be accurate to within a few MB, surely? Or is it just that top doesn't notice that the memory has been relinquished, but that it is available to other programs that need it if necessary?
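For what it's worth, I suppose I could also ask the kernel directly what the whole process is using, rather than summing getsizeof() results. Something along these lines (Linux-only, just a sketch, not part of the code above):

import resource

# Peak resident set size of this process; on Linux ru_maxrss is in kB
print 'Peak RSS: %d kB' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Current resident set size, read straight from /proc (Linux only)
status = open('/proc/self/status')
for line in status:
    if line.startswith('VmRSS'):
        print line.strip()
status.close()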

As requested, here's a simplified part of the code that is using up all the memory. (file_name.cub is an ISIS3 cube: a file containing 5 layers (bands) of the same map. The first layer is spectral radiance; the next four have to do with latitude, longitude, and other details. It's an image from Mars that I'm trying to process. StartByte is a value I previously read from the .cub file's ASCII header, giving the first byte of the data; Samples and Lines are the dimensions of the map, also read from the header.)

radiance_array = 'cheese'   # It'll make sense in a moment (and similarly for the other arrays)
f_to = open('To_file.dat','w') 

f_rad = open('file_name.cub', 'rb')
f_rad.seek(0)
header=struct.unpack('%dc' % (StartByte-1), f_rad.read(StartByte-1))
header = None    
#
f_lat = open('file_name.cub', 'rb')
f_lat.seek(0)
header=struct.unpack('%dc' % (StartByte-1), f_lat.read(StartByte-1))
header = None 
pre=struct.unpack('%df' % (Samples*Lines), f_lat.read(Samples*Lines*4))
pre = None
#
f_lon = open('file_name.cub', 'rb')
f_lon.seek(0)
header=struct.unpack('%dc' % (StartByte-1), f_lon.read(StartByte-1))
header = None 
pre=struct.unpack('%df' % (Samples*Lines*2), f_lon.read(Samples*Lines*2*4))
pre = None
# (And something similar for the other two bands)
# So header and pre are just to get to the right part of the file, and are 
# then set to None. I did try using seek(), but it didn't work for some
# reason, and I ended up with this technique.
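# (Just for reference, what I expected seek() to look like; this is not the
#  code I'm running. E.g. to put f_lon straight past the header and the first
#  two bands of 4-byte floats:
#      f_lon.seek(StartByte - 1 + Samples*Lines*2*4)
#  but as I said, it didn't work for me.)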
for line in range(Lines):
    sample_rad = struct.unpack('%df' % (Samples), f_rad.read(Samples*4))
    sample_rad = np.array(sample_rad)
    sample_rad[sample_rad<-3.40282265e+38] = np.nan  
    # And Similar lines for all bands
    # Then some arithmetic operations on some of the arrays
    i = 0
    for value in sample_rad:
        nextline = str(sample_lat[i])+', '+str(sample_lon[i])+', '+str(value)+'\n' # And other stuff
        f_to.write(nextline)
        i += 1
    if radiance_array == 'cheese':  # I'd love to know a better way to do this!
        radiance_array = sample_rad.reshape(len(sample_rad),1)
    else:
        radiance_array = np.append(radiance_array, sample_rad.reshape(len(sample_rad),1), axis=1)
        # And again, similar operations on all arrays. I end up with 5 output arrays
        # with dimensions ~830*4000. For the large files they can reach ~830x20000
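# (Aside: I suspect a better pattern than the 'cheese' sentinel plus np.append
#  on every line would be to collect the rows in a list and stack them once
#  after the loop, e.g.
#      radiance_rows.append(sample_rad)              # inside the loop
#      radiance_array = np.vstack(radiance_rows).T   # after the loop
#  but the code above is what I'm actually running.)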
f_rad.close()
f_lat.close()
f_to.close()   # etc etc 
sample_lat = None  # etc etc
sample_rad = None  # etc etc

#
plt.figure()
plt.imshow(radiance_array)
# I plot all the arrays, for diagnostic reasons

plt.show()
plt.close()

radiance_array = None  # etc etc
# I set all arrays apart from one (which I need to identify the 
# locations of nan in future) to None

# LOCATION OF MEMORY USAGE MONITOR SNIPPET FROM ABOVE

So I lied in the comments about opening several files; it's many instances of the same file. I only keep one array that isn't set to None, and its size is ~830x4000, yet this somehow constitutes 50% of my available memory. I've also tried gc.collect(), but no change. I'd be very happy to hear any advice on how I could improve any of that code (related to this problem or otherwise).

Perhaps I should mention: originally I was opening the files in full (i.e. not line by line as above); doing it line by line was an initial attempt to save memory.
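One thing I haven't tried yet, but which I suspect would avoid building the giant tuples that struct.unpack returns, is letting numpy read the binary data directly. Just a sketch (untested, and it assumes the cube stores little-endian 4-byte floats, which I haven't checked):

import numpy as np

f_rad = open('file_name.cub', 'rb')
f_rad.seek(StartByte - 1)                      # skip the ascii header
for line in range(Lines):
    # one line of Samples 4-byte floats, straight into a numpy array
    sample_rad = np.fromfile(f_rad, dtype='<f4', count=Samples).astype(np.float64)
    sample_rad[sample_rad < -3.40282265e+38] = np.nan
    # (arithmetic and writing to To_file.dat would go here, as above)
f_rad.close()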

EddyTheB
  • I've had similar problems using numpy in long running processes - it either leaks like a sieve or fragments the heap like none other. Are you creating a lot of numpy arrays and destroying them? – Yann Ramin Aug 03 '12 at 17:38
  • Python does not free memory back to the system immediately after it *destroys* some object instance. It has some object pools, called arenas, and it takes a while until those are released. In some cases, you may be suffering from memory fragmentation which also causes process' memory usage to grow. – C2H5OH Aug 03 '12 at 17:38
  • `sys.getsizeof()` is not a reliable way to test memory usage. First, it only tracks the memory of a given Python object, not the references it has to other items in memory. Second, according to the docs, it's not guaranteed to work correctly for 3rd party extensions. Also, what @C2H5OH said. – Joel Cornett Aug 03 '12 at 17:40
  • @YannRamin: Yes, I'm reading a single line each from multiple files at a time, turning them to numpy arrays, then I do some arithmetic on the arrays and then send the results to another file. Then do it again with the next lines from the original files. – EddyTheB Aug 03 '12 at 17:50
  • It would be helpful to see the actual code you are running that eats up the memory. Perhaps that can be improved. – Keith Aug 03 '12 at 18:04
  • It's not that unreasonable that reading hundreds of megabytes in some compact file format, and storing both the results and some post-processed information in some easy-to-access format, would take a couple gigabytes. In which case you might need to change your design to create and use a file instead of keeping the whole thing in memory. – abarnert Aug 03 '12 at 18:12
  • @RussellBorogove: It stops. I'm afraid I can't remember exactly what it says when it does that, but I think something like 'process killed', or maybe even nothing, there's no python error message (my code takes a long time to run, so for big files I set it to go overnight, I'll set it to go on a big file tonight and add what it says when it fails to my question tomorrow.) – EddyTheB Aug 03 '12 at 18:38
  • If the OS is killing the process without even giving Python a chance, you might want to tell us which OS and what error it's giving. (My guess is that you're on linux, and you have fixed swap, and you've called the kernel's bluff on overcommitted memory, so it has to kill you, so the 4GB process limit is just a red herring.) But it probably doesn't matter; the answer is still to avoid using as much memory as you are. – abarnert Aug 03 '12 at 18:51
  • @Keith, see new edited question. – EddyTheB Aug 03 '12 at 19:13
  • @abarnert, It's Linux. (Linux 3.3.4-1.fc16.x86_64), and While I'm at it it has Python 2.7.3. – EddyTheB Aug 03 '12 at 19:17
  • What's the value of Lines? That's a little unusual use of `struct.unpack`, with such a large format string. You could just loop there also. You can also change `range` to `xrange`. – Keith Aug 04 '12 at 04:53
  • Have a look at the ObjTracker class in this file: http://bazaar.launchpad.net/~luke-campagnola/pyqtgraph/dev/view/head:/debug.py#L454 I use this for tracking down memory leaks in my own programs. – Luke Aug 05 '12 at 00:40

1 Answer


Just because you've dereferenced your variables doesn't mean the Python process has given the allocated memory back to the system. See How can I explicitly free memory in Python?

If gc.collect() does not work for you, investigate forking and reading/writing your files in child processes using IPC. Those processes will end when they're finished and release the memory back to the system. Your main process will continue to run with low memory usage.
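A minimal sketch of the idea using the multiprocessing module (the function name and arguments here are placeholders, not taken from your code): the heavy lifting happens in a child process, and when that child exits, every byte it allocated goes back to the OS, no matter what the Python allocator inside it was holding on to.

import multiprocessing

def process_cube(cube_path, out_path):
    # Placeholder for the memory-hungry work: read the .cub file, unpack the
    # bands, write the output. Everything allocated here dies with the child.
    pass

if __name__ == '__main__':
    p = multiprocessing.Process(target=process_cube,
                                args=('file_name.cub', 'To_file.dat'))
    p.start()
    p.join()   # the parent just waits; its own footprint stays small
    # plotting, the Fortran step, etc. can now run in the lean parent

If you need results back in the parent, a multiprocessing.Queue or an intermediate file on disk works as the IPC channel.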

bioneuralnet
  • Thanks, that link has been informative. gc.collect had no effect, please could you explain a little more about forking? Would putting the code above (inside the 'for line in Lines' loop) inside a "os.waitpid(0,0) / newpid = os.fork() / if newpid == 0:" do the trick? – EddyTheB Aug 03 '12 at 19:24
  • If you can use the subprocess module instead of forking explicitly, that's usually better. But the basic idea is that if you have, e.g., 4 tasks that each need 1GB of memory (even if they only need it briefly), having four separate child processes that each use 1GB plus a bit of overhead instead of one parent that uses 4GB and crashes is a good thing. (Which is not necessarily true on a 64-bit OS, especially if you're crashing because you're out of total system RAM+swap or page space rather than OS memory space, but it's worth finding out.) – abarnert Aug 03 '12 at 19:34
  • @EddyTheB: That might be the right idea. The subprocess module might be appropriate, too. But with either of those, I'm assuming you understand there's no easy way to share data among the processes. And it sounds like you'll need to, since you're reading, processing and writing. – bioneuralnet Aug 04 '12 at 02:59
  • ...continued. Breaking your flow up into discrete operations is the best advice I can give. Maybe one (or more) subprocess reads the data, transforms it (using huge swaths of RAM) into some intermediate state, then writes it out to a tmp file. Your master process waits, then it (or another subprocess) reads the tmpfile data, which you thoughtfully formatted for easy array plotting (or whatever), and plots the arrays. That way, you keep releasing your RAM from finished operations. – bioneuralnet Aug 04 '12 at 03:17
  • Thanks for your advice. In the end I just read the files in full and got around the memory problem by only allowing a small section of each image to be processed. – EddyTheB Aug 16 '12 at 14:48