
I'm working on a project with big data, and I often get a MemoryError when I run my script. The script loops over a list of files that it reads, and after 3 or 4 files the error appears.

I thought about writing something like this:

with open("E:\New_Fields\liste_essai.txt", "r") as f :

    fichier_entier = f.read()
    files = fichier_entier.split("\n")

for fichier in files :

    with open(fichier, 'r') :

    # CONDITIONS

    del var1
    del var2
    del var3

That way, I can free the memory before the next iteration of the loop, that is, before the next file.

But is there a method that deletes all the variables in my loop with just one command, instead of doing this manually? My script has maybe 15 variables, so from my point of view it's not optimal to remove each variable one after the other.

EDIT:

My list of files looks like this:

E:\New_Fields\Field101_combined_final_roughcal.fits
E:\New_Fields\Field117_combined_final_roughcal.fits
E:\New_Fields\Field150_combined_final_roughcal.fits
E:\New_Fields\Field36_combined_final_roughcal.fits
E:\New_Fields\Field41_combined_final_roughcal.fits
E:\New_Fields\Field169_combined_final_roughcal.fits
E:\New_Fields\Field47_combined_final_roughcal.fits
E:\New_Fields\Field43_combined_final_roughcal.fits
E:\New_Fields\Field39_combined_final_roughcal.fits
E:\New_Fields\Field45_combined_final_roughcal.fits
E:\New_Fields\Field6_combined_final_roughcal.fits
E:\New_Fields\Field49_combined_final_roughcal.fits
E:\New_Fields\Field51_combined_final_roughcal.fits

SCRIPT:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from astropy.io import fits
import numpy as np

                ##################################
                # File containing the raw field  #
                ##################################

with open("E:\New_Fields\liste_essai.txt", "r") as f :

    fichier_entier = f.read()
    files = fichier_entier.split("\n")

for fichier in files:

    # Name of the output file
    outname = fichier.replace('combined_final_roughcal', 'mask')

    # Open the file with astropy
    field = fits.open(fichier)
    print "Opening file: " + str(fichier)
    print " "

    # Read the FITS data
    tbdata = field[1].data
    print "Reading the FITS data"

                    ############################
                    # Apply the filter on PROB #
                    ############################

    mask = np.bitwise_and(tbdata['PROB'] < 1.1, tbdata['PROB'] > -0.1)
    new_tbdata = tbdata[mask]
    print "Building the mask"
    print " "

                #############################################
                # Determine the extreme values of the field #
                #############################################

    # Determine RA_max and RA_min
    RA_max = np.max(new_tbdata['RA'])
    RA_min = np.min(new_tbdata['RA'])
    print "RA_max is:     " + str(RA_max)
    print "RA_min is:     " + str(RA_min)

    # Determine DEC_max and DEC_min
    DEC_max = np.max(new_tbdata['DEC'])
    DEC_min = np.min(new_tbdata['DEC'])
    print "DEC_max is:   " + str(DEC_max)
    print "DEC_min is:   " + str(DEC_min)

                ##########################################
                # Compute the central value of the field #
                ##########################################

    # Determine RA_central and DEC_central
    RA_central = (RA_max + RA_min)/2.
    DEC_central = (DEC_max + DEC_min)/2.

    print "RA_central is: " + str(RA_central)
    print "DEC_central is: " + str(DEC_central)

    print " "
    print " ------------------------------- "
    print " "

                #####################
                # Determine X and Y #
                #####################

    # Build the new data arrays
    new_col_data_X = (new_tbdata['RA'] - RA_central) * np.cos(DEC_central)
    new_col_data_Y = new_tbdata['DEC'] - DEC_central
    print 'Building the arrays'

    # Build the new columns
    col_X = fits.Column(name='X', format='D', array=new_col_data_X)
    col_Y = fits.Column(name='Y', format='D', array=new_col_data_Y)
    print 'Building the new columns X and Y'

    # Build the new table
    tbdata_final = fits.BinTableHDU.from_columns(new_tbdata.columns + col_X + col_Y)

    # Write the output .fits file
    tbdata_final.writeto(outname)
    print 'Writing the new mask file: ' + outname

    del field, tbdata, mask, new_tbdata, new_col_data_X, new_col_data_Y, col_X, col_Y, tbdata_final

    print " "
    print " ......................................................................................"
    print " "

Thank you ;)

Essex
  • Don't read the files all at once, read line by line. Once the names get reassigned or go out of scope, the old values will be gc'd, so you should not need to do anything. It is almost certainly because you have large files and are reading the whole lot into memory at once; in your first example you actually briefly keep two full copies. – Padraic Cunningham Apr 14 '16 at 17:31
  • Redefine files to `files = []` – spectre-d Apr 14 '16 at 17:34
  • I just need to replace fichier_entier.split("\n") with []? – Essex Apr 14 '16 at 17:37
  • No, `for fle in f:with open(fle.rstrip()) as tmp...`; forget splitting and iterate over the file object. If you do need to split some data, then split per line, not the whole content at once. – Padraic Cunningham Apr 14 '16 at 17:39
  • @PadraicCunningham I have a list of files (as written in my edited question). My script reads the first line, performs some operations, saves the result and goes to the next line, etc. – Essex Apr 14 '16 at 17:45
  • I've no idea what you're asking about. Are you looking for `del var1,var2,var3` (one line)? `gc.collect()`? Some way to organize variables together? – ivan_pozdeev Apr 14 '16 at 17:55
  • @ivan_pozdeev I want to free memory after each loop. But I have lots of variables, so what is the best way to delete them? Write out each one, or another way? – Essex Apr 14 '16 at 17:57
  • At what stage do you get a memory error? – Padraic Cunningham Apr 14 '16 at 18:05
  • After 2 or 3 files are read. The files are between 100 MB and 4 GB of data. In my script I read the file, I filter it to reduce its size by removing unwanted values, I add 3 columns and I save the new file. – Essex Apr 14 '16 at 18:17
  • Add the logic where the code errors; if you are sorting 4-gig files then that is probably your issue. – Padraic Cunningham Apr 14 '16 at 18:22
  • I don't really understand, but I will try to find a solution to my problem. I have no choice, my project needs to handle huge files. – Essex Apr 14 '16 at 18:35
  • Add the code that actually causes the error to your question; it obviously happens somewhere inside `for fichier in files :`, so add that logic. – Padraic Cunningham Apr 14 '16 at 18:37
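
A minimal sketch of the line-by-line reading Padraic Cunningham suggests above; `process_one` is a hypothetical stand-in for the per-file work, and the path is the one from the question:

with open(r"E:\New_Fields\liste_essai.txt", "r") as f:
    for line in f:                    # iterate the file object, no full read()
        fichier = line.rstrip("\n")
        if not fichier:               # skip blank lines
            continue
        process_one(fichier)          # hypothetical per-file function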

2 Answers


Looking at the astropy docs for opening-a-fits-file:

The open() function has several optional arguments which will be discussed in a later chapter. The default mode, as in the above example, is “readonly”. The open function returns an object called an HDUList which is a list-like collection of HDU objects.

So that creates a huge list in memory, which is most likely your issue. There is a section on working-with-large-files:

The open() function supports a memmap=True argument that allows the array data of each HDU to be accessed with mmap, rather than being read into memory all at once. This is particularly useful for working with very large arrays that cannot fit entirely into physical memory.

That should help reduce the memory consumption. The only issues with mmap are, as the docs mention, that on a 32-bit system you would be limited to files of around 2 to 3 gigs, but you would also be limited by physical memory on a 32-bit system, so your 4-gig file would not fit in memory anyway. There may be other ways to limit your memory usage, but try using mmap and see how it works.
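
As a sketch, the only change needed in the loop from the question is the opening call (`files` is the list built there; the filtering and column building stay the same):

from astropy.io import fits

for fichier in files:
    # memmap=True maps the HDU array data with mmap instead of
    # reading it into memory all at once
    field = fits.open(fichier, memmap=True)
    try:
        tbdata = field[1].data
        # ... same PROB filtering and column building as in the question ...
    finally:
        field.close()  # release the file before the next iteration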

Padraic Cunningham
  • Thank you for your answer, I will read your links closely and try to solve the problem ;) – Essex Apr 14 '16 at 22:41
  • I will try to add `fits.open(file, 'readonly', memmap=True)` and see whether it's better or not – Essex Apr 14 '16 at 23:11

First, I'll answer your specific question (do note that in your case, it's not a real solution):

try:
    pass  # the loop body goes here
finally:
    del var1, var2  # etc.
    gc.collect()    # requires `import gc`

or

  • make them go out of scope, e.g. move the loop body into a separate function (see the sketch below)

You can't "automate" the process beyond that because if you only need to get rid of some variables, Python can't know which ones unless you tell it exactly.
gc.collect() is needed because, as a runtime with garbage collection, Python doesn't "delete" but "unbind" objects. Normally, you're happy to wait till the next automatic collection, but not in this case.
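
A hypothetical sketch of the function-scope approach; `process_file` is a made-up name, and its body stands in for the per-file logic from the question:

from astropy.io import fits

def process_file(fichier):
    # every local (field, tbdata, mask, ...) is unbound automatically
    # when the function returns, so no manual del is needed
    field = fits.open(fichier)
    try:
        tbdata = field[1].data
        # ... filtering, new columns, writeto(...) ...
    finally:
        field.close()

for fichier in files:
    process_file(fichier)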

Alternatively, some scopes can be edited as dicts, but that's not the primary way to do it, and function scopes cannot be edited like this in CPython anyway.
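
For completeness, a sketch of that dict-style editing at module scope; it works for globals() in CPython but, as noted, not for function locals:

# drop several module-level names in one go; pop() ignores missing ones
for name in ('var1', 'var2', 'var3'):
    globals().pop(name, None)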


Now, the real problems you have are in the design:

  • If you're hitting MemoryError on a regular basis, this already means you're doing it wrong and/or your algorithm is inadequate for your environment. As MemoryError hook in Python? explains, an out-of-memory condition cannot really be handled reliably by code that is not part of the memory manager, and in a garbage-collected environment, it's the garbage collector that is supposed to handle memory, not you.

    • specifically for "doing it wrong": even in the code you posted, I see a lot of redundant copies
  • If you have so many variables that you created and used in the first place, why is deleting them at the end such a trouble for you?
    This is a sign that your scope is too large, and this part needs to be either

    • split out as a separate function, and/or
    • split into smaller parts, with a cleanup after each
ivan_pozdeev
  • Thank you so much for your answer! I'm going to try to solve my problem with your advice :) – Essex Apr 14 '16 at 22:42