3

I am running the following code:

import glob
from myUtilities import myObject

for year in range(2006, 2015):
    front = 'D:\\newFilings\\'
    back = '\\*\\dirTYPE\\*.sgml'
    path = front + str(year) + back
    sgmlFilings = glob.glob(path)
    for each in sgmlFilings:
        header = myObject(each)
        try:
            tagged = header.process_tagged('G:')
        except Exception as e:
            outref = open('D:\\ProblemFiles.txt', 'a')
            outref.write(each + '\n')
            outref.close()
            print each

If I start from a reboot, the memory consumption by Python is fairly small. Over time, though, it increases significantly, and ultimately after about a day I have very little free memory (24 GB installed; 294 MB free, 23,960 MB cached), and the memory claimed by Python in the Windows Task Manager is 3 GB. I am watching this increase over the three days it takes to run the code against the file collection.

I was under the impression that since I am doing everything with

tagged = header.process_tagged('G:')

the memory associated with each loop iteration would be freed and garbage collected.

Is there something I can do to force the release of this memory? While I have not run statistics yet, I can tell by watching activity on the disk that the process slows down as time progresses (and as the memory ~lump~ gets bigger).

EDIT

I looked at the question referenced below, and I do not think it is the same issue. As I understand it, in the other question they are holding onto the objects (a list of triangles) and need the entire list for computation. With each loop I am reading a file, performing some processing of the file, and then writing it back out to disk. And then I am reading the next file . . .

With regard to possible memory leaks, I am using lxml in myObject.

Note: I added the line `from myUtilities import myObject` since the first iteration of this question. myUtilities holds the code that does everything.

Regarding posting my code for myUtilities: that gets away from the basic question. I am done with header and tagged after each iteration; tagged does its work and writes the results to another drive, as a matter of fact a newly formatted drive.

I looked into using multiprocessing, but I didn't pursue it because of a vague idea that, since this is so I/O-intensive, I would be competing for the drive heads. Maybe that is wrong, but since each iteration requires writing a couple of hundred MB of files, I would think I would be competing for write, or even read, time.

UPDATE: I had one case in the myObject class where a file was opened with

myString = open(somefile).read()

I changed that to

with open(somefile, 'r') as fHandle:
    myString = fHandle.read()

However, this had no apparent effect. When I started a new cycle I had 4,000 MB of cached memory; after 22 minutes and the processing of 27K files I had roughly 26,000 MB of cached memory.

I appreciate all of the answers and comments below and have been reading up and testing various things all day. I will keep updating this; I thought this task would take a week, and now it looks like it might take over a month.

I keep getting questions about the rest of the code. However, it is over 800 lines, and to me that sort of gets away from the central question.

So an instance of myObject is created (header). Then we apply the methods contained in myObject to header.

This is basically file transformation: a file is read in, and copies of parts of the file are made and written to disk.

The central question to me is that there is obviously some persistence with either header or tagged. How can I dispose of everything related to header or tagged before I start the next cycle?

I have been running the code now for the last 14 hours or so. When it went through the first cycle, it took about 22 minutes to process 27K files; now it is taking an hour and a half to handle approximately the same number.

Just running gc.collect() does not work. I stopped the program and tried it in the interpreter, and I saw no movement in the memory statistics.

EDIT: After reading the memory allocator description linked below, I am thinking that the amount tied up in the cache is not the problem; it is the amount tied up by the running Python process. So the new test is running the code from the command line. I am continuing to watch and monitor and will post more once I see what happens.

EDIT: Still struggling, but I have set up the code to run from a .bat file, with each run handling the data from one loop of sgmlFilings (see above). The batch file looks like this:

python batch.py
python batch.py
 .
 .
 .

batch.py starts by reading a queue file that holds a list of directories to glob; it takes the first one off the list, updates and saves the list, and then runs the header and tagged processes. Clumsy, but since python.exe exits after each iteration, Python never accumulates memory, and so the process runs at a consistent speed.
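
For anyone curious, here is a rough, simplified sketch of what that batch.py driver does; the queue-file name and details below are placeholders rather than the exact code:

import glob
from myUtilities import myObject

QUEUE_FILE = 'D:\\queue.txt'  # placeholder queue file: one glob pattern per line

# Pop the first pending pattern off the queue and rewrite the queue file.
with open(QUEUE_FILE, 'r') as fh:
    pending = [line.strip() for line in fh if line.strip()]

if pending:
    current, remaining = pending[0], pending[1:]
    with open(QUEUE_FILE, 'w') as fh:
        fh.write('\n'.join(remaining))

    # Process every matching file, then let the interpreter exit so the OS
    # reclaims all memory used by this python.exe instance.
    for each in glob.glob(current):
        header = myObject(each)
        try:
            header.process_tagged('G:')
        except Exception:
            with open('D:\\ProblemFiles.txt', 'a') as outref:
                outref.write(each + '\n')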

PyNEwbie
  • http://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python – Colonel Beauvel Jun 27 '15 at 13:46
  • I would recommend that you post all of your code (there are some functions in here whose source is invisible to us). We could then analyze all of your code and try to minimize the memory footprint – inspectorG4dget Jun 27 '15 at 13:52
  • What libraries are you using? If you are using some library implemented in C it might have some memory leak... – Bakuriu Jun 27 '15 at 13:55
  • I'm afraid this question is too broad to answer, unless you give us specifics into what your processing code is actually doing. – Burhan Khalid Jun 28 '15 at 08:24

4 Answers

11

The reason is CPython's memory management. The way Python manages memory makes life hard for long-running programs. When you explicitly free an object with the del statement, CPython does not necessarily return the allocated memory to the OS; it keeps the memory for further use in the future.

One way to work around this problem is to use the multiprocessing module: kill the process after you are done with the job and create another one. This way you free the memory by force, and the OS must reclaim the memory used by that child process. I have had the exact same problem: memory usage increased excessively over time, to the point where the system became unstable and unresponsive. I used a different technique, with signals and psutil, to work around it. This problem commonly occurs when you have a loop and need to repeatedly allocate and deallocate data, for example.
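
A minimal sketch of that workaround, assuming the per-file work can be wrapped in a worker function (the function below just re-creates the loop from the question; it is illustrative, not the poster's actual code):

import glob
import multiprocessing

def process_one_year(year):
    # Illustrative worker based on the code in the question; myObject is the
    # poster's own class from myUtilities.
    from myUtilities import myObject
    path = 'D:\\newFilings\\' + str(year) + '\\*\\dirTYPE\\*.sgml'
    for each in glob.glob(path):
        header = myObject(each)
        try:
            header.process_tagged('G:')
        except Exception:
            with open('D:\\ProblemFiles.txt', 'a') as outref:
                outref.write(each + '\n')

if __name__ == '__main__':
    for year in range(2006, 2015):
        worker = multiprocessing.Process(target=process_one_year, args=(year,))
        worker.start()
        worker.join()  # when the child exits, the OS reclaims all of its memory

A multiprocessing.Pool created with maxtasksperchild=1 gives a similar guarantee while also spreading the work across cores, as suggested in the comments below.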

You can read more about the Python memory allocator here: http://www.evanjones.ca/memoryallocator/

This tool is also very helpful for profiling memory usage: https://pypi.python.org/pypi/memory_profiler
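
For example, the package's @profile decorator gives a line-by-line memory report for a decorated function (a sketch; process_file below is just a stand-in for the real per-file processing):

from memory_profiler import profile

@profile  # prints a line-by-line memory report each time the function is called
def process_file(path):
    with open(path) as fh:   # stand-in for the real per-file work
        data = fh.read()
    return len(data)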

One more thing: add __slots__ to myObject. It seems you have a fixed set of attributes inside your object, and this also helps reduce RAM usage. Objects without __slots__ specified allocate more RAM to take care of dynamic attributes you may add to them later: http://tech.oyster.com/save-ram-with-python-slots/
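
For illustration only, since myObject's real attributes were not posted, a class with __slots__ might look like this (the attribute names are made up):

class myObject(object):
    # With __slots__, instances carry no per-instance __dict__, which lowers
    # memory use when many instances are created.
    __slots__ = ('filename', 'header', 'tagged')   # hypothetical attribute names

    def __init__(self, filename):
        self.filename = filename
        self.header = None
        self.tagged = None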

mitghi
  • Thanks, you provided some useful information - I am still reading, but I think **slots** are not the answer because I am creating one object at a time. – PyNEwbie Jun 27 '15 at 15:29
  • Why don't you use the multiprocessing module? There is a nice Pool class which can be advantageous in two aspects: 1) release of memory is guaranteed, because child processes will terminate after finishing the job; 2) utilization of all available CPU cores, and you can iterate through files in parallel. – mitghi Jun 27 '15 at 15:35
  • I am going to try but like I said I am wondering if I/O is a bigger problem so I am trying to research this question. – PyNEwbie Jun 27 '15 at 15:43
  • It shouldn’t be a problem that Python doesn’t release the memory to the operating system as long as it reuses freed-but-not-unmapped memory. In particular, the memory usage should grow to some amount and then stabilize; it shouldn’t keep increasing boundlessly, as appears to be happening here. – icktoofay Jun 28 '15 at 03:10
2

You can force a garbage collection using the gc module, in particular the gc.collect() function.

However, this may not solve your problem, since the gc is probably already running and you are either using a library or some code that contains a memory leak, or the library/code is keeping references alive somewhere. In any case, I doubt that the gc is the issue here.

Sometimes you may have code that keeps alive references to objects that you no longer need. In such a case you can consider explicitly deleting them when they aren't needed anymore; however, this doesn't seem to be the case here.


Also keep in mind that the memory usage of the Python process may actually be a lot smaller than what the OS reports. In particular, calls to free() need not return the memory to the OS (usually this doesn't happen for small allocations), so what you see may be the highest peak of memory usage up to that point, not the current usage. Add to this the fact that Python uses another layer of memory allocation on top of C's, and this makes it pretty hard to profile memory usage. However, if the memory keeps going up and up, this probably isn't the case.

You should use something like Guppy to profile memory usage.
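
For instance, a quick heap snapshot with Guppy looks roughly like this (a sketch, assuming the guppy package is installed):

from guppy import hpy

hp = hpy()
# ... run one iteration of the processing loop ...
print(hp.heap())   # breakdown of live objects by type and total size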

Bakuriu
1

You have some measure of control over this stuff using the gc module. Specifically, you might try incorporating

gc.collect() 

in your loop's body.

Ami Tavory
  • gc.collect() doesn't always return the memory to the OS. I had the exact same problem and it did not help. – mitghi Jun 27 '15 at 13:48
  • @mitghi I agree that you can't use it to control things like you can in C. Notwithstanding, I have seen cases where it does help (and also heard of cases where it doesn't, like you state). YMMV. – Ami Tavory Jun 27 '15 at 13:50
1

Before resorting to forced garbage collection (never a good idea), try something basic:

  1. Use glob.iglob (a generator) instead of getting a list of all your files at once.

  2. In your myObject(each) method, make sure you are closing the file or use the with statement so it is automatically closed; otherwise it will remain in memory eating up space.

  3. Don't repeatedly open and close the error log in your exception handler; just open the file once for writing and reuse the handle.

As you haven't posted the actual code that's doing the processing (and thus, possibly the cause of your memory woes), it is difficult to recommend specifics.
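
Still, a hedged sketch of the outer loop with those three suggestions applied would look something like this (the myObject call is copied from the question; the rest is illustrative):

import glob
from myUtilities import myObject

# Open the problem-file log once instead of on every exception.
with open('D:\\ProblemFiles.txt', 'a') as problems:
    for year in range(2006, 2015):
        path = 'D:\\newFilings\\' + str(year) + '\\*\\dirTYPE\\*.sgml'
        # glob.iglob yields one path at a time rather than building a full list.
        for each in glob.iglob(path):
            header = myObject(each)
            try:
                header.process_tagged('G:')
            except Exception:
                problems.write(each + '\n')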

Burhan Khalid
  • This has run for three days and out of 2 million iterations there were 400 exceptions. I feel like I need to update the log because if it crashes then I have to start over to find the exceptions - they cannot be identified until a file is open. While I appreciate the comment about iglob (I didn't know it existed) there are only 40K or 50K items in the list so that is not using up the memory – PyNEwbie Jun 27 '15 at 14:17
  • The truth of the matter is that I open two files during processing. On one of those I do use the with statement; on the other it was `myString = open(someFile).read()`. I have never understood the difference between that and with (all of the Python examples I found for reading files used open(file).read()). So I would say this looks promising. – PyNEwbie Jun 27 '15 at 14:24
  • `myString = open(someFile).read()` <- this is loading the entire file into memory - and more importantly, the file handler is left open for Python to deal with later. Don't do this unless you must. – Burhan Khalid Jun 27 '15 at 14:26
  • I have to have the entire file in memory, again it is the accumulation over time that is the issue, not any particular file. I am right now looking for some insight into this issue. I have already changed the code to `with`, unfortunately it will take 10 or 12 hours to know if that fixes the problem – PyNEwbie Jun 27 '15 at 14:30
  • It’s good practice to close files explicitly, but if you don’t, when all references to the file are gone, the garbage collector will go for it, at which point the finalizer will run and the file will be closed, so it is actually getting closed, so that’s not the source of the problem. – icktoofay Jun 28 '15 at 03:08