I am running the following code:
import glob

from myUtilities import myObject

for year in range(2006, 2015):
    front = 'D:\\newFilings\\'
    back = '\\*\\dirTYPE\\*.sgml'
    path = front + str(year) + back
    sgmlFilings = glob.glob(path)
    for each in sgmlFilings:
        header = myObject(each)
        try:
            tagged = header.process_tagged('G:')
        except Exception as e:
            outref = open('D:\\ProblemFiles.txt', 'a')
            outref.write(each + '\n')
            outref.close()
            print each
If I start from a reboot, the memory allocation/consumption by Python is fairly small. Over time, though, it increases significantly, and after about a day I have very little free memory (24 GB installed; 294 MB free, 23,960 MB cached), while the memory claimed by Python in the Windows Task Manager is 3 GB. I am watching this increase over the three days it takes to run the code against the file collection.
I was under the impression that since I am doing everything with
tagged = header.process_tagged('G:')
the memory associated with each loop iteration would be freed and garbage collected.
Is there something I can do to force the release of this memory? While I have not collected statistics yet, I can tell by watching disk activity that the process slows down as time progresses (and the memory "lump" gets bigger).
EDIT
I looked at the question referenced below and I do not think it is the same issue: as I understand it, in the other question they are holding onto the objects (a list of triangles) and need the entire list for computation. In each loop I am reading a file, performing some processing on it, and writing it back out to disk. And then I am reading the next file . . .
Regarding possible memory leaks: I am using lxml inside myObject.
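I do not yet know whether the leak is inside my lxml usage, but for reference, the pattern I keep seeing recommended for keeping lxml memory flat is to clear elements while iterating. This is only a sketch (parse_one is a hypothetical name, not my actual code, and it assumes the files parse cleanly with iterparse):

from lxml import etree

def parse_one(path):
    # Hypothetical example, not the real myObject code: stream-parse a file
    # and clear each element after handling it so lxml does not keep the
    # whole tree in memory.
    context = etree.iterparse(path, events=('end',))
    for event, elem in context:
        # ... per-element work would go here ...
        elem.clear()
        # also drop references to already-processed siblings
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context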
Note: I added the line from myUtilities import myObject since the first version of this question; myUtilities holds the code that does everything.
Regarding posting my code for myUtilities: that gets away from the basic question. I am done with header and tagged after each iteration; tagged does some work and writes the results to another drive (in fact, a newly formatted drive).
I looked into using multiprocessing but decided against it on the vague idea that, since this is so I/O-intensive, the processes would be competing for the drive heads. Maybe that is wrong, but since each iteration requires writing a couple of hundred MB of files, I would expect to be competing for write and even read time (a sketch of how I might try it anyway is below).
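If the drive-contention worry turns out to be unfounded, the thing I would try is a multiprocessing.Pool with maxtasksperchild set, so that each worker process is discarded (and its memory handed back to the OS) after a fixed number of files. A rough sketch only, with process_one_file as a hypothetical wrapper around the header/tagged steps:

import glob
import multiprocessing

from myUtilities import myObject

def process_one_file(path):
    # Hypothetical wrapper around the per-file work shown at the top.
    header = myObject(path)
    try:
        header.process_tagged('G:')
        return None
    except Exception as e:
        return path  # report problem files back to the parent process

if __name__ == '__main__':
    files = []
    for year in range(2006, 2015):
        files.extend(glob.glob('D:\\newFilings\\' + str(year) + '\\*\\dirTYPE\\*.sgml'))

    # maxtasksperchild=50 recycles each worker after 50 files, so whatever
    # memory a worker accumulates is released when that worker exits.
    pool = multiprocessing.Pool(processes=2, maxtasksperchild=50)
    problems = [p for p in pool.map(process_one_file, files) if p is not None]
    pool.close()
    pool.join()

    with open('D:\\ProblemFiles.txt', 'a') as outref:
        for p in problems:
            outref.write(p + '\n')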
UPDATE: I had one case in the myObject class where a file was opened with

myString = open(somefile).read()

I changed that to

with open(somefile, 'r') as fHandle:
    myString = fHandle.read()
However, this had no apparent effect. When I started a new cycle I had 4,000 MB of cached memory; after 22 minutes and 27K files processed I had roughly 26,000 MB of cached memory.
I appreciate all of the answers and comments below and have been reading up and testing various things all day. I will keep updating this; I thought this task would take a week and now it looks like it might take over a month.
I keep getting questions about the rest of the code. However, it is over 800 lines, and to me that gets away from the central question.
An instance of myObject is created, and then the methods contained in myObject are applied to header.
This is basically a file transformation: a file is read in, and copies of parts of the file are made and written to disk.
The central question to me is that there is obviously some persistence of either header or tagged. How can I dispose of everything related to header or tagged before I start the next cycle?
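To be concrete, this is roughly what I mean by disposing of everything: the inner loop from the code at the top, with the only references dropped explicitly and a collection forced at the end of each pass (a sketch only, assuming myObject needs no explicit close call):

import gc

for each in sgmlFilings:
    header = myObject(each)
    tagged = None
    try:
        tagged = header.process_tagged('G:')
    except Exception as e:
        with open('D:\\ProblemFiles.txt', 'a') as outref:
            outref.write(each + '\n')
        print each
    # drop this iteration's references before the next file is started
    del header
    del tagged
    gc.collect()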
I have been running the code now for the last 14 hours or so. When it went through the first cycle it took about 22 minutes to process 27K files; now it is taking an hour and a half to handle approximately the same number.
Just running gc.collect() does not work. I stopped the program and tried it in the interpreter, and I saw no movement in the memory statistics.
EDIT: after reading the memory-allocator description below, I am thinking that the amount tied up in the cache is not the problem; it is the amount tied up by the running Python process. So the new test is running the code from the command line. I am continuing to watch and monitor and will post more once I see what happens.
EDIT: still struggling, but I have set up the code to run from a .bat file with the data from one loop of sgmlFilings (see above). The batch file looks like this:
python batch.py
python batch.py
.
.
.
batch.py starts by reading a queue file that holds a list of directories to glob; it takes the first one off the list, updates the list and saves it, and then runs the header and tagged processes, roughly as sketched below. Clumsy, but since python.exe is closed after each iteration, Python never accumulates memory, and the process runs at a consistent speed.
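In case it is useful, batch.py is roughly the following (a sketch; queue.txt is a stand-in name for the real queue file, which just lists the glob patterns one per line):

import glob

from myUtilities import myObject

QUEUE_FILE = 'D:\\queue.txt'  # stand-in name: one glob pattern per line

# take the first pattern off the queue and write the rest back
with open(QUEUE_FILE, 'r') as fHandle:
    patterns = [line.strip() for line in fHandle if line.strip()]

if patterns:
    current, remaining = patterns[0], patterns[1:]
    with open(QUEUE_FILE, 'w') as fHandle:
        fHandle.write('\n'.join(remaining))

    for each in glob.glob(current):
        header = myObject(each)
        try:
            header.process_tagged('G:')
        except Exception as e:
            with open('D:\\ProblemFiles.txt', 'a') as outref:
                outref.write(each + '\n')
            print each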