
I am running the following script to append monthly files to one another, cycling through months and years and appending each file if it exists. I have just tested it with a larger dataset, where I would expect the output file to be roughly 600 MB in size, and I am running into memory issues. Firstly, is it normal to run into memory issues here (my PC has 8 GB of RAM)? I am not sure how I am eating up all of this memory.

Code I am running

import datetime,  os
import StringIO

stored_data = StringIO.StringIO()

start_year = "2011"
start_month = "November"
first_run = False

current_month = datetime.date.today().replace(day=1)
possible_month = datetime.datetime.strptime('%s %s' % (start_month, start_year), '%B %Y').date()
while possible_month <= current_month:
    csv_filename = possible_month.strftime('%B %Y') + ' MRG.csv'
    if os.path.exists(csv_filename):
        with open(csv_filename, 'rb') as current_csv:
            if first_run != False:
                next(current_csv)
            else:
                first_run = True
            stored_data.writelines(current_csv)
    possible_month = (possible_month + datetime.timedelta(days=31)).replace(day=1)
if stored_data:
    contents = stored_data.getvalue()
    with open('FullMergedData.csv', 'wb') as output_csv:
        output_csv.write(contents)

The traceback I receive:

Traceback (most recent call last):
  File "C:\code snippets\FullMerger.py", line 23, in <module>
    contents = stored_output.getvalue()
  File "C:\Python27\lib\StringIO.py", line 271, in getvalue
    self.buf += ''.join(self.buflist)
MemoryError

Any ideas on how to achieve a workaround, or how to make this code more efficient to overcome this issue? Many thanks,
AEA

Edit 1

Upon running the code supplied by aIKid, I received the following traceback.

Traceback (most recent call last):
  File "C:\FullMerger.py", line 22, in <module>
    output_csv.writeline(line)
AttributeError: 'file' object has no attribute 'writeline'

I fixed the above by changing it to `writelines`; however, I still received the following traceback.

Traceback (most recent call last):
  File "C:\FullMerger.py", line 19, in <module>
    next(current_csv)
StopIteration
AEA
  • Could you write directly to the csv instead of storing the data in memory then writing? – SethMMorton Nov 01 '13 at 02:54
  • @SethMMorton yes, potentially, but I am trying to avoid too many file writes. The script above is instant for smaller data files, whereas other methods which involve multiple reads and writes have proven slow. – AEA Nov 01 '13 at 03:19

2 Answers


In `stored_data`, you're trying to store the whole file in memory, and since it's too large, you're getting the error you're seeing.

One solution is to write the output file line by line. That is far more memory-efficient, since you only hold a single line of data in the buffer at a time, instead of the whole 600 MB.

In short, the structure can be something like this:

with open('FullMergedData.csv', 'a') as output_csv:  # 'a' appends the result
                                                     # to the file.
    with open(csv_filename, 'rb') as current_csv:
        if first_run:
            next(current_csv)       # skip the header row of every file after the first
        else:
            first_run = True        # keep the header from the first file only
        for line in current_csv:    # loop through the lines
            output_csv.write(line)  # write it per line

That should fix your problem. Hope this helps!
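For completeness, here is a minimal sketch (not aIKid's exact code) of how this line-by-line approach could sit inside the month loop from the question, with the output file opened once rather than reopened for every month; the file-naming scheme and variable names are taken from the question:

import datetime, os

start_year = "2011"
start_month = "November"
first_run = False

current_month = datetime.date.today().replace(day=1)
possible_month = datetime.datetime.strptime('%s %s' % (start_month, start_year), '%B %Y').date()

# Open the output once; each monthly file is streamed into it line by line.
with open('FullMergedData.csv', 'wb') as output_csv:
    while possible_month <= current_month:
        csv_filename = possible_month.strftime('%B %Y') + ' MRG.csv'
        if os.path.exists(csv_filename):
            with open(csv_filename, 'rb') as current_csv:
                if first_run:
                    next(current_csv, None)  # drop the header of later files;
                                             # the None default avoids StopIteration on an empty file
                else:
                    first_run = True         # keep the header from the first file
                for line in current_csv:
                    output_csv.write(line)   # only one line held in memory at a time
        possible_month = (possible_month + datetime.timedelta(days=31)).replace(day=1)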

aIKid
  • Thanks for the reply, I didn't know how to update my code with the ideas posted in my old question, hence I have input my full code here. I have added the tracebacks to my question. Many thanks – AEA Nov 01 '13 at 03:24
  • Ah, you never changed `first_run` to True. – aIKid Nov 01 '13 at 03:26
  • Also, if this solves your question, please do upvote, and accept :) – aIKid Nov 01 '13 at 03:29
  • Hey @aIKid, whilst this did fix the traceback, the code does not work as it should: it doesn't remember the last file as it cycles through the different files, i.e. it overwrites the output when it cycles to the next monthly file. – AEA Nov 01 '13 at 03:35
  • Yep indeed, and it isn't by any means quick. The original code was instant; this will take a few minutes to run (not ideal). – AEA Nov 01 '13 at 03:38
  • How did the original code work? You stored the whole data in the buffer? – aIKid Nov 01 '13 at 03:41
  • The fix worked correctly, however for data adding up to about 600 MB it takes 3-4 minutes. The original code essentially did the same thing but kept everything in memory and wrote it all at once to the output file. Using my original code with half my data it creates the file in under 0.304 seconds, whereas your version takes 49.435 seconds, which as you can see is significantly slower. – AEA Nov 01 '13 at 03:50
  • That's so wrong, since executing it in 0.3 seconds is impossible. Have you tried moving a 600 MB file from a flash disk to your computer? Your original code either didn't work, or it doesn't produce your expected result. You got a `MemoryError` for that, right? – aIKid Nov 01 '13 at 03:52
  • I just ran it on an even smaller set of data and the times came out as follows: `AEA Routine took 0.190 seconds` and `alKid Routine took 2.738 seconds`. I used Notepad++'s compare and both files are identical. Running the 600 MB worth of data gets a `MemoryError`, yes. – AEA Nov 01 '13 at 03:59
  • For smaller sets, that's clear, since my method is designed to process big data, and your original code was supposed to handle only small data. But trust me, for your big file it would be hard to find a faster solution. – aIKid Nov 01 '13 at 04:02

Your memory error occurs because you store all the data in a buffer before writing it. Consider using something like `shutil.copyfileobj` to copy directly from one open file object to another; this only buffers small amounts of data at a time. You could also do it line by line, which will have much the same effect.

Update

Using copyfileobj should be much faster than writing the file line by line. Here is an example of how to use copyfileobj. This code opens two files, skips the first line of the input file if skip_first_line is True and then copies the rest of that file to the output file.

import shutil

skip_first_line = True

with open('FullMergedData.csv', 'a') as output_csv:
    with open(csv_filename, 'rb') as current_csv:
        if skip_first_line:
            current_csv.readline()  # consume the header line before copying
        shutil.copyfileobj(current_csv, output_csv)

Notice that if you're using copyfileobj you'll want to use current_csv.readline() instead of next(current_csv). That's because iterating over a file object buffers part of the file, which is normally very useful, but you don't want that in this case. More on that here.
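For illustration, a small sketch (using a hypothetical example.csv) of the read-ahead behaviour described above on Python 2; the positions it prints are what you would typically see, not guaranteed values:

with open('example.csv', 'rb') as f:
    next(f)          # iteration reads ahead into a hidden buffer...
    print f.tell()   # ...so the file position is usually well past the first line

with open('example.csv', 'rb') as f:
    f.readline()     # consumes exactly one line
    print f.tell()   # position is just past the first newline, where copyfileobj should start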

Bi Rico
  • I looked in the documentation but was unable to find a suitable example. Are you able to provide a small example of how to use it correctly? – AEA Nov 01 '13 at 03:27
  • `copyfileobj` is used to copy the whole file; you should skip the first line yourself if you want to remove it. – aIKid Nov 01 '13 at 03:29
  • @aIKid copyfileobj starts copying from the current position of the file pointer. This means you can just read as much of the file as you'd like to skip and then start copying. Of course this means you need to watch out for any kind of buffering. – Bi Rico Nov 01 '13 at 17:05
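Putting the pieces together, a minimal sketch (reusing the question's file-naming scheme and variable names, not code from either answer) of copying every monthly file into the output with copyfileobj might look like this:

import datetime, os, shutil

start_year = "2011"
start_month = "November"
first_file = True

current_month = datetime.date.today().replace(day=1)
possible_month = datetime.datetime.strptime('%s %s' % (start_month, start_year), '%B %Y').date()

with open('FullMergedData.csv', 'wb') as output_csv:
    while possible_month <= current_month:
        csv_filename = possible_month.strftime('%B %Y') + ' MRG.csv'
        if os.path.exists(csv_filename):
            with open(csv_filename, 'rb') as current_csv:
                if first_file:
                    first_file = False      # keep the header from the first file
                else:
                    current_csv.readline()  # skip the header of later files
                shutil.copyfileobj(current_csv, output_csv)  # copy the rest in chunks
        possible_month = (possible_month + datetime.timedelta(days=31)).replace(day=1)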