
I am trying to read a 59 GB file and break it into a number of new files according to an id found at the beginning of each line. I am running the code below, which breaks with a memory error after producing 45 GB of files. The system memory remains at a very low level the whole time, and then the MemoryError suddenly appears after the code has run for about 2 hours. I have 16 GB of RAM. Am I using buffering wrongly? Any ideas?

import os

outputFile = '/home/.../folder1'
directory = '/home/.../folder2/'

with open(directory + 'aldk_tab_1mn.csv', 'r', buffering=50000000) as fin: 
    firstLine = fin.readline()
    print(firstLine)

    for line in fin:
        testChar = line[0:4]
        if testChar[0] == 'A' :
            if not os.path.exists(outputFile + '/A/' + testChar+'.csv'):   # first time open a file             
                with open(outputFile + '/A/' + testChar+'.csv', 'a') as foutA:
                    print('file', testChar, 'created')                    
                    foutA.write(firstLine)
                    foutA.write(line)          
            else: 
                with open(outputFile + '/A/' + testChar+'.csv', 'a') as foutA:
                    foutA.write(line)          
        else:
            if not os.path.exists(outputFile + '/B/' + testChar+'.csv'):   # first time open a file             
                with open(outputFile + '/B/' + testChar+'.csv', 'a') as foutB:
                    print('file', testChar, 'created')
                    foutB.write(firstLine)
                    foutB.write(line) 
            else: 
                with open(outputFile + '/B/' + testChar+'.csv', 'a') as foutB:
                    foutB.write(line)   

The error produced is:

MemoryError                               Traceback (most recent call last)
<ipython-input-17-761f2fcce982> in <module>()
      6 
----> 7     for line in fin:
      8         testChar = line[0:4]
      9         if testChar[0] == 'A' :

MemoryError: 
saias
  • This may help: https://stackoverflow.com/a/14268804/4737952 – Ashish Acharya May 03 '18 at 10:13
  • 5
    Could your issue lie in a surprisingly long line in the large file you read? In that case, it might be a good idea to not iterate over complete lines at once. – Martijn May 03 '18 at 10:15
  • have you tried to remove the buffering? BTW can you fix the spelling errors in your question? it's very difficult to read it. – Jean-François Fabre May 03 '18 at 10:15
  • 1
    @Martijn good point. If suddenly a line is 50GB long because it's missing line terminators, then OP's toast. That must be the issue... in that case, reading char by char is the solution, but OP input is probably corrupt instead... – Jean-François Fabre May 03 '18 at 10:16
  • thanks for the suggestions. No there is no such chance since these are data measurements of specific variables. I will still check it though since that would create an issue with my data collection process. – saias May 03 '18 at 10:19
  • I haven't removed buffering yet since I thought the whole idea for using it is to avoid loading the whole file. I will try it to see if anything will change. – saias May 03 '18 at 10:24
  • 1
    You may try to lower the buffer down. 50MB seems pretty high to me. Speaking of buffers, you may want to have write buffers as well. So that you don't open write + close write every single time you want to write down a line. – Samuel GIFFARD May 03 '18 at 10:27
  • OK, I'm not sure what a write buffer is, I should check that. Regarding the buffering size, why is 50 MB high? Can someone elaborate on buffering usage or share a link? – saias May 03 '18 at 10:35
  • If your lines are significantly smaller than 50MB, it might be a good idea to use the default line buffering, at least for the debugging stage. This will already lower the memory requirements by a lot. You won't need to hold 50 MB at once in memory after all. – Martijn May 03 '18 at 10:35
  • You might also want to look into the output of [memory-profiler](https://pypi.org/project/memory_profiler/) for your code if the problem is not in your file. – Martijn May 03 '18 at 10:41
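Following up on the long-line hypothesis raised in the comments, one way to check for a runaway line without ever holding it whole in memory is to cap `readline` with its optional size argument. This is a diagnostic sketch, not part of the question's code; the function name and the 10 MB threshold are arbitrary:

```python
# Diagnostic sketch: scan a file for abnormally long "lines" while
# holding at most max_len bytes of any one line in memory at a time.
# file.readline(size) returns at most `size` bytes, so a chunk that
# comes back full and without a trailing newline signals a runaway line.

def find_long_lines(path, max_len=10_000_000):
    """Yield (line_number, byte_offset) for every line longer than max_len."""
    line_no = 0
    with open(path, 'rb') as f:
        while True:
            offset = f.tell()
            chunk = f.readline(max_len)
            if not chunk:                 # EOF
                break
            line_no += 1
            if len(chunk) == max_len and not chunk.endswith(b'\n'):
                yield line_no, offset
                # consume the rest of this oversized line in bounded chunks
                while True:
                    rest = f.readline(max_len)
                    if not rest or rest.endswith(b'\n'):
                        break
```

Running this over the 59 GB input before the split would reveal whether a missing line terminator (or a run of null/empty characters) is producing a single multi-gigabyte "line".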

1 Answer


Thanks for the responses; it appears that the file had empty characters after a certain point, making the line variable explode, as @Martijn suggested!

So I did slice my file! Thanks guys!
