
I'm writing a Python script in which I read a big file (~5 GB) line by line, make some modifications to each line, and then write the result to another file.

When I use file.readlines() to read the input file, my disk usage reaches ~90% and the disk speed exceeds 100 Mbps (I know this method shouldn't be used for large files).

I haven't measured the program execution time for the above case as my system becomes unresponsive (the memory gets full).

When I use an iterator like the one below (and this is what I'm actually using in my code):

with open('file.csv', 'r') as inFile:
    for line in inFile:
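
For completeness, the whole loop is roughly the following (out.csv and process_line are just placeholders for my actual output file and modifications):

def process_line(line):
    # placeholder: the real code makes some string modifications here
    return line

with open('file.csv', 'r') as inFile, open('out.csv', 'w') as outFile:
    for line in inFile:
        outFile.write(process_line(line))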

With this approach, my disk usage remains < 10%, the speed stays < 5 Mbps, and it takes ~20 minutes for the program to finish for a 5 GB file. Wouldn't this time be lower if my disk usage were higher?

Also, should it really take ~20 minutes to read a 5 GB file, process it line by line with some modifications, and write the result to a new file, or am I doing something wrong?

What I can't figure out is why the program doesn't use my system to its full potential when performing the I/O operations. If it did, my disk usage should have been higher, right?

noobcoder
  • I suspect that some of the disk usage is virtual RAM (swap), as it can't hold the whole file in memory. This would significantly increase disk usage but make the operation slower. – Artyer Jun 19 '17 at 14:02
  • My system's memory size is 8 GB, so I think it can actually load the entire file into memory. Not sure though. – noobcoder Jun 19 '17 at 14:04
  • Your disk usage will have two parts: you are reading the original file, but also, when your RAM reaches its limits, parts of the RAM are swapped to disk to 'extend' it. This is particularly heavy disk usage and you should try to avoid it - for example by processing the data as it comes in and immediately freeing the memory. You may have a lot of RAM, but storage in memory is generally less optimal. Also, the OS reserves a large part of the RAM, as does the interpreter. – jcoppens Jun 19 '17 at 14:06
  • E.g. I have 8 GB of RAM at the moment and only 0.5 GB free (and I'm not running any large program!). Check the free RAM on your machine (on Linux, use `free`). – jcoppens Jun 19 '17 at 14:12
  • With 10 Chrome tabs open, I've got 4.7 GB of available RAM. – noobcoder Jun 19 '17 at 14:15
  • Please share the code that performs the "modifications" you mentioned. I think the problem is there. – Miguel Ortiz Jun 21 '17 at 00:15

4 Answers


I don't think your main problem is reading the file, since you're using open() and iterating line by line. Instead, I would check what you are doing here:

make some modifications to each line, and then write the result to another file.

So, try reading the file without making any modifications or writing them to another file, to find out how long it takes your system just to read the file.

Here's how I tested it in my environment after reading this, this, this and this:

First, I created a 1.2 GB file:

timeout 5 yes "Ergnomic systems for c@ts that works too much" >> foo

I didn't use dd or truncate, as those would lead to MemoryErrors while reading the files.

Now some I/O testing, reading the file line by line; as @Serge Ballesta mentioned, this is an already optimized operation:

#!/usr/bin/python
with open('foo') as infile:
    for line in infile:
        pass
    print 'file read'

$ time python io_test.py
file read

real    0m2.647s
user    0m2.343s
sys     0m0.327s

Changing buffering options with open():

# --------------------------------------NO BUFFERING
with open('foo','r',0) as infile:
    for line in infile:
        pass
    print 'file read'

$ time python io_test.py
file read

real    0m2.787s
user    0m2.406s
sys     0m0.374s

# --------------------------------------ONE LINE BUFFERED
with open('foo','r',1) as infile:
    for line in infile:
        pass
    print 'file read'

$ time python io_test.py
file read

real    0m4.331s
user    0m2.468s
sys     0m1.811s

# -------------------------------------- ~700 MB BUFFER
with open('foo','r',700000000) as infile:
    for line in infile:
        pass
    print 'file read'

$ time python io_test.py
file read

real    0m3.137s
user    0m2.311s
sys     0m0.827s

Why you should not use readlines:

with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass

$ time python io_test.py

real    0m6.428s
user    0m3.858s
sys     0m2.499s

Miguel Ortiz
  • Thanks for the link. I'm not actually using .readlines() in my code; I used it in the question to compare it with my code. What I can't figure out is why the program doesn't use my system to its potential when performing the I/O operations, because if it did, my disk usage should have been higher, right? – noobcoder Jun 19 '17 at 14:22
  • For a 1 GB file, it took around 13 seconds for my system to load it. – noobcoder Jun 20 '17 at 07:25
  • When I wrote the data being read from the input file into another file, the execution time was ~1 minute (for both reading and writing). This means reading and writing from a 5 GB file should take ~5 minutes. But it's taking ~20 minutes for my program (which reads data from a 5 GB file line by line, performs some modifications to it and then writes it to another file) to finish execution. – noobcoder Jun 20 '17 at 07:48
  • Yes @noobcoder, the thing is the operations that "modify" the file, which I think could be the cause of the poor performance of your script. Try my suggestion: use "pass" instead of your modifying operations and we'll know whether or not that's the issue. As you said, it took 1 minute for you to do the reading and writing. – Miguel Ortiz Jun 21 '17 at 00:14
  • I had already tried 'pass' and reported the results above. It took ~13 seconds for a 1 GB file. After a bit of code optimization, using batches to write to the disk, and most importantly switching my laptop to high-performance mode, the execution time of my script came down to ~8 minutes. – noobcoder Jun 22 '17 at 06:58

Reading a file line by line in Python is already an optimized operation: Python loads an internal buffer from the disk and hands it out to the caller in lines. That means the line identification is already done in memory by the Python interpreter.

Normally, processing can be disk IO bound (when disk access is the limiting factor), memory bound, or processor bound. If a network is involved, it can also be network IO bound or remote-server bound, again depending on the limiting factor. As you process the file line by line, it is quite unlikely for the process to be memory bound. To find out whether disk IO is the limiting part, you could try to simply copy the file with the system copy utility. If that takes about 20 minutes, the process is disk IO bound; if it is much quicker, the modifications to the lines cannot be neglected.
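
If you want to do the check from Python directly, here is a rough sketch (the file names are placeholders):

import shutil
import time

# Pure disk IO: copy the file with no Python-level processing at all.
start = time.time()
shutil.copyfile('file.csv', 'copy.csv')
print('copy only: %.1f s' % (time.time() - start))

# Line iteration without modifications: adds the line splitting, nothing else.
start = time.time()
with open('file.csv') as f:
    for line in f:
        pass
print('read by line: %.1f s' % (time.time() - start))

If the copy alone already takes ~20 minutes, the disk is the bottleneck; if both are much faster, the time is being spent in the modifications.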

Anyway, loading a big file in memory is always a bad idea...

Serge Ballesta

It simply depends on the size of the buffer you use when reading the file.

Let's look at an example:

You have a file which contains 20 characters.

Your buffer size is 2 characters.

Then you have to use at least 10 system calls to read the entire file.

A system call is a very expensive operation because the kernel has to switch the execution context.

If you have a buffer which is 20 characters in size, you need just 1 system call, and therefore only one kernel trap is necessary.

I assume that the first function simply uses a bigger buffer internally.
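
A rough way to see this effect (just a sketch; 'foo' is a placeholder for a large test file and the buffer sizes are arbitrary):

import os
import time

def count_reads(path, bufsize):
    # Read the whole file with raw os.read calls of a fixed size and
    # count them: each os.read is one system call.
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while os.read(fd, bufsize):
            calls += 1
    finally:
        os.close(fd)
    return calls

for bufsize in (512, 64 * 1024, 8 * 1024 * 1024):
    start = time.time()
    calls = count_reads('foo', bufsize)
    print('%10d-byte buffer: %9d read() calls in %.2f s'
          % (bufsize, calls, time.time() - start))

The smaller the buffer, the more system calls are needed and the slower even a plain read becomes.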

Appyx
  • The first function actually loads the entire file into memory. The second function uses a buffer. I've tried giving it various buffer sizes, but that did not seem to improve the performance. – noobcoder Jun 19 '17 at 14:08
  • The first function has to use a buffer too. But it's a really big one. – Appyx Jun 19 '17 at 14:11
  • 1
  • Any buffer size you give the function that is longer than the line itself has no effect on performance, because after the newline character the rest of the buffer will not be filled. So you still use the same number of system calls. – Appyx Jun 19 '17 at 14:15
  • That explains why a higher buffer value didn't affect the performance. I was under the impression that a higher buffer would make the program load more lines of the file into memory. – noobcoder Jun 19 '17 at 14:25
  • The only solution for performance is to load everything with a big buffer and edit the lines in memory. – Appyx Jun 19 '17 at 14:27
  • Doesn't that contradict what you mentioned above? Even if I give a higher buffer value, the program would only read a single line (each row in the input file has a newline character at the end). – noobcoder Jun 19 '17 at 14:29
  • I meant you should read the file with the first method, or with another read function which loads everything into RAM, and then edit the lines. – Appyx Jun 19 '17 at 14:35

You not only need RAM for the file, but also for input and output buffers and a second copy of your modified file. This can easily overwhelm your RAM. If you do not want to read, modify, and write each single line in a for loop, you may want to group some lines together. This will probably make reading/writing faster, but at the cost of some more algorithmic overhead. At the end of the day I'd use the line-by-line approach. HTH! LuI
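
For example, a sketch of the grouping idea (modify and the batch size are placeholders):

BATCH = 10000  # arbitrary number of lines to collect before each write

def modify(line):
    # placeholder for the actual per-line changes
    return line

with open('file.csv') as src, open('out.csv', 'w') as dst:
    batch = []
    for line in src:
        batch.append(modify(line))
        if len(batch) >= BATCH:
            dst.writelines(batch)
            batch = []
    if batch:
        dst.writelines(batch)  # flush the remaining lines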

LuI