Process very large (>20GB) text file line by line

Question

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields and I want to remove the last three characters from each of the first three fields - which should reduce the file size by about 20%.

I am brand new to Python and have some code that does what I want at about 3.4 GB per hour, but for this to be a worthwhile exercise I really need at least 10 GB/hr. Is there any way to speed this up? This code doesn't come close to challenging my processor, so I am making an uneducated guess that it is limited by the read and write speed of the internal hard drive.

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
        l = r.readline()
    r.close()
    w.close()

Any help would be really appreciated. I am using the IDLE Python GUI on Windows 7 and have 16GB of memory. Perhaps a different OS would be more efficient?

Edit: Here is an extract of the file to be processed.

70700.642014 31207.277115 -0.054123 -1585 255 255 255
70512.301468 31227.990799 -0.255600 -1655 155 158 158
70515.727097 31223.828659 -0.066727 -1734 191 187 180
70566.756699 31217.065598 -0.205673 -1727 254 255 255
70566.695938 31218.030807 -0.047928 -1689 249 251 249
70536.117874 31227.837662 -0.033096 -1548 251 252 252
70536.773270 31212.970322 -0.115891 -1434 155 158 163
70533.530777 31215.270828 -0.154770 -1550 148 152 156
70533.555923 31215.341599 -0.138809 -1480 150 154 158

If you are writing in Python 2.7, you could try running on [PyPy](http://pypy.org/). The just-in-time compiler could give you performance speedup on your field shuffling, though I'm not sure how much that would help if the filesystem is the bottleneck. — pcurry, May 21 '13 at 12:20

John La Rooy · Answer 1 · 2013-05-21T12:51:17.177

32

It's more idiomatic to write your code like this:

def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))

The main saving here is doing the split only once, but if the CPU is not being taxed, this is likely to make very little difference.

It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only about 54 MB of text (somewhat more in RAM once Python's per-string overhead is counted)!

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)

As suggested by @Janne, here is an alternative way to generate the lines:

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)

edited May 21 '13 at 12:51

answered May 21 '13 at 12:09


if the lines are of constant size, you could try reading/writing the file in larger chunks... – root May 21 '13 at 12:13
@root Shouldn't the `for` stuff do the buffering in that (and other) case(s)? – glglgl May 21 '13 at 12:14
@glglgl -- well it could make it possible to do the replace operations on thousands of lines at the time... (not sure which way would be the fastest - maybe a regex?) – root May 21 '13 at 12:23
@root, the replacements are different per line. Anyway the OP doesn't seem to be CPU bound – John La Rooy May 21 '13 at 12:28
If I understood the requirements, you could use `write(x[:-3]+' '+y[:-3]+' '+z[:-3]+'\n')` instead of the `replace` chain. – Janne Karila May 21 '13 at 12:30
@Janne, perhaps...i'll add that to my answer – John La Rooy May 21 '13 at 12:32
Of course it doesn't solve the IO problem but `re.sub('(?<=\.\d{3})\d{3}', '', r.read(54*100000))` gives a 20% speedup and 30 gb/h on a beefy ubuntu machine... – root May 21 '13 at 21:59
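The regex idea from the last comment can be sketched as follows. This is a minimal sketch, assuming Python 3 and that only the first three fields contain a decimal point (true for the sample lines in the question). The names `strip_trailing_digits` and `process_in_blocks` are made up for illustration, and `readlines()` with a size hint is used instead of a fixed-size `read()` so that no line is split across two blocks.

```python
import re

# Matches three digits that directly follow a decimal point plus three digits,
# i.e. decimal places four to six of a six-decimal number.
TRAILING_DIGITS = re.compile(r"(?<=\.\d{3})\d{3}")


def strip_trailing_digits(text):
    """Drop decimal places 4-6 from every six-decimal number in text."""
    return TRAILING_DIGITS.sub("", text)


def process_in_blocks(in_path, out_path, block_bytes=54 * 100000):
    """Rewrite in_path to out_path, handling many lines per regex call."""
    with open(in_path, "r") as r, open(out_path, "w") as w:
        while True:
            # readlines(hint) returns whole lines totalling roughly `hint`
            # characters, so a line is never cut at a block boundary.
            lines = r.readlines(block_bytes)
            if not lines:
                break
            w.write(strip_trailing_digits("".join(lines)))
```

Whether this beats the per-line versions depends on the machine, so it is worth timing both on a sample of the real data.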

score 13 · Answer 2 · answered May 21 '13 at 12:43

Measure! You have received quite a few useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first steps to find your bottleneck would be:

Remove all processing from your code. Just read and write the data and measure the speed. If that alone is too slow, the problem is not in your code.
If plain reading and writing is already slow, try using multiple disks. You are reading and writing at the same time. On the same disk? If so, try different disks and measure again.
An async I/O library (Twisted?) might help too.

Once you have figured out the exact problem, ask again about optimizing that specific problem.
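The first step above (read and write with no processing) could be measured with something like the following. It is only a sketch, assuming Python 3; the function name `copy_unprocessed` and the 16 MB chunk size are arbitrary choices for illustration. On small files the operating system's cache can make the result look far better than the disk really is, so it should be run on a file of several gigabytes.

```python
import time


def copy_unprocessed(in_path, out_path, chunk_bytes=16 * 1024 * 1024):
    """Copy in_path to out_path with no processing; return throughput in MB/s."""
    copied = 0
    start = time.perf_counter()
    with open(in_path, "rb") as r, open(out_path, "wb") as w:
        while True:
            chunk = r.read(chunk_bytes)
            if not chunk:
                break
            w.write(chunk)
            copied += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small files.
    return copied / (1024 * 1024) / max(elapsed, 1e-9)
```

If the figure returned here is close to the 3.4 GB/hr (about 1 MB/s) seen with the full script, the string handling is not the bottleneck.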

score 10 · Answer 3 · answered May 21 '13 at 13:42

As you don't seem to be limited by CPU, but rather by I/O, have you tried varying the third parameter of open?

Indeed, this third parameter can be used to give the buffer size to be used for file operations!

Simply writing open( "filepath", "r", 16777216 ) will use 16 MB buffers when reading from the file. It should help.

Use the same for the output file, and measure/compare with identical file for the rest.

Note: This is the same kind of optimization suggested by others, but you can gain it here for free, without changing your code and without having to buffer yourself.
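Applied to the processing loop from the other answers, the idea might look like this. It is a sketch assuming Python 3; `process_with_big_buffers` is a hypothetical name, and 16 MB is simply the value quoted above, not a tuned one.

```python
BUFFER_BYTES = 16 * 1024 * 1024  # 16 MB, the value from the answer; worth tuning


def process_with_big_buffers(in_path, out_path, buffer_bytes=BUFFER_BYTES):
    # The third positional argument of open() is `buffering`; it is passed by
    # keyword here for readability.
    with open(in_path, "r", buffering=buffer_bytes) as r, \
         open(out_path, "w", buffering=buffer_bytes) as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))
```

The loop body is unchanged from the unbuffered version, so timing both on the same input gives a direct comparison.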

score 9 · Answer 4 · edited May 23 '17 at 10:30

I'll add this answer to explain why buffering makes sense and also offer one more solution

You are getting breathtakingly bad performance. The article "Is it possible to speed-up python IO?" shows that a 10 GB read should take in the neighborhood of 3 minutes. Sequential write is the same speed, so reading and then writing 10 GB should take about 6 minutes, or roughly 100 GB/hr. At 3.4 GB/hr you're missing a factor of 30, and your performance target of 10 GB/hr is still 10 times slower than what ought to be possible.

Almost certainly this kind of disparity lies in the number of head seeks the disk is doing. A head seek takes milliseconds. A single seek costs as much time as several megabytes of sequential reading or writing. Enormously expensive. Copy operations on the same disk require seeking between input and output. As has been stated, one way to reduce seeks is to buffer in such a way that many megabytes are read before writing to disk and vice versa. If you can convince the Python I/O system to do this, great. Otherwise you can read and process lines into a string array and then write after perhaps 50 MB of output are ready. This size means a seek will induce a <10% performance hit with respect to the data transfer itself.
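The "string array, then write after about 50 MB" approach could be sketched like this, assuming Python 3. The name `process_with_output_batches` is hypothetical. Note that the input side still goes through Python's normal read buffer here, so how many seeks this actually saves depends on the operating system's read-ahead and is worth measuring.

```python
def process_with_output_batches(in_path, out_path, batch_bytes=50 * 1024 * 1024):
    pending = []       # processed lines waiting to be written
    pending_size = 0   # total characters currently held in `pending`
    with open(in_path, "r") as r, open(out_path, "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            out = ' '.join((x[:-3], y[:-3], z[:-3], rest))
            pending.append(out)
            pending_size += len(out)
            if pending_size >= batch_bytes:
                w.write(''.join(pending))  # one large sequential write
                pending = []
                pending_size = 0
        w.write(''.join(pending))          # write whatever is left
```

Counting characters rather than lines keeps the batch size predictable even if line lengths vary between files.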

The other very simple way to eliminate seeks between input and output files altogether is to use a machine with two physical disks and fully separate io channels for each. Input from one. Output to other. If you're doing lots of big file transformations, it's good to have a machine with this feature.

Iyvin Jose · Answer 5 · 2018-08-20T10:00:50.260

Here's the code for loading text files of any size without causing memory issues. It supports gigabyte-sized files. It should run smoothly on any kind of machine; you just need to configure CHUNK_SIZE based on your system RAM. The larger the CHUNK_SIZE, the more data is read at a time.

https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d

Download the file data_loading_utils.py and import it into your code.

usage

import data_loading_utils
file_name = 'file_name.ext'
CHUNK_SIZE = 1000000


def process_lines(line, eof, file_name):

    # check if end of file reached
    if not eof:
         # process data, line is one single line of the file
         pass
    else:
         # end of file reached
         pass

data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)

The process_lines function is the callback. It is called for every line, with the parameter line holding one single line of the file at a time.

You can configure the variable CHUNK_SIZE depending on your machine hardware configurations.
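For readers who want to see the general idea without downloading the gist, a chunked reader with the same callback shape might look roughly like this. This is not the gist's implementation: `read_lines_in_chunks` is a hypothetical stand-in written for illustration, assuming Python 3 and "\n" line endings after text-mode translation.

```python
def read_lines_in_chunks(file_name, chunk_size, callback):
    """Call callback(line, eof, file_name) for every line of file_name.

    Hypothetical stand-in for illustration, not the gist's code.
    """
    leftover = ""  # incomplete last line carried over from the previous chunk
    with open(file_name, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split("\n")
            leftover = lines.pop()  # text after the last newline, may be ""
            for line in lines:
                callback(line, False, file_name)
    if leftover:
        callback(leftover, False, file_name)  # final line without a newline
    callback(None, True, file_name)           # signal end of file
```

Only one chunk plus one partial line is held in memory at a time, which is what keeps memory use independent of the file size.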

I am trying to use your code, but getting an error that `NameError: name 'self' is not defined.` In this case, what object is `self` referring to? Thanks! — horcle_buzz, Aug 18 '18 at 15:37
@horcle_buzz. apologies for the error raised. I have updated the code. please check — Iyvin Jose, Aug 20 '18 at 10:01

score 5 · Answer 6 · edited May 21 '13 at 13:20

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:

As has been suggested already, you may want to use a for loop to make this more optimal.

    x = l.split(' ')[0]
    y = l.split(' ')[1]
    z = l.split(' ')[2]

You are performing a split operation 3 times here; depending on the size of each line, this will have a detrimental impact on performance. You should split once and assign x, y, z to the entries in the list that comes back.

    w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))

Each line you are reading, you are writing immediately to the file, which is very I/O intensive. You should consider buffering your output to memory and pushing to the disk periodically. Something like this:

BUFFER_SIZE_LINES = 1024 # Maximum number of lines to buffer in memory

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("outfilepath", "w")
    buf = ""
    bufLines = 0
    for lineIn in r:

        x, y, z = lineIn.split(' ')[:3]
        lineOut = lineIn.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3])

        # lineIn already ends with "\n", so no extra newline is added
        buf += lineOut
        bufLines += 1

        if bufLines >= BUFFER_SIZE_LINES:
            # Flush buffer to disk
            w.write(buf)
            buf = ""
            bufLines = 0

    # Flush remaining buffer to disk
    w.write(buf)
    r.close()
    w.close()

You can tweak BUFFER_SIZE_LINES to determine an optimal balance between memory usage and speed.

score 4 · Answer 7 · answered May 21 '13 at 12:14

Your code is rather un-idiomatic and makes far more function calls than needed. A simpler version is:

def ProcessLargeTextFile():
    with open("filepath") as r, open("output", "w") as w:
        for line in r:
            fields = line.split(' ')
            # Replace the first three fields with their truncated versions
            fields[0:3] = [fields[0][:-3],
                           fields[1][:-3],
                           fields[2][:-3]]
            w.write(' '.join(fields))

and I don't know of a modern filesystem that is slower than Windows. Since it appears you are using these huge data files as databases, have you considered using a real database?

Finally, if you are just interested in reducing file size, have you considered compressing / zipping the files?
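If compression is an option, the truncation and the compression can be done in one pass. The sketch below assumes Python 3 (for text-mode `gzip.open`); `process_to_gzip` is a hypothetical name. It only helps if whatever consumes the files can read gzip or the files are decompressed before use.

```python
import gzip


def process_to_gzip(in_path, out_gz_path):
    # "wt" opens the gzip stream in text mode so str lines can be written.
    with open(in_path, "r") as r, gzip.open(out_gz_path, "wt") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))
```

Compression moves work from the disk to the CPU, which may suit a job that is I/O bound, but the actual gain needs to be measured.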

score 3 · Answer 8 · answered May 21 '13 at 12:07

Read the file using `for l in r:` to benefit from buffering.

Janne Karila


craastad · Answer 9 · 2013-05-21T12:55:15.963

Those seem like very large files... Why are they so large? What processing are you doing per line? Why not use a database with some map reduce calls (if appropriate) or simple operations on the data? The point of a database is to abstract the handling and management of large amounts of data that can't all fit in memory.

You can start to play with the idea with sqlite3 which just uses flat files as databases. If you find the idea useful then upgrade to something a little more robust and versatile like postgresql.

Create a database

import sqlite3
conn = sqlite3.connect('pts.db')
c = conn.cursor()

Create a table

c.execute('''CREATE TABLE ptsdata (filename, lineNumber, x, y, z)''')

Then use one of the algorithms above to insert all the lines and points in the database by calling

c.execute("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", (filename, lineNumber, x, y, z))

Now how you use it depends on what you want to do. For example, to work with all the points in a file, run a query

c.execute("SELECT lineNumber, x, y, z FROM ptsdata WHERE filename=? ORDER BY lineNumber ASC", ("file.txt",))

And get n lines at a time from this query with

c.fetchmany(size=n)

I'm sure there is a better wrapper for the sql statements somewhere, but you get the idea.
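Put together, the steps above might look like the following self-contained sketch. It assumes Python 3, uses an in-memory database so it can be run without creating a file, and stores x, y, z as floats; `rows_from_lines` and `load_points` are hypothetical helper names. For a real 60GB input the database would be a file on disk and the lines would come straight from the open input file.

```python
import sqlite3


def rows_from_lines(filename, lines):
    """Yield (filename, lineNumber, x, y, z) tuples, one per text line."""
    for line_number, line in enumerate(lines, start=1):
        x, y, z = line.split(' ')[:3]
        yield (filename, line_number, float(x), float(y), float(z))


def load_points(conn, filename, lines):
    # executemany consumes the generator lazily, so the whole file is never
    # held in memory; `with conn` commits on success and rolls back on error.
    with conn:
        conn.executemany("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)",
                         rows_from_lines(filename, lines))


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute("CREATE TABLE ptsdata (filename, lineNumber, x, y, z)")
sample = [
    "70700.642014 31207.277115 -0.054123 -1585 255 255 255\n",
    "70512.301468 31227.990799 -0.255600 -1655 155 158 158\n",
]
load_points(conn, "file.txt", sample)
cur = conn.execute(
    "SELECT lineNumber, x, y, z FROM ptsdata WHERE filename=? ORDER BY lineNumber ASC",
    ("file.txt",),
)
first_batch = cur.fetchmany(1)  # one row at a time here; use a larger n in practice
```

Passing values through `?` placeholders avoids quoting problems with file names and lets sqlite3 store the coordinates as numbers rather than text.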

Thanks Chris, the files are .PTS files for point cloud information. Each row represents a different point in space in Cartesian coordinates and this is the format we get the data from the supplier and what our software requires. — Tom_b, May 21 '13 at 12:21
In 3D space? Does the order of data matter? And how does your software use the data? — craastad, May 21 '13 at 12:37
@ChrisRaastad: Did Tom_b ask for help refactoring the system being used or improving the code that was provided? — Noctis Skytower, May 21 '13 at 14:24

score 2 · Answer 10 · answered May 21 '13 at 12:03

You can try saving the result of the split the first time you do it, instead of splitting again every time you need a field. This may speed things up.

You can also try not running it in the GUI. Run it from cmd instead.

Muetze


score 2 · Answer 11 · answered Nov 13 '13 at 01:14

Since you only mention saving space as a benefit, is there some reason you can't just store the files gzipped? That should save 70% and up on this data. Or consider getting NTFS to compress the files if random access is still important. You'll get much more dramatic savings on I/O time after either of those.

More importantly, where is your data that you're getting only 3.4GB/hr? That's down around USBv1 speeds.
