
So I have a problem. I am working with .txt files whose line counts are always a multiple of 4. I am working in Python 3.

I wrote code that takes every 2nd and 4th line of a text file and keeps only the first 20 characters of those two lines (while leaving the 1st and 3rd lines unedited), then creates a new file containing the edited 2nd and 4th lines and the unedited 1st and 3rd lines. This pattern repeats for the whole file, since every text file I work with has a line count that is a multiple of 4.

This works on small files (~100 lines total), but the files I need to edit are 50 million+ lines and it is taking 4+ hours.

Below is my code. Can anyone give me a suggestion on how to speed up my program? Thanks!

import io
import os
import sys

newData = ""
i=0
run=0
j=0
k=1
m=2
n=3
seqFile = open('temp100.txt', 'r')
seqData = seqFile.readlines()
while i < 14371315:
    sLine1 = seqData[j] 
    editLine2 = seqData[k]
    sLine3 = seqData[m]
    editLine4 = seqData[n]
    tempLine1 = editLine2[0:20]
    tempLine2 = editLine4[0:20]
    newLine1 = editLine2.replace(editLine2, tempLine1)
    newLine2 = editLine4.replace(editLine4, tempLine2)
    newData = newData + sLine1 + newLine1 + '\n' + sLine3 + newLine2
    if len(seqData[k]) > 20:
         newData += '\n'
    i=i+1
    run=run+1
    j=j+4
    k=k+4
    m=m+4
    n=n+4
    print(run)

seqFile.close()

new = open("new_100temp.txt", "w")
sys.stdout = new
print(newData)

4 Answers


You are holding both files (input and output) entirely in memory. That can cause time problems when the files are big (paging). Try (pseudocode):

Open input file for read
Open output file for write
Initialize counter to 1
While not EOF in input file
    Read input line
    If counter is odd 
        Write line to output file
    Else
        Write 20 first characters of line to output file
    Increment counter
Close files
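
A minimal Python sketch of that pseudocode (file names are placeholders, not from the question):

```python
def truncate_even_lines(in_path, out_path):
    """Stream the file line by line; never hold more than one line in memory."""
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for counter, line in enumerate(infile, start=1):
            if counter % 2 == 1:
                # 1st, 3rd, 5th, ... line: write unchanged
                outfile.write(line)
            else:
                # 2nd, 4th, ... line: keep only the first 20 characters,
                # re-adding the newline that slicing may have cut off
                outfile.write(line[:20].rstrip("\n") + "\n")
```

Because every 4-line group follows the same keep/truncate/keep/truncate pattern, counting lines modulo 2 is enough.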

The biggest issue here seems to be reading the whole file at once:

seqData = seqFile.readlines()

Instead, you should open your source file and output file first. Then iterate over the input file and manipulate the lines as you wish:

outfile = open('output.txt', 'w')
infile = open('input.txt', 'r')

i = 0
for line in infile:
    if i % 2 == 0:
        # 1st and 3rd line of each group of 4: keep as-is
        newline = line
    else:
        # 2nd and 4th line: keep the first 20 characters, preserving the newline
        newline = line[:20].rstrip('\n') + '\n'

    outfile.write(newline)
    i += 1

outfile.close()
infile.close()

It is probably much faster if you just read 4 lines at a time and process those (untested):

with open('100temp.txt') as in_file, open('new_100temp.txt', 'w') as out_file:
    for line1, line2, line3, line4 in grouper(in_file, 4):
         # modify 4 lines
         out_file.writelines([line1, line2, line3, line4])

where grouper(it, n) is a function that yields n items of an iterable it at a time. It is given as one of the recipes in the itertools module documentation (see also this answer on SO). Iterating over a file this way is similar to calling readlines() and then iterating over the resulting list manually, but it only reads a few lines into memory at a time.
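
For reference, the grouper recipe from the itertools documentation looks like this (using the Python 3 name zip_longest):

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks: grouper('ABCDEF', 3) -> ABC DEF."""
    # n references to the *same* iterator, so zip_longest pulls
    # n consecutive items for each tuple it yields
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
```

If the file's line count is truly always a multiple of 4, the fillvalue never appears; it only pads the final group of a short file.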


See the docs for the best way to read a file. Instead of keeping it all in memory, which is what you're doing with seqData = seqFile.readlines(), just iterate through. Python takes care of buffering and the like for you, so it is fast and efficient. Also, you shouldn't open and close files yourself (as the other answers do) -- use the with keyword.

lineCount = 0
with open("new_100temp.txt", "w") as newFile, open("100temp.txt", "r") as oldFile:
    for line in oldFile:
        # start on line 1; keep the 1st and 3rd of every 4 as-is, truncate the 2nd and 4th
        lineCount += 1
        if lineCount % 4 == 1 or lineCount % 4 == 3:
            newFile.write(line)
        else:
            # rstrip avoids writing a double newline when the line is shorter than 20 chars
            newFile.write(line[:20].rstrip("\n") + "\n")
        # printing is really slow, so only do it every 100th iteration:
        if lineCount % 100 == 0:
            print(lineCount)

I just tried it on one million lines of garbage text and it finished in under a second. As Kevin said though, simple text jobs like this are a good fit for the shell.
