I am running the following code in Python 3 to read a .txt file, edit every second line, and save the edited file. It works well for small files, but my files are ~2GB and it takes far too long.

Does anyone have suggestions on how to alter the code to make it faster and more efficient?

import sys

newData = ""
i=0
run=0
j=0
k=1
seqFile = open('temp100.txt', 'r')
seqData = seqFile.readlines()
while i < 14371315:
    sLine = seqData[j] 
    editLine = seqData[k]
    tempLine = editLine[0:20]
    newLine = editLine.replace(editLine, tempLine)
    newData = newData + sLine + newLine
    if len(seqData[k]) > 20:
        newData += '\n'
    i=i+1
    j=j+2
    k=k+2
    run=run+1
    print(run)

seqFile.close()

new = open("new_temp100.txt", "w")
sys.stdout = new
print(newData)
Tom Anonymous

1 Answer

I would suggest something like this:

# if python 2.x
#from itertools import tee, izip
# if python 3
from itertools import tee
# http://docs.python.org/2/library/itertools.html#recipes
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    # if python 2.x
    #return izip(a, b)
    return zip(a, b)

new_data = []
with open('temp100.txt', 'r') as seqFile:
    for sLine, edit_line in pairwise(seqFile):
        # I think this is just new_line = tempLine
        #tempLine = edit_line[:20]
        #new_line = edit_line.replace(edit_line, tempLine)
        new_data.append(sLine + edit_line[:20])
        # truncating a line longer than 20 characters drops its newline
        if len(edit_line) > 20:
            new_data.append('\n')



with open("new_temp100.txt", "w") as new:
    new.write(''.join(new_data))

You can probably do better if you stream directly to disk:

# if python 2.x
#from itertools import tee, izip
# if python 3
from itertools import tee
# http://docs.python.org/2/library/itertools.html#recipes
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    # if python 2.x
    #return izip(a, b)
    return zip(a, b)

with open('temp100.txt', 'r') as seqFile:
    with open("new_temp100.txt", "w") as new:
        for sLine, edit_line in pairwise(seqFile):
            tmp_str = sLine + edit_line[:20]
            # truncating a line longer than 20 characters drops its newline
            if len(edit_line) > 20:
                tmp_str = tmp_str + '\n'
            new.write(tmp_str)

That way you don't have to hold the full contents of the file in memory.
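Note that the `pairwise` recipe yields overlapping pairs, (s0,s1), (s1,s2), ..., so every line except the first and last is visited twice, whereas the loop in the question walks disjoint pairs (lines 0 and 1, then 2 and 3, and so on). A minimal sketch of the streaming idea for disjoint pairs follows. It assumes the goal is exactly what the question's loop does: copy the even-indexed lines unchanged and cut the odd-indexed lines to 20 characters. The function name `truncate_alternate_lines` and the `width` parameter are made up for illustration.

```python
def truncate_alternate_lines(src_path, dst_path, width=20):
    """Copy src_path to dst_path, truncating every second line to `width` characters.

    Lines are streamed one at a time, so memory use does not grow with file size.
    """
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for index, line in enumerate(src):
            # Odd indices (the 2nd, 4th, ... lines) are the ones to edit.
            if index % 2 == 1 and len(line) > width:
                # Slicing removes the trailing newline, so put it back.
                line = line[:width] + '\n'
            dst.write(line)
```

It could then be called as `truncate_alternate_lines('temp100.txt', 'new_temp100.txt')`. Because each line is written as soon as it is read, nothing is accumulated in a growing string or list.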

tacaswell
  • why open a file for reading add stuff to a string and then write it all to a file? You could do both at the same time by nesting the two open calls -- then you just write each resultant line at a time. Think this would be quicker. – Tim Diggins Oct 19 '13 at 21:19
  • @TimDiggins because that is what the OP does (by resetting `sys.stdout = new`) – tacaswell Oct 19 '13 at 21:20
  • hmm, interesting. But I am getting the following error: from itertools import izip ImportError: cannot import name izip – Tom Anonymous Oct 19 '13 at 22:00
  • @TomAnonymous See edits, you are using python 3 where `zip` behaves as `izip` from python2. I use python2, and forget these things. – tacaswell Oct 19 '13 at 22:05
  • Ok, I applied the edit and it fixed the Error issue. The output text is coming off out of order, but I can work on that myself. Thanks again! – Tom Anonymous Oct 19 '13 at 22:13
  • Sorry, I was having trouble understanding what you were doing in the loop body, I apparently simplified it incorrectly. – tacaswell Oct 19 '13 at 22:18