
I am currently using a C++ script with a Python wrapper to manipulate a large (15 GB) text file line by line. Effectively, it reads a line from input.txt, processes it, then writes the result to output.txt. I am using a straightforward loop (inp being input.txt opened for reading, out being output.txt opened for writing):

for line in inp:
    result = operate(line)
    out.write(result)

However, the C++ script fails occasionally, which causes the loop to stop after about ten million iterations. This leaves me with an output file covering only about 10% of the input.

Since I have no means of fixing the original script, I thought about just restarting it where it stopped. I counted the lines of output.txt, created a second output file called output2.txt, and ran the following code:

k = 0
for line in inp:
    if k < 12123253:
        k += 1
    else:
        result = operate(line)
        out2.write(result)
        k += 1

However, while counting the lines finished in under a minute, this method takes hours to reach the designated line.

Why is this method inefficient? Is there a faster one? I am on a Windows PC with plenty of computing power (72 GB RAM, good processors), using Python 2.7.

  • I think tell (to record where you were) and seek (to return to that point in your next run) could probably help you out. http://stackoverflow.com/questions/3299213/python-how-can-i-open-a-file-and-specify-the-offset-in-bytes – Jacques de Hooge Apr 13 '16 at 08:46

2 Answers


I suggest using itertools.islice to skip ahead to the starting line:

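import itertools

with open('input.txt') as f:
    # islice skips the first start_line lines, then yields the rest lazily
    result = itertools.islice(f, start_line, None)
    for line in result:
        # do something with this line, e.g. the operate() call from the question
        processed = operate(line)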
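
For completeness, here is a minimal self-contained sketch of the same idea that also appends the results to an output file. The function name `resume_from_line` and its parameters are hypothetical, and `operate` is passed in as an argument because the real one lives in the asker's wrapper. Note that `islice` still has to read every skipped line; it just does the skipping inside the iterator machinery instead of in a Python-level `if`.

```python
import itertools


def resume_from_line(input_path, output_path, start_line, operate):
    """Process input_path starting at 0-based line start_line.

    Results are appended to output_path. Returns the number of lines
    processed. All names here are illustrative, not from the question.
    """
    processed = 0
    with open(input_path) as inp, open(output_path, 'a') as out:
        # Skip the first start_line lines, then iterate over the rest.
        for line in itertools.islice(inp, start_line, None):
            out.write(operate(line))
            processed += 1
    return processed
```

With the numbers from the question, the call would look something like `resume_from_line('input.txt', 'output2.txt', 12123253, operate)`.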
Francesco

You can use file.seek and file.tell to record the position where processing failed and jump straight back to it on the next run. Below is some sample code:

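import os

BREAKPOINT_FILE = 'breakpoint.txt'  # hypothetical file that stores the saved offset

def serializebreakpoint(pos):
    '''store the offset of the line that failed'''
    with open(BREAKPOINT_FILE, 'w') as f:
        f.write(str(pos))

def deserializebreakpoint():
    '''return -1 if there is actually no break point'''
    if not os.path.exists(BREAKPOINT_FILE):
        return -1
    with open(BREAKPOINT_FILE) as f:
        return int(f.read())

def process(inp):
    # Read with readline() instead of "for line in inp": iterating over the
    # file object uses a read-ahead buffer, which makes tell() unreliable in
    # Python 2 and raises an error in Python 3.
    pos = inp.tell()
    for line in iter(inp.readline, ''):
        try:
            result = operate(line)
            pos = inp.tell()  # start of the next line, reached only on success
        except Exception:
            serializebreakpoint(pos)  # pos still points at the failing line
            raise

def processEntry(pathtoinput):
    bp = deserializebreakpoint()
    with open(pathtoinput, 'r') as inp:
        if bp > -1:
            inp.seek(bp)
        process(inp)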
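
If the failure kills the interpreter outright rather than raising a Python exception, the `except` branch never runs. A variation that may help in that case is to record the offsets periodically instead. The following is only a sketch under that assumption: `load_checkpoint`, `save_checkpoint`, `process_with_checkpoint`, the checkpoint file layout, and the `every` parameter are all hypothetical names, and `operate` is assumed to accept and return byte strings because the files are opened in binary mode (so that `tell()` values are plain byte offsets).

```python
import os


def load_checkpoint(path):
    """Return (input_offset, output_offset), or (0, 0) if nothing was saved."""
    if not os.path.exists(path):
        return 0, 0
    with open(path) as cp:
        inp_pos, out_pos = cp.read().split()
    return int(inp_pos), int(out_pos)


def save_checkpoint(path, inp_pos, out_pos):
    """Store both offsets as two integers separated by a space."""
    with open(path, 'w') as cp:
        cp.write('%d %d' % (inp_pos, out_pos))


def process_with_checkpoint(input_path, output_path, checkpoint_path,
                            operate, every=1000):
    """Resume from the last checkpoint and save a new one every `every` lines."""
    inp_pos, out_pos = load_checkpoint(checkpoint_path)
    # Open the output for update if it already exists, otherwise create it.
    out_mode = 'r+b' if os.path.exists(output_path) else 'wb'
    with open(input_path, 'rb') as inp, open(output_path, out_mode) as out:
        inp.seek(inp_pos)
        out.seek(out_pos)
        out.truncate()  # discard output written after the last checkpoint
        count = 0
        # readline() keeps tell() usable, unlike "for line in inp".
        for line in iter(inp.readline, b''):
            out.write(operate(line))
            count += 1
            if count % every == 0:
                out.flush()
                save_checkpoint(checkpoint_path, inp.tell(), out.tell())
        out.flush()
        save_checkpoint(checkpoint_path, inp.tell(), out.tell())
```

Saving the output offset together with the input offset lets a restart truncate any lines written after the last checkpoint, so no input line ends up in the output twice. A larger `every` means fewer checkpoint writes but more repeated work after a crash.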
Lei Shi