
This is a question about jumping to the line I want to start from and continuing from there in the shortest time possible.

I have a huge text file that I'm reading and processing line by line. I'm currently keeping track of the line number I have parsed so that, in case of a system crash, I know how far I got.

How do I resume reading the file from that point without starting over from the beginning?

import os

count = 0
all_parsed = os.listdir("urltextdir/")
with open(filename, "r") as readfile:
    for eachurl in readfile:
        if str(count) + ".txt" not in all_parsed:
            urltext = getURLText(eachurl)
            with open("urltextdir/" + str(count) + ".txt", "w") as writefile:
                writefile.write(urltext)
            result = processUrlText(urltext)
            saveinDB(result)
        count += 1

This is what I'm currently doing, but when it crashes around a million lines in, I have to go through all those lines in the file again just to reach the point I want to start from. My other alternative is to use readlines and load the entire file into memory.

Is there an alternative I can consider?

Kenstars
  • Possible duplicate of [Reading specific lines only (Python)](https://stackoverflow.com/questions/2081836/reading-specific-lines-only-python) – Edwin van Mierlo Apr 19 '18 at 10:02
  • Looked through that one, but couldn't find a solution that isn't already addressed in the question. Most of them seem to iterate from the first line, which doesn't solve the problem at this data scale. – Kenstars Apr 19 '18 at 10:04

3 Answers


Unfortunately, line number isn't really a native position for file objects, and the seek/tell functions are thrown off by the read-ahead that next() does, which is what your for loop calls. You can't jump to a line, but you can jump to a byte position. So one way would be:

line = readfile.readline()          #Must use `readline`, not the for loop, so tell() gives a usable position
while line:
    lastell = readfile.tell()       #This is the location of the imaginary cursor in the file after reading the line
    print(lastell)
    print(line)                     #Do with line what you would normally do
    line = readfile.readline()      #Advance to the next line

Now you can easily jump back with

readfile.seek(lastell) #You need to have kept the last lastell

You would need to keep saving lastell to a file (or printing it) so that on restart you know which byte to start from.

Unfortunately you can't use the files you are writing for this, since any change in the number of characters written would throw off a position calculated from them.

Here is one full implementation. Create a file called tell and put 0 inside of it, and then you can run:

with open('tell','r+') as tfd:
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline()))         #Get last position
        line = fd.readline()                 #Init loop
        while line:
            print(line.strip(),fd.tell())    #Action on line
            tfd.seek(0)                      #Rewind the tell file and
            tfd.write(str(fd.tell()))        #write new position only if successful
            line = fd.readline()             #Advance loop

You can check if such a file exists and create it in the program of course.
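
For example, a minimal sketch of that check, assuming the checkpoint file is called tell as above:

import os

if not os.path.exists('tell'):           #First run: create the checkpoint file
    with open('tell', 'w') as tfd:
        tfd.write('0')                   #Start from the beginning of the data file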

As @Edwin pointed out in the comments, you may want to call fd.flush() and os.fsync(fd.fileno()) (import os if that isn't clear) to make sure that after every write the file contents are actually on disk - this would apply to both write operations you are doing, the tell write being the quicker of the two of course. This may slow things down considerably for you, so if you are satisfied with the durability as is, do not use it, or only flush the tfd. You can also specify the buffer size when calling open so Python flushes to the OS more often, as detailed in https://stackoverflow.com/a/3168436/6881240.
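
For example, here is a minimal sketch of that idea; the helper name save_position is made up for illustration, and you would call it in place of the two tfd lines inside the loop above:

import os

def save_position(tfd, pos):
    #Persist the byte offset so a crash loses at most the line currently being processed
    tfd.seek(0)
    tfd.write(str(pos))
    tfd.flush()               #Flush Python's buffer to the OS
    os.fsync(tfd.fileno())    #Ask the OS to commit it to disk

Inside the loop you would then call save_position(tfd, fd.tell()) after each successfully processed line.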

kabanus
  • I'll try this out. Thanks a lot. – Kenstars Apr 19 '18 at 10:10
  • @Kenstars I'll flesh this out in a bit, I already see I made a mistake. – kabanus Apr 19 '18 at 10:21
  • as the OP is mentioning "system crash" this could mean that cache/buffers are also not written to disk. Hence I would suggest that you do a `tfd.flush()` and `os.fsync(tfd.fileno())` after `tfd.write(str(fd.tell()))` to ensure the "tell" is written to file in case of an unplanned system crash – Edwin van Mierlo Apr 19 '18 at 11:25

If I got it right, you could make a simple log file to store the count in.

But I would still recommend using many files, or storing every line or paragraph in a database like SQL or MongoDB.
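
For example, a minimal sketch of the database idea using the standard-library sqlite3 module; the file name progress.db and the single-row table progress are made up for illustration:

import sqlite3

conn = sqlite3.connect('progress.db')
conn.execute('CREATE TABLE IF NOT EXISTS progress (id INTEGER PRIMARY KEY, line INTEGER)')
conn.execute('INSERT OR IGNORE INTO progress (id, line) VALUES (1, 0)')
conn.commit()

def save_count(count):
    conn.execute('UPDATE progress SET line = ? WHERE id = 1', (count,))
    conn.commit()                        #Committed rows survive a crash

def load_count():
    return conn.execute('SELECT line FROM progress WHERE id = 1').fetchone()[0]

Note that this only makes the count durable; as the comment below points out, you would still have to get back to that line somehow, unless you also store the lines themselves (or their byte offsets) in the database.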

Peko Chan
  • Thanks @Peko Chan, and I am keeping track of the count. But I am not able to start from that line again. If the count is at 50000, I have to iterate from line 1 all the way to line 50000, checking every time whether I have reached my count, before I can proceed. – Kenstars Apr 19 '18 at 10:08
  • That's why I would consider using a database, it will be safer; then you would just have to query something like `line.add(query={'line':line_count+1}, update={$set:{'text':'new text entry'}})` – Peko Chan Apr 19 '18 at 10:19

I guess it depends on what system your script is running on, and what resources (such as memory) you have available.

But with the popular saying "memory is cheap", you can simply read the file into memory.

As a test, I created a file with 2 million lines, each line 1024 characters long with the following code:

ms = 'a' * 1024
with open('c:\\test\\2G.txt', 'w') as out:
    for _ in range(0, 2000000):
        out.write(ms+'\n')

This resulted in a 2 GB file on disk.

I then read the file into a list in memory, like so:

with open('c:\\test\\2G.txt', 'r') as f:
    my_file_as_list = f.readlines()

I checked the Python process, and it used a little over 2 GB of memory (on a 32 GB system). Access to the data was very fast, and can be done with list indexing and slicing.

You need to keep track of the index into the list; when your system crashes, you can start from that index again.
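
For example, a rough sketch of that idea; last_index.txt and process_line are made-up names for illustration:

def process_line(text):
    pass                                     # placeholder for your per-line work

with open('c:\\test\\2G.txt', 'r') as f:
    my_file_as_list = f.readlines()

try:
    with open('last_index.txt') as f:        # index saved by the previous run, if any
        start = int(f.read())
except FileNotFoundError:
    start = 0

for i in range(start, len(my_file_as_list)):
    process_line(my_file_as_list[i])
    with open('last_index.txt', 'w') as f:   # checkpoint the next index to resume from
        f.write(str(i + 1))

Rewriting the checkpoint on every line is slow, so in practice you would probably only write it every few thousand lines.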

But more importantly... if your system is "crashing", then you need to find out why it is crashing... surely a couple of million lines of data is not a reason to crash these days...

Edwin van Mierlo