I have a 15-16 GB file containing JSON objects separated by newlines (\n).

I am new to Python and am reading the file using the following code.

with open(filename,'rb') as file:
  for data in file:  
    dosomething(data)

If my script fails partway through, say after 5 GB, how can I resume reading from the last read position and continue from there?

I am trying to do this by using file.tell() to get the position and seek() to move the pointer.

Since this file contains JSON objects, I get the error below after the seek operation.

ValueError: No JSON object could be decoded

I assume that after the seek operation the pointer is not positioned at the start of a valid JSON object.

How can I solve this? Is there any other way to resume reading from the last read position in Python?

Lijo Abraham
  • Could you manually retrieve your current location from the "data" object somehow, and then save this index to a file and read it later. Not posting this as answer as I'm not sure! – dahui Apr 27 '16 at 09:36
  • The best thing to do would be to fix your script so it doesn't fail after 5GB. Regardless, the `tell()` and `seek()` combination should work. Update your question and show the code that does this and maybe we can fix it. – martineau Apr 27 '16 at 10:22
  • `with open(filename) as file: file.seek(last_position) for data in file: data = json.loads(data)` and **json.loads** gives me the error **ValueError: No JSON object could be decoded**. I read the file's current position using **file.tell()**. I made a hack using `if data.startswith('{')`, but I don't think this is good. – Lijo Abraham Apr 27 '16 at 10:39

2 Answers

Use another file to store the current location:

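# Keep the last saved offset in a side file.
cur_loc = open("location.txt", "w+")
cur_loc.write('0')
exception = False

i = 0

with open("test.txt", "r") as f:
    while True:
        i += 1
        if exception:
            # Recover: jump back to the offset saved by the handler below.
            cur_loc.seek(0)
            pos = int(cur_loc.readline())
            f.seek(pos)
            exception = False

        try:
            read = f.readline()
            print(read, end='')
            if i == 5:
                print("Exception Happened while reading file!")
                x = 1 / 0  # to make an exception
            # remove above if block and do everything you want here.
            if read == '':
                break
        except Exception:
            exception = True
            # f.tell() is already past the line that failed, so that line is not retried.
            cur_loc.seek(0)
            cur_loc.truncate()  # drop stale digits from a longer, older offset
            cur_loc.write(str(f.tell()))

cur_loc.close()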

Let's assume we have the following test.txt as the input file:

#contents of test.txt
1
2
3
4
5
6
7
8
9
10

When you run the above program, you will get:

>>> ================================ RESTART ================================
>>> 
1
2
3
4
5
Exception Happened while reading file!
6
7
8
9
10 
>>> 
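For the JSON-lines case in the question, the same idea can be sketched so that the saved offset always falls on a line boundary. This is only a minimal sketch under some assumptions: the file is opened in binary mode and read with readline(), because calling tell() inside a `for line in file` loop is unreliable (Python 2 reads ahead into a buffer, and Python 3 text mode refuses with an OSError). The names `load_offset`, `save_offset`, `process_jsonl`, and the checkpoint file are hypothetical, not part of the answer above.

```python
import json
import os


def load_offset(checkpoint_path):
    """Return the saved byte offset, or 0 if there is no checkpoint yet."""
    try:
        with open(checkpoint_path) as cp:
            return int(cp.read().strip() or 0)
    except FileNotFoundError:
        return 0


def save_offset(checkpoint_path, offset):
    """Write the offset to a temp file, then swap it in with os.replace."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as cp:
        cp.write(str(offset))
    os.replace(tmp_path, checkpoint_path)


def process_jsonl(data_path, checkpoint_path, handle):
    """Call handle(obj) for each JSON line, resuming from the checkpoint."""
    with open(data_path, "rb") as f:
        f.seek(load_offset(checkpoint_path))
        while True:
            line = f.readline()
            if not line:
                break  # end of file
            if line.strip():
                # If this raises, the checkpoint still points at this line,
                # so the next run retries it from its first byte.
                handle(json.loads(line.decode("utf-8")))
            # f.tell() now sits at the start of the next line.
            save_offset(checkpoint_path, f.tell())
```

Calling `process_jsonl("data.jsonl", "pos.txt", dosomething)` again after a crash would pick up at the line that failed. Writing the checkpoint after every line is slow on a 15-16 GB file; saving it every few thousand lines is a reasonable trade-off if reprocessing a few records is acceptable.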
EbraHim
  • This code runs in an infinite loop. `while(True): with open(filename,"r") as f: try: cur_loc.seek(0) pos = int(cur_loc.readline()) f.seek(int(pos)) read = f.readline() print(f.tell()) if not read: break else: dataindex(read) except: cur_loc.seek(0) cur_loc.write(str(f.tell()))`. Please help. – Lijo Abraham Apr 27 '16 at 10:46
  • @LijoAbraham Which code? Mine or yours in the comment? – EbraHim Apr 27 '16 at 10:47
  • Your code. It always repeats the same line and never moves on to the next one. – Lijo Abraham Apr 27 '16 at 10:56
  • @LijoAbraham Corrected. – EbraHim Apr 27 '16 at 14:00

You can use `for i, line in enumerate(opened_file)` to get the line numbers and keep track of that counter. When your script fails, you can display the counter to the user. You will then need an optional command-line argument for it: if the argument is given, your script calls `opened_file.readline()` that many times before processing anything, which brings you back to the point where you left off:

for i in range(passed_variable):
    opened_file.readline()
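
A compact way to combine the skipping and the counting is sketched below. It assumes the number of lines to skip has already been parsed from the command line; `resume_lines` and `start_line` are hypothetical names used only for illustration.

```python
from itertools import islice


def resume_lines(opened_file, start_line=0):
    """Return (line_number, line) pairs, skipping the first start_line lines.

    Line numbers keep counting from start_line, so the value reported at
    a failure can be passed straight back in on the next run.
    """
    remaining = islice(opened_file, start_line, None)
    return enumerate(remaining, start=start_line)
```

It would be used as `for i, line in resume_lines(opened_file, passed_variable): ...`. Note that skipping by line number still reads every skipped line from disk, so on a very large file a saved byte offset with seek() resumes faster.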
Sven Hakvoort
  • I am using it like this: "data.startswith('{')". Is this fine? – Lijo Abraham Apr 27 '16 at 09:40
  • Yes, you can also use that when your program knows what the last position was. I don't know what kind of data it is, but in some cases numbers are easier. It depends a bit on what you prefer. – Sven Hakvoort Apr 27 '16 at 09:42