
I'm writing a Python script to read a file. When I arrive at a certain section of the file, the right way to read its lines depends on information that is also given within that section. I found here that I could use something like

fp = open('myfile')
last_pos = fp.tell()
line = fp.readline()
while line != '':
    if line == 'SPECIAL':
        fp.seek(last_pos)
        other_function(fp)
        break
    last_pos = fp.tell()
    line = fp.readline()

Yet, the structure of my current code is something like the following:

import itertools

fh = open(filename)

# get generator function and attach None at the end to stop iteration
items = itertools.chain(((lino, line) for lino, line in enumerate(fh, start=1)), (None,))
item = True

while item:
    lino, line = next(items)

    # handle special section
    if line.startswith('SPECIAL'):

        start = fh.tell()

        for i in range(specialLines):
            lino, eline = next(items)
            # etc. get the special data I need here

        # try to set the pointer to start to reread the special section
        fh.seek(start)

        # then reread the special section

    # then reread the special section

But this approach gives the following error:

telling position disabled by next() call

Is there a way to prevent this?

aaragon

2 Answers


Using the file as an iterator (such as calling next() on it or using it in a for loop) uses an internal buffer; the actual file read position is further along in the file, and .tell() will not give you the position of the next line to be yielded.
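A minimal sketch reproducing the error (the file contents here are illustrative): once next() has been called on a text-mode file object in Python 3, tell() raises an OSError.

```python
import os
import tempfile

# Create a small throwaway file to demonstrate with.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as tmp:
    tmp.write("first\nsecond\nSPECIAL\n")
    path = tmp.name

err = None
with open(path) as fh:
    next(fh)            # use the file object as an iterator
    try:
        fh.tell()       # OSError: telling position disabled by next() call
    except OSError as exc:
        err = exc

os.remove(path)
print(err)
```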

If you need to seek back and forth, the solution is not to use next() directly on the file object but to use file.readline() only. You can still use an iterator for that: use the two-argument form of iter():

fileobj = open(filename)
fh = iter(fileobj.readline, '')

Calling next() on fh will invoke fileobj.readline() until that function returns an empty string. In effect, this creates a file iterator that doesn't use the internal buffer.

Demo:

>>> fh = open('example.txt')
>>> fhiter = iter(fh.readline, '')
>>> next(fhiter)
'foo spam eggs\n'
>>> fh.tell()
14
>>> fh.seek(0)
0
>>> next(fhiter)
'foo spam eggs\n'

Note that your enumerate chain can be simplified to:

items = itertools.chain(enumerate(fh, start=1), (None,))

although I am in the dark why you think a (None,) sentinel is needed here; StopIteration will still be raised, albeit one more next() call later.
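A sketch of that alternative (io.StringIO stands in for a real file here): rather than chaining a (None,) sentinel, pass a default to next(); exhaustion then returns None instead of raising StopIteration.

```python
import io

fh = io.StringIO("alpha\nbeta\n")
items = enumerate(iter(fh.readline, ''), start=1)

first = next(items, None)   # (1, 'alpha\n')
second = next(items, None)  # (2, 'beta\n')
third = next(items, None)   # None -- the iterator is exhausted, no exception
print(first, second, third)
```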

To read specialLines count lines, use itertools.islice():

for lino, eline in islice(items, specialLines):
    # etc. get the special data I need here

You can just loop directly over fh instead of using an infinite loop and next() calls here too:

with open(filename) as fh:
    enumerated = enumerate(iter(fh.readline, ''), start=1)
    for lino, line in enumerated:
        # handle special section
        if line.startswith('SPECIAL'):
            start = fh.tell()

            for lino, eline in islice(enumerated, specialLines):
                pass  # etc. get the special data I need here

            fh.seek(start)

but do note that your line numbers will still increment even when you seek back!
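A small sketch of that caveat (again using io.StringIO as a stand-in file): the enumerate counter keeps advancing even after a seek, so line numbers drift from actual file lines once you rewind.

```python
import io

fh = io.StringIO("a\nb\nc\n")
enumerated = enumerate(iter(fh.readline, ''), start=1)

next(enumerated)               # (1, 'a\n')
pos = fh.tell()                # position just before 'b\n'
next(enumerated)               # (2, 'b\n')
fh.seek(pos)                   # rewind to before 'b\n'
lino, line = next(enumerated)  # re-reads 'b\n', but the counter is now 3
print(lino, repr(line))
```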

You probably want to refactor your code to not need to re-read sections of your file, however.

Martijn Pieters
  • Thanks @Martijn. What happens with the enumerator call for getting the line number as well? – aaragon Mar 27 '14 at 13:12
  • @AlejandroMarcosAragon: Your use of `chain()` there is a little.. weird, but it'll work. – Martijn Pieters Mar 27 '14 at 13:14
  • I had to add None at the end, otherwise I had a `StopIteration` exception when I reached the end of the file. I still can't make the `iter` take the enumerate for the lino. – aaragon Mar 27 '14 at 13:19
  • @AlejandroMarcosAragon: You can ask `next()` to return `None` instead when you reach the end; `next(items, None)`. You appear to have an off-by-one error, as all you did is postpone the `StopIteration` by one more call. – Martijn Pieters Mar 27 '14 at 13:21
  • Alright, this is what I have so far. I tried: `fit = enumerate(fh, start=1) try: while True: lino, line = next(fit, None)` but this gives me a `TypeError: 'NoneType' object is not iterable`. So I enclosed the call to next in a try to break the loop. Can I improve it further? – aaragon Mar 27 '14 at 13:31
  • I'm still getting the error `telling position disabled by next() call` making the changes and I'm now using an iterator. – aaragon Mar 27 '14 at 13:42
  • @AlejandroMarcosAragon: make sure you are not calling `next()` on the actual file handle somewhere still, *only* use the `iter(fileobj.readline, '')` object to call `next()` on, and call `tell()` on the unwrapped file object. – Martijn Pieters Mar 27 '14 at 13:44
  • In the input file, every line in the special section defines an object of certain type. So I need to know by the end of the special section, how many objects of which type I have in order to create a dictionary so that I can construct later numpy arrays of the right size. – aaragon Mar 27 '14 at 13:51
  • @AlejandroMarcosAragon: Then why not track that in a counter somewhere, and / or store the objects parsed so far? – Martijn Pieters Mar 27 '14 at 13:54
  • But isn't that more inefficient? A huge file can have millions of these lines. – aaragon Mar 27 '14 at 14:04
  • It sounds as if you are *already* storing the information in memory somewhere. – Martijn Pieters Mar 27 '14 at 14:32
  • I will do the test, I'll create a very big input file and I'll try to read it both ways to check the difference in speed and I'll let you know. – aaragon Mar 27 '14 at 14:51
  • So I did the test with both cases; in the other case I use io.StringIO to write the file that I already read. It turns out the tell/seek approach is faster (1m16.911s compared to 1m28.933s for 3973781 of those lines). – aaragon Mar 27 '14 at 17:12
  • I didn't mean for you to use an in-memory file object. It isn't clear to me why you need to parse the same section of text twice each time still. – Martijn Pieters Mar 27 '14 at 17:19
  • I'll explain better. That section contains as many lines as elements (objects that need to be created). But information about these objects is stored in numpy arrays. The moment I start parsing the first time, I have no idea how many types of different elements I have, nor how many elements of each type, so I can't create the numpy arrays in advance. By the end of the first pass I have this info, so I create the arrays then. In the second pass I assign the data of the elements to the arrays. – aaragon Mar 27 '14 at 17:38
  • @AlejandroMarcosAragon: and you cannot put the data you'll put in the arrays, into Python lists first? – Martijn Pieters Mar 27 '14 at 17:41
  • I could, but would that be more efficient than what I did with a StringIO? – aaragon Mar 27 '14 at 17:45
  • I don't know, I don't know your data. – Martijn Pieters Mar 27 '14 at 17:45
  • Less looping, less I/O. – Martijn Pieters Mar 27 '14 at 18:26

I'm not an expert with Python 3, but it seems like you're reading using a generator that yields lines read from the file. That means you can only move through the file in one direction.

You'll have to use another approach.
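One such alternative approach (sketched with a hypothetical section format using an END marker, and io.StringIO standing in for the file) is to buffer the section's lines on the first pass and process the in-memory copy instead of seeking back:

```python
import io

fh = io.StringIO("SPECIAL\nobj A\nobj B\nEND\n")
buffered = []
for line in fh:
    if line.startswith('SPECIAL'):
        for sline in fh:
            if sline.startswith('END'):
                break
            buffered.append(sline)                 # first pass: collect the section
        processed = [s.strip() for s in buffered]  # second pass: reread from memory
print(processed)
```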

Andrew Dunai