4

I’m iterating through a file’s lines with enumerate(), and sometimes need to start iterating at a specific line, so I tried test_file.seek(). For example, if I want to start iterating the file again at line 10, I call test_file.seek(10):

test_file.seek(10)

for i, line in enumerate(test_file):
    …

Yet the loop always starts iterating at the very first line, line 0. What could I be doing wrong? Why isn’t seek() working? Any better implementations would be appreciated as well.

Thank you in advance and will be sure to upvote/accept answer

martineau
Jo Ko
  • 3
    Doesn't `seek(10)` go to the 10th byte in your file? – Eric Duminil Apr 11 '17 at 20:37
  • Have you read [the docs](https://docs.python.org/3.4/library/io.html#io.TextIOBase.seek) for the seek method? – iafisher Apr 11 '17 at 20:37
  • I think it would be wise to mention that you are concerned with efficiency in particular. that way you might be more likely to get answers talking about linecache / islice which i think is the fastest options. – axwr Apr 11 '17 at 20:41
  • Possible duplicate of [Python fastest access to line in file](http://stackoverflow.com/questions/19189961/python-fastest-access-to-line-in-file) – Mad Physicist Apr 11 '17 at 21:15

5 Answers

7

Ordinary files are just sequences of bytes, both at the file system level and as far as Python is concerned; there's no low-level way to jump to a particular line. The seek() method counts its offset in bytes, not lines. (In principle, an explicit seek offset should only be used if the file was opened in binary mode. Seeking to an arbitrary offset in a text file is "undefined behavior", since a logical character can take more than one byte.)
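To see the byte-versus-line distinction concretely, here is a minimal sketch. The temporary file and its contents are made up for illustration, and the file is opened in binary mode so that offsets are plain byte counts.

```python
import os
import tempfile

# Hypothetical sample data: 20 lines, "line 0" .. "line 19", 7 bytes each.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"line {n}\n" for n in range(20)).encode("ascii"))
    path = tmp.name

try:
    with open(path, "rb") as test_file:
        test_file.seek(10)                   # moves to byte 10, not line 10
        position = test_file.tell()
        first_chunk = test_file.readline()   # rest of whatever line byte 10 falls in
finally:
    os.remove(path)

print(position, first_chunk)
```

Byte 10 lands in the middle of the second line, so the first read returns only the tail of that line rather than the start of line 10.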

Your only option if you want to skip a number of lines is to read and discard them. Since iterating over a file object fetches it one line at a time, a compact way to get your code to work is with itertools.islice():

from itertools import islice

skipped = islice(test_file, 10, None)  # Skip 10 lines, i.e. start at index 10
for i, line in enumerate(skipped, 11):  # 1-based numbering: the first line yielded is line 11
    print(i, line, end="")
    ...
alexis
3

A native Python way of doing this would be to use zip to iterate over (and discard) the unnecessary lines.

with open("text.txt","r") as test_file:
    for _ in zip(range(10), test_file): pass
    for i, line in enumerate(test_file,start=10):
        print(i, line)
Neil
2

Personally, I would just use an if statement. It is rudimentary, perhaps, but at least it is very easy to understand.

with open("file") as fp:
    for i, line in enumerate(fp):
        if i >= 10:
            pass  # do stuff.

Edit: regarding islice, the comparisons done in "Python fastest access to line in file" are better than anything I am capable of. Combined with the itertools manual (https://docs.python.org/2/library/itertools.html), I doubt you'd need much more.

axwr
  • But I would prefer seek() for optimization, so that it wouldn't need to iterate through unnecessary lines – Jo Ko Apr 11 '17 at 20:36
  • @JoKo Ah, if efficiency is a consideration then I would recommend itertools.islice. Then you don't even need to load the used lines into memory. – axwr Apr 11 '17 at 20:37
  • Do you mind showing an example with `itertools.islice` as well? – Jo Ko Apr 11 '17 at 20:41
  • 2
    @Jo Ko: You can't. A line is defined by certain characters, and SOMETHING has to read them to know where they are, unless you've built an external index for your file. – Max Apr 11 '17 at 20:41
  • Yeah, `islice` doesn't prevent each line from being loaded into memory; it just lazily iterates over lines, which this solution does too. But `islice` is for taking *slices*. – juanpa.arrivillaga Apr 11 '17 at 20:52
  • @JoKo. I have a suggestion using the low-level `read` method in my answer. I don't think it's *much* more efficient than just iterating, but you might like it anyway. – Mad Physicist Apr 11 '17 at 21:04
2

The only way the seek method is going to help you is if all the lines in the file are the same length, you know that length ahead of time, and your file is either binary or at least ASCII-only text (i.e. no variable-width characters allowed). Then you really could do

import os
test_file.seek(10 * (length_of_line + 1), os.SEEK_SET)

This is because seek moves the internal file pointer by a fixed number of bytes, not lines. The +1 above accounts for the newline character. You would likely have to make it +2 for a file that uses Windows-style \r\n line terminators.

This will not work if your file is non-ASCII, because lines that are the same length in characters may contain different numbers of bytes, making the call to seek yield undefined results.
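As a rough illustration of the fixed-length case, the sketch below uses an in-memory io.BytesIO object as a stand-in for a file opened in binary mode. The record format (7 ASCII characters plus a newline) and the LINE_LENGTH name are assumptions made for this example.

```python
import io
import os

LINE_LENGTH = 7  # assumed payload length per line, excluding the newline

# io.BytesIO stands in for a file opened with open(..., "rb").
test_file = io.BytesIO(
    b"".join(f"row{n:04d}\n".encode("ascii") for n in range(30))
)

# Every line occupies LINE_LENGTH + 1 bytes, so line 10 starts at byte 80.
test_file.seek(10 * (LINE_LENGTH + 1), os.SEEK_SET)
lines = [line.decode("ascii").rstrip("\n") for line in test_file]
print(lines[0])
```

The arithmetic only holds while every line really has the same byte length; a single longer or shorter line makes the seek land mid-line.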

There are a few legitimate ways you can skip the first 10 lines:

  1. Read the whole file into a list and discard the first 10 lines:

    with open(...) as test_file:
        test_data = list(test_file)[10:]
    

    Now test_data contains all the lines in your file besides the first 10.

  2. Discard lines from the file as you read it in a for loop using enumerate:

    with open(...) as test_file:
        for lineno, line in enumerate(test_file):
            if lineno < 10:
                continue
            # Do something with the line
    

    This method has the advantage of not storing the unnecessary lines. This is functionally similar to using itertools.islice as some of the other answers suggest.

  3. Use some really arcane low-level stuff to actually read 10 newline characters from the file before proceeding normally. You may have to specify the encoding of the file up-front for this to work correctly with text I/O, but it should work out-of-the-box for ASCII files (see here for more details):

    newline_count = 10
    with open(..., encoding='utf-8') as test_file:
        while newline_count > 0:
            next_char = test_file.read(1)
            if next_char == '\n':
                newline_count -= 1
        # You have skipped 10 lines, so process normally here.
    

    This option is not particularly robust. It does not handle the case where there are fewer than 10 lines gracefully: read(1) returns an empty string at end of file, so the loop above never terminates. It also re-implements the internal machinery of the built-in file iterator very crudely. The only possible advantage it offers is that it does not buffer entire lines like the iterator does.
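If the short-file case matters, one possible sketch of a skip that stops cleanly at end of file is shown below. The skip_lines helper is a hypothetical name, not something from the standard library, and io.StringIO stands in for a file opened in text mode.

```python
import io
from itertools import islice

def skip_lines(file_obj, count):
    """Consume up to `count` lines from file_obj; return how many were skipped."""
    skipped = 0
    for _ in islice(file_obj, count):  # stops early if the file runs out
        skipped += 1
    return skipped

# io.StringIO stands in for files opened in text mode.
short_file = io.StringIO("a\nb\nc\n")
long_file = io.StringIO("".join(f"{n}\n" for n in range(15)))

skipped_short = skip_lines(short_file, 10)  # only 3 lines available
skipped_long = skip_lines(long_file, 10)
remaining = [line.rstrip("\n") for line in long_file]
print(skipped_short, skipped_long, remaining)
```

The return value lets the caller tell a full skip from a file that ended early.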

Mad Physicist
  • Unless it is a binary file, `test_file.seek(10 * (length_of_line + 1))` is undefined. From the Python docs: "offset must either be a number returned by `TextIOBase.tell()`, or zero. Any other offset value produces undefined behaviour." – iafisher Apr 11 '17 at 20:59
  • @iafisher. Good catch. Fixed. – Mad Physicist Apr 11 '17 at 21:00
  • I think it is still wrong. The `whence` argument (the second one) defaulted to `os.SEEK_SET` anyway; the problem is that the `offset` argument (the first one), can only be 0 or a value returned by a call to `tell`. This is the same restriction as [in C's fseek function](http://en.cppreference.com/w/c/io/fseek). – iafisher Apr 11 '17 at 21:05
  • @iafisher. You are right. I think that the problem arises for non-ascii text files because even low-level functions like `read(1)` can return a multi-byte character as a single unit. I will add a notation for that similar to the one I did in item #3. – Mad Physicist Apr 11 '17 at 21:06
  • @iafisher. Let me know if you approve of the latest edit. I think it corrects the issue you noticed. – Mad Physicist Apr 11 '17 at 21:09
1

You can't use seek() to get to the beginning of a particular line unless you know the byte offset of the first character of the desired line.
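If the same file is read many times, those byte offsets can be collected once and reused. The following is only a sketch under that assumption: build_line_index is a made-up helper name, and io.BytesIO stands in for a file opened in binary mode. Building the index still requires one full pass over the file.

```python
import io

def build_line_index(binary_file):
    """Return a list where entry k is the byte offset at which line k starts."""
    offsets = []
    position = binary_file.tell()
    # readline() returns b"" at end of file, which ends the loop.
    for line in iter(binary_file.readline, b""):
        offsets.append(position)
        position += len(line)
    return offsets

# io.BytesIO stands in for a file opened with open(..., "rb").
test_file = io.BytesIO(b"".join(f"{n}\n".encode("ascii") for n in range(1, 16)))
test_file.seek(0)
index = build_line_index(test_file)  # one full pass, done once

test_file.seek(index[9])             # jump straight to the 10th line (index 9)
tenth_line = test_file.readline()
print(index[9], tenth_line)
```

Once the index exists, each later jump is a single seek(); the index must be rebuilt whenever the file's contents change.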

One simple way to do it would be to use the islice() iterator in the itertools module.

For example, say you had a very big input file that looked like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
...

Sample code:

from __future__ import print_function
from itertools import islice

with open('test_file.txt') as test_file:
    for i, line in enumerate(islice(test_file, 9, None), 10):
        print('line #{}: {}'.format(i, line), end='')

Output:

line #10: 10
line #11: 11
line #12: 12
line #13: 13
line #14: 14
line #15: 15
line #16: 16
line #17: 17
line #18: 18
line #19: 19
line #20: 20
line #21: 21
line #22: 22
...

Note that islice() counts from zero, which is why its first argument was 9 and not 10. Also, this is not as fast as seek() would be, because islice() actually reads all the lines until it gets to the one where you want to start.

martineau