
I'm trying to "map" a very large ASCII file. Basically, I read lines until I find a certain tag, and then I want to know the position of that tag so that I can seek to it again later to pull out the associated data.

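from itertools import dropwhile

with open(datafile) as fin:
    # Skip lines until the first one that starts with the tag.
    ifin = dropwhile(lambda x: not x.startswith('Foo'), fin)
    header = next(ifin)
    # Intended: the file offset just past the header line.
    position = fin.tell()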

Now this tell() doesn't give me the right position. This question has been asked in various forms before. The reason is presumably that Python is buffering the file object, so Python is telling me where its file pointer is, not where my file pointer is. I don't want to turn off this buffering, because performance is important here. However, it would be nice to know if there is a way to determine how many bytes Python chooses to buffer. In my actual application, as long as I'm close to the lines which start with Foo, it doesn't matter; I can drop a few lines here and there. So, what I'm actually planning on doing is something like:

position = fin.tell() - buffer_size(fin)

Is there any way to go about finding the buffer size?

mgilson
  • Rather than using ftell() here, I would total up the lengths of the lines you're skipping. – Russell Borogove Apr 12 '13 at 20:27
  • @RussellBorogove -- That's a reasonable approach which I had originally thought about, but the downside is that then I'd need to assume that nothing had been read from `fin`. In reality, I expect to call this from a function which receives `fin` as an input parameter. – mgilson Apr 13 '13 at 01:36
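The line-length approach from these comments can be sketched so that it does not assume the file is at offset 0. This is only a sketch under assumptions that are not in the original post: Python 3, a file opened in binary mode (where `tell()` reports the logical position even after earlier reads), and a hypothetical helper name `find_tag_offset`.

```python
import os
import tempfile


def find_tag_offset(fin, tag=b"Foo"):
    """Return the byte offset of the first line starting with `tag`.

    Scanning starts at fin's current position, so the caller may already
    have read part of the file. `fin` must be opened in binary mode.
    Returns None if no line starts with the tag.
    """
    offset = fin.tell()          # starting point, wherever the caller left off
    for line in fin:
        if line.startswith(tag):
            return offset        # offset of the start of the tagged line
        offset += len(line)      # len() of bytes is the byte count
    return None


# Small demonstration on a throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"skip me\nalso skipped\nFoo header\npayload\n")
    demo_path = tmp.name

with open(demo_path, "rb") as fin:
    fin.readline()                    # simulate a caller that already read a line
    position = find_tag_offset(fin)
    fin.seek(position)                # jump back to the tagged line later
    header = fin.readline()

os.remove(demo_path)
```

Because the running total starts from `fin.tell()` rather than from zero, the function can receive `fin` as a parameter after other code has consumed part of it, which was the concern raised above.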

1 Answer


To me, it looks like the buffer size is hard-coded in CPython to be 8192 bytes. As far as I can tell, there is no way to get this number from the Python interface other than to read a single line when you open the file, call f.tell() to see how much data Python actually read, and then seek back to the start of the file before continuing.

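from itertools import dropwhile

with open(datafile) as fin:
    # Read one line, then see how far the file pointer actually advanced.
    next(fin)
    bufsize = fin.tell()
    fin.seek(0)  # rewind before the real scan

    # Skip lines until the first one that starts with the tag.
    ifin = dropwhile(lambda x: not x.startswith('Foo'), fin)
    header = next(ifin)
    position = fin.tell()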

Of course, this fails if the first line is longer than 8192 bytes, but that's of no real consequence for my application.
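For what it's worth, the same probe can be written against Python 3's `io` stack, where a text-mode `tell()` after `next()` raises `OSError` rather than returning the read-ahead position. The sketch below assumes Python 3 and binary mode, and uses the `raw` attribute of the buffered reader to see how far the underlying file has really been read. The helper name `probe_read_ahead` and the test file contents are made up for illustration.

```python
import os
import tempfile


def probe_read_ahead(path):
    """Return (logical, physical) offsets after reading one line.

    logical  -- where the next readline() would start
    physical -- how far the underlying unbuffered file has been consumed
    """
    with open(path, "rb") as fin:
        fin.readline()
        logical = fin.tell()
        physical = fin.raw.tell()
    return logical, physical


# Throwaway file, large enough that one line is only a small part of it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"short line\n" * 200_000)
    demo_path = tmp.name

logical, physical = probe_read_ahead(demo_path)
os.remove(demo_path)

# The difference is the amount of data read ahead into the buffer.
print("logical:", logical, "read ahead:", physical - logical)
```

The read-ahead amount printed here depends on the Python version and the filesystem, so it should be measured rather than assumed to be 8192.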

mgilson
  • I see that `open` takes an optional `buffering` argument that sets the buffer size for the file. Do you know how that relates to this hardcoded buffer size? Are they different buffers or something? – Emily Apr 13 '13 at 14:23
  • @Emily -- Good question. I'm not actually sure. Maybe I need someone who knows the C source better than I do to take a look ... – mgilson Apr 13 '13 at 17:37
  • Well now I'm curious, so I started a new question: http://stackoverflow.com/questions/15991702/what-is-the-difference-between-the-buffering-argument-to-open-and-the-hardcode Will keep poking around the source in the meantime... – Emily Apr 13 '13 at 19:02
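One way to poke at the `buffering` question from Python itself, at least for Python 3's `io`-based files: open the same file with two different `buffering` values and compare how far the underlying raw file has advanced after reading a single byte. This is only a sketch. It says nothing about the hard-coded 8192 mentioned in the answer, and the helper name `raw_offset_after_one_byte` and the file contents are assumptions for illustration.

```python
import io
import os
import tempfile


def raw_offset_after_one_byte(path, buffering):
    """Read one byte through a buffered reader and report the raw offset."""
    with open(path, "rb", buffering=buffering) as fin:
        fin.read(1)
        return fin.raw.tell()   # position of the unbuffered file underneath


# Throwaway file, comfortably larger than both buffer sizes tried below.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1000)
    demo_path = tmp.name

small = raw_offset_after_one_byte(demo_path, buffering=16)
large = raw_offset_after_one_byte(demo_path, buffering=64)
os.remove(demo_path)

# The raw offset tracks the requested buffer size, not the single byte read.
print("buffering=16 ->", small, "| buffering=64 ->", large)
# The io module's fallback buffer size, used when no size is requested.
print("io.DEFAULT_BUFFER_SIZE =", io.DEFAULT_BUFFER_SIZE)
```

If the two raw offsets differ, the `buffering` argument is controlling the read-ahead for that file object, which suggests that in Python 3 it is this buffer, not a separate fixed one, that makes the raw position run ahead.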