
A lexer-optimization technique is mentioned in two books: Engineering a Compiler by K. Cooper et al. and Crafting a Compiler by C. Fischer et al. Here is the excerpt from the first book (page 69):

While character-by-character I/O leads to clean algorithmic formulations, the overhead of a procedure call per character is significant relative to the cost of simulating the DFA in either a table-driven or a direct-coded scanner. To reduce the I/O cost per character, the compiler writer can use buffered I/O, where each read operation returns a longer string of characters, or buffer, and the scanner then indexes through the buffer. The scanner maintains a pointer into the buffer...
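For concreteness, here is a rough sketch of the scheme the excerpt describes, in C++ (the class and method names are mine, not the book's): each refill fetches a whole chunk, and the scanner then advances an index instead of making one stream call per character.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Sketch of a chunked input buffer for a scanner. Each refill reads a
// whole chunk; next() usually just advances an index into that chunk.
class CharBuffer {
public:
    explicit CharBuffer(std::istream& in, std::size_t cap = 4096)
        : in_(in), buf_(cap), pos_(0), end_(0) {}

    static constexpr int kEof = -1;

    // Returns the next character, refilling only when the index has
    // reached the end of the previously read chunk.
    int next() {
        if (pos_ == end_ && !refill()) return kEof;
        return static_cast<unsigned char>(buf_[pos_++]);
    }

private:
    bool refill() {
        in_.read(buf_.data(), static_cast<std::streamsize>(buf_.size()));
        end_ = static_cast<std::size_t>(in_.gcount());
        pos_ = 0;
        return end_ > 0;
    }

    std::istream& in_;
    std::vector<char> buf_;
    std::size_t pos_, end_;
};
```

A table-driven or direct-coded scanner would call `next()` in its inner loop; the cost of each refill is amortized over the whole chunk.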

My question is, what is the significance of this technique? Now that memory buffering is often already implemented by the Operating System, why did the authors suggest we implement a buffer? (Also, standard libraries provided by high-level languages usually have a buffer maintained with the filestream handling routines, like C++'s std::ifstream).

I know that in some applications, like database systems, a custom buffer mechanism is not only desirable (the application knows more about its access patterns) but sometimes necessary (recovery, logging). Do similar reasons apply to lexers in compilers? If so, how?

EDIT: Here is a similar question: link, but I'd like to know more about the arguments for a custom buffer (like the arguments supporting a buffer in database systems), if there are any.

Another post here compared manual buffering with std::fstream buffering in C++.

Leedehai
  • *All* applications use buffers. It isn't confined to compiler writers. Operating systems have been buffering since about 1955. Doesn't change anything. The overhead of a system call per character is far too high to pay. – user207421 Apr 26 '18 at 03:23
  • @EJP Yes. But does having a custom buffer really speed up a lexer, now that the OS and language libraries already have it? – Leedehai Apr 26 '18 at 03:25
  • OS and language libraries have *always* had it. Your question is based on a false assumption. – user207421 Apr 26 '18 at 03:26
  • @EJP Sorry I don't understand.. what is my assumption? – Leedehai Apr 26 '18 at 03:26
  • Your assumption is that something about buffering has recently changed that invalidates the quotation from your book. You keep saying things like '*now* that memory buffering is often already implemented by the Operating System' and '*now* that OS and language libraries have it'. There is no 'now'. They've always had it. Nothing has changed. It is always faster to get > 1 character at a time from wherever you're getting it from. You're also asking why the authors said something when they did now that something has allegedly changed, which is just meaningless. – user207421 Apr 26 '18 at 03:28
  • Ok got it. In my sentence "now that" is a phrase, meaning "since", "because", sorry for the confusion. – Leedehai Apr 26 '18 at 03:29
  • Well that's simply not what 'now that' means, is it? In any case it still doesn't make any difference. All applications buffer. That's why you have things like C `stdio` or C++ streams or `java.io.BufferedInputStream` for example. The 'implication' is that it is always quicker to advance a pointer or an index than it is to call a method or a function or a system call. – user207421 Apr 26 '18 at 03:30

2 Answers


As others have pointed out, whether or not the OS buffers (it does), it is very costly for your application to rely on that alone, because those OS/file-system buffers are not in your application's address space. Why? Because for your app to get at that data, it typically needs to travel through layers of calls down to the OS buffers. Doing this one character/byte at a time incurs that overhead on every character.
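To make that cost concrete, here is a small C++ sketch that counts calls to a hypothetical raw source standing in for an unbuffered read; `CountingSource` is an illustration of the call-count arithmetic, not a real OS interface:

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>

// Stand-in for an unbuffered "system call": every read() is one call,
// and we count them to compare the two strategies.
struct CountingSource {
    std::string data;
    std::size_t off = 0;
    int calls = 0;

    std::size_t read(char* dst, std::size_t n) {
        ++calls;
        std::size_t k = std::min(n, data.size() - off);
        std::memcpy(dst, data.data() + off, k);
        off += k;
        return k;
    }
};

// Per-character scan: one call per character (plus a final call that
// returns 0 at end of input).
int scan_per_char(CountingSource& src) {
    char c;
    while (src.read(&c, 1) == 1) { /* DFA step would go here */ }
    return src.calls;
}

// Buffered scan: one call per chunk; the scanner indexes through it.
int scan_buffered(CountingSource& src) {
    char chunk[4096];
    std::size_t got;
    while ((got = src.read(chunk, sizeof chunk)) > 0) {
        for (std::size_t i = 0; i < got; ++i) { /* DFA step on chunk[i] */ }
    }
    return src.calls;
}
```

On a 10,000-character input, the per-character loop issues 10,001 calls while the 4 KiB-chunk loop issues 4; the same ratio applies whether each call is a system call or merely a layered library call.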

If you are using an I/O library: some of them read 'ahead' for performance reasons and keep the OS calls to a minimum.

If you, on the other hand, are operating without a library, then you are strongly advised to set up buffered I/O yourself, for the same reason those libraries do.

Finally, the end result of your compilation is an executable. Unless you disallow I/O entirely, you will want your language-specific run-time (assuming it is self-hosted) to provide buffered I/O for the same reasons. If your run-time is built on a language or set of libraries that already provide it, you should be good.

Frank C.

Do similar reasons apply to lexers in compilers?

Sometimes, certainly. For instance, consider a lexer rule that returns a token indicating "this is a variable name". The parser needs not only the "is a variable name" token itself, but also the actual name: it does no good to know that this is a name without knowing whether the name is Fred, or Wilma, or Barney, or whatever.

If it is a name, though, where should the name be stored? Could you grant the parser access directly to the byte stream holding the name itself? (If so, for how long? How long does the parser need it?)

Or consider a string literal (with whatever syntax these have). Where do the characters that make up the string get stored? Can you grant access directly to the original buffer?

If you own the data structures that provide these scan-time buffers, you can grant limited access to them and know what, precisely, the lifetime of the borrowed strings might be. That may, depending on many more details, let you avoid making copies (at least sometimes). If you're using some library-provided buffered I/O routines that promise only to deliver one character at a time, you definitely cannot grant such access.
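To illustrate the point, here is a C++ sketch (my own names, not the answer's code): the scanner hands out `std::string_view`s into its buffer, which are valid only until the next refill, and interns the names that must outlive it. Interning gives each distinct name one stable copy, so identity can later be tested by pointer comparison.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <unordered_set>

// Sketch of a name interner. intern() takes a view borrowed from the
// scanner's transient buffer and returns a view into stable storage
// owned by the interner, so the borrowed bytes may be reused freely.
class Interner {
public:
    std::string_view intern(std::string_view borrowed) {
        // Copies the name on first sight; later lookups of an equal
        // name return a view of the same stored string. Element
        // storage in unordered_set is stable across rehashing.
        auto [it, inserted] = pool_.emplace(borrowed);
        return *it;
    }

private:
    std::unordered_set<std::string> pool_;
};
```

The trade-off the answer describes shows up here directly: handing the parser the borrowed view avoids a copy but ties its lifetime to the buffer, while interning pays for one copy per distinct name and removes the lifetime question entirely.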

torek
  • I think you might have misunderstood my question. I was talking about the buffering used in the file-stream reading routines; that buffer needs to be cleared when it fills up, so it can't guarantee the lifetime of anything stored in it. But what you said here is indeed a valid point, though on a different topic: the names, literals, etc. all need to be stored somewhere so that the compiler can reference them later. – Leedehai Apr 26 '18 at 17:00
  • 1
    If you look inside the implementation of flex, for instance, you'll see that despite using C stdio by default (which does its own buffering to avoid system call overhead), flex has its own buffers and you're allowed to use them in your parser. Typically the symbol and string tables will *intern* (term taken from Lisp, see eg http://clhs.lisp.se/Body/f_intern.htm) a single copy of most names and some strings and use the interned copy from then on. This allows name equality test via simple pointer comparison. – torek Apr 26 '18 at 17:04