
I've written a JSON library that uses flex and bison to parse serialized JSON (i.e. strings) and deserialize it into JSON objects. It works great for small strings.

However, it fails with very large strings (I tried strings of almost 3 GB), giving this error:

`fatal flex scanner internal error--end of buffer missed`

I want to know the maximum size of buffer that I can pass to this function:

    // js: serialized JSON stored in a std::string
    yy_scan_bytes(js.data(), js.size());

and how can I make flex/bison work with large buffers?

Nawaz
  • Where do your serialized-json files come from? If you generate these serialized-json files yourself, you should split them into several files. – Pierre Emmanuel Lallemant Feb 24 '16 at 12:19
  • Why don't you use or study the source code of existing JSON libraries, like [jansson](http://www.digip.org/jansson/) or [jsoncpp](https://github.com/open-source-parsers/jsoncpp). Using `bison` & `flex` is overkill for JSON parsing – Basile Starynkevitch Feb 24 '16 at 12:21
  • @PierreEmmanuelLallemant: Yes, that is one option (as workaround). However, if possible, I want the library to have this ability to parse strings of any size, as long as *time* and *memory* allow. – Nawaz Feb 24 '16 at 12:21
  • @BasileStarynkevitch: those are alternatives we cannot use at this moment. The problem is: how to use flex/bison to do this job (even if that is overkill, that is a different matter altogether). – Nawaz Feb 24 '16 at 12:22
  • Dive into the source code of `flex` & `bison` and into the generated code. But you are using the wrong tools (`flex` & `bison` are not appropriate for you if you need to parse a 3 GB string) and reinventing the wheel. BTW, you should explain in your question why you cannot use existing JSON libraries – Basile Starynkevitch Feb 24 '16 at 12:24
  • Is the maximum buffer size that flex/bison can parse without failing known? – Nawaz Feb 24 '16 at 12:26
  • @BasileStarynkevitch: *BTW, you should explain in your question why you cannot use existing JSON libraries*. That is a different question. You can assume anything that satisfies you, such as we've this internal library, which is used pretty much everywhere (which is true BTW, not an assumption). – Nawaz Feb 24 '16 at 12:29
  • You should feed *flex* from a stream, not a buffer. Wherever you're getting these huge strings from, you're just adding latency by reading them in entirely first. – user207421 Feb 24 '16 at 14:00
  • @EJP: Yes, that makes sense. I'll add this functionality. – Nawaz Feb 24 '16 at 14:18

1 Answer


It looks to me like you are using an old version of the flex skeleton (and hence of flex), in which string lengths were assumed to fit into ints. The error message you are observing is probably the result of an int overflowing to a negative value.
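A quick sketch of the suspected failure mode (my illustration, not code from the answer; it assumes a 32-bit `int`, and the wrap-around shown is what common ABIs do, though converting an out-of-range value to `int` is implementation-defined):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    std::size_t len = 3ULL * 1024 * 1024 * 1024;   // ~3 GB, as in the question
    int n = static_cast<int>(len);                 // what an old skeleton effectively stores
    std::printf("as size_t: %zu, as int: %d\n", len, n);  // the int comes out negative
    // With a negative stored length, a bounds check of the form
    //   ptr < &buf[n + 1]
    // fails immediately, which is consistent with "end of buffer missed".
    return 0;
}
```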

I believe that if you switch to version 2.5.37 or more recent, you'll find that most of those ints have become size_t and you should have no problem calling yy_scan_bytes with an input buffer whose size exceeds 2 gigabytes. (The prototype for that function now takes a size_t rather than an int, for example.)
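If upgrading isn't immediately possible, one stop-gap (my suggestion, not part of the answer; the function name is hypothetical) is to reject inputs whose length would not survive the skeleton's `int` conversion:

```cpp
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical guard for callers stuck on an older flex with int-sized lengths.
void scan_json_checked(const std::string& js) {
    if (js.size() > static_cast<std::size_t>(INT_MAX))
        throw std::length_error("input too large for this flex skeleton (int-sized lengths)");
    // yy_scan_bytes(js.data(), js.size());   // hand off to the generated scanner here
}
```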

I have a hard time believing that doing so is a good idea, however. For a start, yy_scan_bytes copies the entire string, because the lexical scanner wants a string it is allowed to modify, and because it wants to assure itself that the string has two NUL bytes at the end. Making that copy is going to needlessly use up a lot of memory, and if you're going to copy the buffer anyway, you might as well copy it in manageable pieces (say, 64 KiB or even 1 MiB). That will only prove problematic if you have single tokens which are significantly larger than the chunk size, because flex is definitely not optimized for large single tokens. But for all normal use cases, it will probably work out a lot better.

Flex doesn't provide an interface for splitting a huge input buffer into chunks, but you can do it very easily by redefining the YY_INPUT macro. (If you do that, you'll probably end up using yyin as a pointer to your own buffer structure, which is theoretically non-portable. However, it will work on any Posix architecture, where all object pointers have the same representation.)
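A minimal sketch of that approach, assuming a plain (non-reentrant) C scanner; `BufferFeed` and its fields are illustrative names, not part of flex, and the `yyin` cast is the technically non-portable trick described above. The macro goes in the definitions section of the `.l` file:

```c
%{
#include <string.h>

/* Illustrative state describing the in-memory input. */
typedef struct {
    const char *data;   /* start of the serialized JSON */
    size_t      size;   /* total length in bytes */
    size_t      pos;    /* bytes already handed to the scanner */
} BufferFeed;

/* Hand the scanner at most max_size bytes per call instead of one 3 GB copy. */
#define YY_INPUT(buf, result, max_size)                                       \
    do {                                                                      \
        BufferFeed *feed = (BufferFeed *) yyin;                               \
        size_t left = feed->size - feed->pos;                                 \
        size_t n = left < (size_t) (max_size) ? left : (size_t) (max_size);   \
        if (n == 0) {                                                         \
            result = YY_NULL;                      /* signal end of input */  \
        } else {                                                              \
            memcpy(buf, feed->data + feed->pos, n);                           \
            feed->pos += n;                                                   \
            result = n;                                                       \
        }                                                                     \
    } while (0)
%}
```

The caller would then do something like `BufferFeed feed = { js.data(), js.size(), 0 }; yyin = (FILE *) &feed;` before invoking the parser.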

Of course, you would normally not want to wait while 3GB of data is accumulated in memory to start parsing it. You could parse incrementally as you read the data. (You might still need to redefine YY_INPUT, depending on how you are reading the data.)
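For the common case where the data comes from a file or other `FILE *` stream, no `YY_INPUT` override is needed at all, since flex already reads its input in buffer-sized chunks. A sketch (error handling kept minimal; wrap the declarations in `extern "C"` if the generated scanner and parser are compiled as C):

```cpp
#include <cstdio>

extern FILE *yyin;      // provided by the flex-generated scanner
extern int yyparse();   // provided by the bison-generated parser

int parse_json_file(const char *path) {
    FILE *f = std::fopen(path, "rb");
    if (!f) return -1;
    yyin = f;             // flex reads the file incrementally on its own
    int rc = yyparse();   // parse as you read; no 3 GB std::string needed
    std::fclose(f);
    return rc;
}
```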

rici
  • Yes, it is `int` overflow. I've verified it. And I've commented the same on the *now-deleted* answer (which you can still see anyway). – Nawaz Feb 26 '16 at 05:36
  • @nawaz: Do you want me to explain in painful detail why that integer overflow causes precisely that error message :) The answer holds: either upgrade flex, or feed the string in pieces (which will work with old versions as well), or -- ideally -- do both. – rici Feb 26 '16 at 05:41
  • @nawaz: Fwiw, I believe you'll see the error when you reach the end of the buffer, because flex checks to make sure that the current buffer pointer is still less than the end of the buffer (i.e. `&yy_ch_buf[yy_n_chars + 1]`). But since `yy_n_chars` is an `int` in your version, that index is negative, so the comparison fails. – rici Feb 26 '16 at 05:47
  • Your analysis is correct. I'm accepting this answer. – Nawaz Feb 26 '16 at 05:47
  • I tried to parse `file.json` in Python, and the `json` module fails to parse it (see [topic-1](http://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python) and [topic-2](http://stackoverflow.com/questions/2400643/is-there-a-memory-efficient-and-fast-way-to-load-big-json-files-in-python) on SO) – Nawaz Feb 26 '16 at 11:52
  • @nawaz: those links appear to be about memory issues: it's not that the module can't parse the huge input; rather, the available memory is not enough to hold the resulting structures. (Which is not surprising.) You might be able to create a more compact internal structure in C. – rici Feb 26 '16 at 14:00
  • Maybe in those links the issues are caused by lack of memory. I don't think I have that problem on my system, because it has lots of RAM, more than 100 GB. – Nawaz Feb 26 '16 at 16:45
  • @nawaz: perhaps, but I can imagine a 3 GB JSON file as a Python data structure being more than 100 GB. Also, you'd need to verify your `ulimit` in case you've configured your system to restrict the maximum size of an individual process. – rici Feb 26 '16 at 19:11