It looks to me like you are using an old version of the flex skeleton (and hence of flex), in which string lengths were assumed to fit into `int`s. The error message you are observing is probably the result of an `int` overflowing to a negative value.
I believe that if you switch to version 2.5.37 or more recent, you'll find that most of those `int`s have become `size_t`s, and you should have no problem calling `yy_scan_bytes` with an input buffer whose size exceeds 2 gigabytes. (The prototype for that function now takes a `size_t` rather than an `int`, for example.)
I have a hard time believing that doing so is a good idea, however. For a start, `yy_scan_bytes` copies the entire string, because the lexical scanner wants a string it is allowed to modify, and because it wants to assure itself that the string has two NUL bytes at the end. Making that copy will needlessly use up a lot of memory, and if you're going to copy the buffer anyway, you might as well copy it in manageable pieces (say, 64 KiB or even 1 MiB). That will only prove problematic if you have single tokens which are significantly larger than the chunk size, because flex is definitely not optimized for large single tokens. But for all normal use cases, it will probably work out a lot better.
Flex doesn't provide an interface for splitting a huge input buffer into chunks, but you can do it very easily by redefining the `YY_INPUT` macro. (If you do that, you'll probably end up using `yyin` as a pointer to your own buffer structure, which is theoretically non-portable. However, it will work on any Posix architecture, where all object pointers have the same representation.)
Of course, you would normally not want to wait while 3GB of data accumulates in memory before starting to parse it. You could parse incrementally as you read the data. (You might still need to redefine `YY_INPUT`, depending on how you are reading the data.)