
I am writing a small program to process a big text file and do some replacements. The problem is that it never stops allocating new memory, so eventually it runs out. I have reduced it to a simple program that just counts the number of lines (see the code below) while still allocating more and more memory. I must admit that I know little about Boost, and Boost Spirit in particular. Could you please tell me what I am doing wrong? Thanks a million!

#include <string>
#include <iostream>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/bind.hpp>
#include <boost/ref.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>

// Token ids
enum token_ids {
    ID_EOL = 100
};

// Token definition
template <typename Lexer>
struct var_replace_tokens : boost::spirit::lex::lexer<Lexer> {
    var_replace_tokens() {
        this->self.add("\n", ID_EOL); // newline characters
    }
};

// Functor
struct replacer {
    typedef bool result_type;
    template <typename Token>
    bool operator()(Token const& t, std::size_t& lines) const  {
        switch (t.id()) {
        case ID_EOL:
            lines++;
            break;  
        }
        return true;
    }
}; 

int main(int argc, char **argv) {
    std::size_t lines = 0;

    var_replace_tokens< boost::spirit::lex::lexertl::lexer< boost::spirit::lex::lexertl::token< boost::spirit::istream_iterator> > > var_replace_functor;

    std::cin.unsetf(std::ios::skipws);

    boost::spirit::istream_iterator first(std::cin);
    boost::spirit::istream_iterator last;

    bool r = boost::spirit::lex::tokenize(first, last, var_replace_functor, boost::bind(replacer(), _1, boost::ref(lines)));

    if (r) {
        std::cerr << "Lines processed: " << lines << std::endl;
    } else {
        std::string rest(first, last);
        std::cerr << "Processing failed at: " << rest << " (line " << lines << ")" << std::endl;
    }
}
Felipe
  • how big is the file / how many lines does it have? – Hayt Nov 09 '16 at 10:25
  • Probably you have no memory leak. Probably the input text file is too big to fit in memory. – ks1322 Nov 09 '16 at 10:50
  • It must be the multi_pass iterator adaptor. Since there is no grammar Spirit doesn't know when it can be flushed. I'll look at this when I have time – sehe Nov 09 '16 at 10:54
  • The file is 7 GB big. – Felipe Nov 09 '16 at 11:01
  • As far as I know, istream_iterator takes care of reading the input stream without having to store the whole stream in memory. Actually, the program starts outputting things (not this one, the original one) from the very beginning. – Felipe Nov 09 '16 at 11:03
  • Have you tried breaking the program and stepping through it to see what it allocates? If you create some internal objects during parsing they will probably fill up the memory. – Hayt Nov 09 '16 at 11:24
  • The size I gave before is before decompressing the file. After decompression it is more than 150 GB big. – Felipe Nov 09 '16 at 11:37
  • @Felipe Which version of boost? – Dan Mašek Nov 09 '16 at 13:21
  • I have tried with 1.55.0 and 1.62.0. Same behaviour in both cases. – Felipe Nov 09 '16 at 14:54

1 Answer


The behaviour is by design.

  • Me: It must be the multi_pass iterator adaptor. Since there is no grammar Spirit doesn't know when it can be flushed. [...]

  • You: As far as I know, istream_iterator takes care of reading the input stream without having to store the whole stream into memory

Yes. But you're not using std::istream_iterator. You're using Boost Spirit. Which is a parser generator. Parsers need random access for backtracking.

Spirit supports input iterators by adapting an input sequence to a random-access sequence with the multi_pass adaptor. This iterator adaptor stores a variable-size buffer¹ for backtracking purposes. Certain actions (expectation points, always-greedy operators like Kleene-* etc) tell the parser framework when it's safe to flush the buffer.

The Problem:

You're not parsing, just tokenizing. Nothing ever tells the iterator to flush its buffers.

The buffer is unbounded, so memory usage grows. Of course it's not a leak because as soon as the last copy of a multi-pass adapted iterator goes out of scope, the shared backtracking buffer is freed.

The Solution:

The simplest solution is to use a random access source. If you can, use a memory mapped file.

Other solutions would involve telling the multi-pass adaptor to flush. The simplest way to achieve this would be to use tokenize_and_parse. Even with a faux grammar like *(any_token) this should be enough to convince the parser framework you will not be asking it to backtrack.


¹ http://www.boost.org/doc/libs/1_62_0/libs/spirit/doc/html/spirit/support/multi_pass.html — by default it stores a shared std::deque. See it for yourself after running your test for a little while with dd if=/dev/zero bs=1M | valgrind --tool=massif ./sotest:

(Massif screenshot: heap usage dominated by the multi_pass backtracking buffer)

Clearly shows all the memory in

100.00% (805,385,576B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.99% (805,306,368B) 0x4187D5: void boost::spirit::iterator_policies::split_std_deque::unique<char>::increment<boost::spirit::multi_pass<std::istream, boost::spirit::iterator_policies::default_policy<boost::spirit::iterator_policies::ref_counted, boost::spirit::iterator_policies::no_check, boost::spirit::iterator_policies::istream, boost::spirit::iterator_policies::split_std_deque> > >(boost::spirit::multi_pass<std::istream, boost::spirit::iterator_policies::default_policy<boost::spirit::iterator_policies::ref_counted, boost::spirit::iterator_policies::no_check, boost::spirit::iterator_policies::istream, boost::spirit::iterator_policies::split_std_deque> >&) (in /home/sehe/Projects/stackoverflow/sotest)
| ->99.99% (805,306,368B) 0x404BC3: main (in /home/sehe/Projects/stackoverflow/sotest)
sehe
  • Thank you very much for your answer. I have adapted the Boost example that parses a file and counts the number of lines, words and characters (http://www.boost.org/doc/libs/1_62_0/libs/spirit/example/lex/word_count.cpp) to read from the standard input and it still does not stop allocating memory. – Felipe Nov 10 '16 at 11:36
  • @Felipe I just checked. Indeed, contrary to what [discussion here suggests](http://boost.2283326.n4.nabble.com/Can-I-take-control-of-a-multipass-iterator-within-a-member-function-tp2674288p4689753.html) expectation points inside the kleene-star/plus did **not** flush the iterator. I presume this is an untested edge case with lex iterators on a multi-pass-adaptor. – sehe Nov 10 '16 at 13:36
  • Here's a [workaround using the `flush_multi_pass` directive](http://coliru.stacked-crooked.com/a/73a5dd5efa7de356). In my test memory usage drops from 1.3 GiB to 82 kB in the case of a 90 MiB input file – sehe Nov 10 '16 at 13:41
  • Thank you very much, indeed. That solved the problem! – Felipe Nov 10 '16 at 14:55