
I have a large file whose code page is CP1251, and I want to parse it with Boost.Spirit. Parsing succeeds until the parser meets non-ASCII characters. The Boost documentation says:

Wide-character versions of the memory-mapped file Devices may be defined as follows, using the template code_converter:

#include <boost/iostreams/code_converter.hpp>
#include <boost/iostreams/device/mapped_file.hpp>

typedef code_converter<mapped_file_source>  wmapped_file_source;
typedef code_converter<mapped_file_sink>    wmapped_file_sink;

But should I use it? My code doesn't need a sink. My assumption is this: the parser takes iterators from the source, code_converter converts the characters using the code page I give it, and the translated characters are passed to the parser, which parses the file.

So, this is part of my code which doesn't work:

typedef boost::iostreams::code_converter<boost::iostreams::mapped_file>      wmapped_file_source;  
boost::locale::generator gen;
std::locale lru = gen("ru_RU.CP1251");
wmapped_file_source mmap;
mmap.imbue(lru);
mmap.open(current_task.filename);

RhAst::RhFile rh_file(this);
bool res = phrase_parse(mmap->begin(), mmap->end(), parser, space - eol, rh_file);

I tried to create my own locale object:

class LocaleRus : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit LocaleRus(size_t r = 0) : std::codecvt <wchar_t, char, std::mbstate_t> ( r )
    {
    }

protected:
    result do_in ( state_type&, const char* from, const char* from_end, const char*& from_next, char* to, char*, char*& to_next ) const
    {
        const int size = from_end - from;
        //::OemToCharBuff ( from, to, size );

        from_next = from + size;
        to_next = to + size ;

        return ok;
    }

    result do_out ( state_type&, const char* from, const char* from_end, const char*& from_next, char* to, char*, char*& to_next ) const
    {
        const int size = from_end - from;
        //::CharToOemBuff ( from, to, size );

        from_next = from + size;
        to_next = to + size ;

        return ok;
    }

    result do_unshift ( state_type&, char*, char*, char*& ) const { return ok; }
    int do_encoding () const throw () { return 1; }
    bool do_always_noconv () const throw () { return false; }

    int do_length ( state_type& state, const char* from, const char* from_end, size_t max ) const
    {
        return std::codecvt <wchar_t, char, std::mbstate_t>::do_length ( state, from, from_end, max );
    }

    int do_max_length () const throw ()
    {
        return std::codecvt <wchar_t, char, std::mbstate_t>::do_max_length ();
    }
};

and use it in code:

std::locale lru(std::locale(), new LocaleRus());

But its methods are never called. I didn't expect it to be this hard to read a memory-mapped file with a non-standard code page. What am I doing wrong?

denn
  • If I understand the problem correctly, you are using default character parsers which are from `ascii` namespace and they are designed to accept only ascii characters. You can use parsers from `boost::spirit::standard` (like `boost::spirit::standard::char_`) to work with single-byte encodings or `boost::spirit::standard_wide` to work with wide-characters and `boost::spirit::unicode` to work with unicode characters. – Nikita Kniazev Nov 02 '18 at 16:54
  • Should the code_converter convert the code page into ASCII or not? – denn Nov 03 '18 at 08:30
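
The point in the first comment can be illustrated without Spirit. This is a minimal sketch, assuming that an ASCII-only character parser boils down to a 7-bit range test; `is_ascii_byte` and `first_non_ascii` are hypothetical helpers, not Spirit functions. It shows where such a parser would stop on CP1251 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true when the byte fits in 7-bit ASCII.
inline bool is_ascii_byte(unsigned char byte) {
    return byte < 0x80;
}

// Returns the index of the first byte an ASCII-only check would reject,
// or std::string::npos when every byte is plain ASCII.
inline std::size_t first_non_ascii(const std::string& text) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (!is_ascii_byte(static_cast<unsigned char>(text[i]))) {
            return i;
        }
    }
    return std::string::npos;
}
```

For example, in CP1251 the Cyrillic letter "а" is the byte 0xE0, so `first_non_ascii("ab\xE0")` reports index 2, which matches the symptom of parsing that works only up to the first non-ASCII character.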

1 Answer


You should use it ¹, definitely.

What you're looking for is Spirit's stream iterators. Spirit has a predefined one (boost::spirit::istream_iterator), but you need custom types here because of the custom stream.

What boost::spirit::istream_iterator does is wrap a regular iterator in the Multipass Iterator Adapter. Basically what it does is remove the forward-only-and-single-use limitations of InputIterator.

It does so by keeping a buffer for backtracking.
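
To make the buffering idea concrete, here is a minimal sketch using only the standard library. It is not how Spirit's multi_pass is implemented; `BufferedSource`, `match`, and `parse_abd_or_abc` are hypothetical names chosen for illustration. The wrapper remembers every character read from a single-pass stream so that a failed alternative can rewind.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical illustration only; Spirit's multi_pass is more elaborate.
// Wraps a single-pass stream and keeps everything read so far.
class BufferedSource {
public:
    explicit BufferedSource(std::istream& in) : it_(in) {}

    // Stores the character at `pos` in `out`; returns false past end of input.
    bool at(std::size_t pos, char& out) {
        while (buffer_.size() <= pos && it_ != end_) {
            buffer_.push_back(*it_);  // pull from the stream only on demand
            ++it_;
        }
        if (pos >= buffer_.size()) return false;
        out = buffer_[pos];
        return true;
    }

private:
    std::istreambuf_iterator<char> it_;
    std::istreambuf_iterator<char> end_;
    std::vector<char> buffer_;
};

// Tries to match `literal` at `pos`. On success `pos` advances;
// on failure `pos` is left untouched, which is the backtracking step.
bool match(BufferedSource& src, std::size_t& pos, const std::string& literal) {
    std::size_t cursor = pos;
    for (char expected : literal) {
        char actual;
        if (!src.at(cursor, actual) || actual != expected) return false;
        ++cursor;
    }
    pos = cursor;
    return true;
}

// Ordered choice as in a PEG: try "abd" first, then fall back to "abc".
bool parse_abd_or_abc(const std::string& text) {
    std::istringstream in(text);
    BufferedSource src(in);
    std::size_t pos = 0;
    return match(src, pos, "abd") || match(src, pos, "abc");
}
```

With the input "abc", the first alternative consumes "ab" from the stream and then fails; the second alternative succeeds only because those two characters are still in the buffer. The cost is that the buffer grows until something discards it, which is why flushing matters for very large inputs.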

I think you should be able to use something similar to:

boost::locale::generator gen;
std::locale lru = gen("ru_RU.CP1251");

typedef boost::iostreams::code_converter<boost::iostreams::mapped_file>      wmapped_file_source;  
wmapped_file_source mmap;
mmap.imbue(lru);
mmap.open(current_task.filename);

RhAst::RhFile rh_file(this);

boost::iostreams::stream<wmapped_file_source> map_source(mmap);

// The converted stream is wide, so iterate over wchar_t rather than char
typedef std::istreambuf_iterator<wchar_t> base_iterator_type;

spirit::multi_pass<base_iterator_type>
    first = spirit::make_default_multi_pass(base_iterator_type(map_source)),
    last  = spirit::make_default_multi_pass(base_iterator_type());

bool res = qi::phrase_parse(first, last, parser, qi::blank, rh_file);

Notes:

  1. I typed this in the browser, no time to check it yet
  2. You could use boost::iostreams::stream_buffer instead; it may be more efficient(?)
  3. qi::space - qi::eol is qi::blank, so likely using boost::spirit::qi::blank_type as the skipper is more efficient
  4. BEWARE: depending on how your grammar is structured you might run into bad multi-pass edge cases. You may want to be explicit about when to flush (expectation points do this automatically), see e.g.


¹ assuming the conversions do what you need them to do
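
On that assumption: the `do_in` and `do_out` in the question take `char*` where the base class expects `wchar_t*`, so they overload the virtuals instead of overriding them, which is one likely reason they are never called. Below is a minimal sketch of a facet with matching signatures. `Cp1251Codecvt` and `decode_cp1251` are hypothetical names, the mapping covers only ASCII, the letters А..я, and Ё/ё, and whether code_converter picks the facet up through `imbue` is not verified here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical facet: decodes CP1251 bytes to wide characters.
// Assumption: bytes 0xC0..0xFF map to U+0410..U+044F, 0xA8/0xB8 to Ё/ё,
// and every other high byte becomes U+FFFD (a deliberate simplification).
class Cp1251Codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    // Note `wchar_t* to`: with `char* to` this would not override the virtual.
    result do_in(state_type&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override {
        for (; from != from_end && to != to_end; ++from, ++to) {
            const unsigned char byte = static_cast<unsigned char>(*from);
            if (byte < 0x80)       *to = static_cast<wchar_t>(byte);
            else if (byte >= 0xC0) *to = static_cast<wchar_t>(0x0410 + (byte - 0xC0));
            else if (byte == 0xA8) *to = static_cast<wchar_t>(0x0401);
            else if (byte == 0xB8) *to = static_cast<wchar_t>(0x0451);
            else                   *to = static_cast<wchar_t>(0xFFFD);
        }
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }
    int do_encoding() const noexcept override { return 1; }  // one byte per character
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }
    int do_length(state_type&, const char* from, const char* from_end,
                  std::size_t max) const override {
        const std::size_t available = static_cast<std::size_t>(from_end - from);
        return static_cast<int>(std::min(max, available));
    }
};

// Decodes a CP1251 byte string through the facet installed in a locale.
std::wstring decode_cp1251(const std::string& bytes) {
    const std::locale loc(std::locale::classic(), new Cp1251Codecvt);
    const auto& cvt =
        std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc);
    std::wstring wide(bytes.size(), L'\0');
    std::mbstate_t state{};
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
           &wide[0], &wide[0] + wide.size(), to_next);
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));
    return wide;
}
```

Marking each member `override` turns a signature mismatch into a compile error instead of a silently ignored method. A locale built this way could then be passed wherever the answer passes `lru`.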

sehe
  • IIUC OP has the problem that `char_` fails on non-ascii characters (`boost::spirit::char_encoding::ascii::ischar(0xE0)` will return false). – Nikita Kniazev Nov 02 '18 at 16:49
  • Possibly. I don't see any parser expression, so it's hard to tell – sehe Nov 02 '18 at 17:36
  • I forgot to say that I use Spirit X3, but that probably doesn't matter. – denn Nov 03 '18 at 08:12
  • @Nikita Kniazev, if I translate the input file into local 8-bit code page before the parsing, I parse it successfully. But I cannot do it because it's very large and the maximum size is undefined. I think the code_converter should do the work of translating on-the-fly and I can use the parser expressions without any changes (char_ or wchar_). – denn Nov 03 '18 at 08:16
  • @sehe, why do I need the buffer backtracking? – denn Nov 03 '18 at 08:35
  • "if I translate the input file into local 8-bit code page before the parsing" - in your question you said it already is: "I have a large file. It's code page is CP1251." You'd better post an [MCVE](https://stackoverflow.com/help/mcve). – Nikita Kniazev Nov 03 '18 at 15:38
  • @denn You need it in order to run a PEG parser because - by nature - it must be able to backtrack, which isn't possible on `InputIterator` (requires at least `ForwardIterator`). The linked docs for `multi_pass` should explain in much more detail – sehe Nov 05 '18 at 14:58