3

What's an elegant way to extract data from a string (perhaps using a Boost library)?

Content-Type: text/plain
Content-Length: 15
Content-Date: 2/5/2013
Content-Request: Save

hello world

Let's say I have the above string and want to extract all the fields, including the hello world text. Can someone point me in the right direction?

marcoo
  • You're looking for some sort of *parser*. Can you describe the expected format of the string? Will it always be 6 lines? Those four field names? 5th line always empty? – Drew Dormann Feb 05 '13 at 19:45
  • More info about the format: some of these fields are optional, so they might be omitted. What can be assumed, though, is that every field is on its own line, and after all the fields there's an empty line, followed by the actual content. – marcoo Feb 05 '13 at 19:48

8 Answers

4

Try

  • http://pocoproject.org/

    Comes with HTTPServer and Client implementations

  • http://cpp-netlib.github.com/

    Comes with request/response handling

  • Boost Spirit demo: http://liveworkspace.org/code/3K5TzT

    You'd have to specify a simple grammar (or complex, if you wanted to 'catch' all the subtleties of HTTP)

    #include <boost/fusion/adapted.hpp>
    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/karma.hpp>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>
    
    typedef std::map<std::string, std::string> Headers;
    typedef std::pair<std::string, std::string> Header;
    struct Request { Headers headers; std::vector<char> content; };
    
    BOOST_FUSION_ADAPT_STRUCT(Request, (Headers, headers)(std::vector<char>, content))
    
    namespace qi    = boost::spirit::qi;
    namespace karma = boost::spirit::karma;
    
    template <typename It, typename Skipper = qi::blank_type>
        struct parser : qi::grammar<It, Request(), Skipper>
    {
        parser() : parser::base_type(start)
        {
            using namespace qi;
    
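            // lexeme[] stops the blank skipper from eating spaces inside
            // header names, header values and the message body
            header = lexeme[+~char_(":\n")] > ": " > lexeme[*(char_ - eol)];
            start = header % eol >> eol >> eol >> lexeme[*char_];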
        }
    
      private:
        qi::rule<It, Header(),  Skipper> header;
        qi::rule<It, Request(), Skipper> start;
    };
    
    bool doParse(const std::string& input)
    {
        auto f(begin(input)), l(end(input));
    
        parser<decltype(f), qi::blank_type> p;
        Request data;
    
        try
        {
            bool ok = qi::phrase_parse(f,l,p,qi::blank,data);
            if (ok)   
            {
                std::cout << "parse success\n";
                std::cout << "data: " << karma::format_delimited(karma::auto_, ' ', data) << "\n";
            }
            else      std::cerr << "parse failed: '" << std::string(f,l) << "'\n";
    
            if (f!=l) std::cerr << "trailing unparsed: '" << std::string(f,l) << "'\n";
            return ok;
        } catch(const qi::expectation_failure<decltype(f)>& e)
        {
            std::string frag(e.first, e.last);
            std::cerr << e.what() << "'" << frag << "'\n";
        }
    
        return false;
    }
    
    int main()
    {
        const std::string input = 
            "Content-Type: text/plain\n"
            "Content-Length: 15\n"
            "Content-Date: 2/5/2013\n"
            "Content-Request: Save\n"
            "\n"
            "hello world";
    
        bool ok = doParse(input);
    
        return ok? 0 : 255;
    }
    
sehe
4

Here is a pretty compact one written in C: https://github.com/openwebos/nodejs/blob/master/deps/http_parser/http_parser.c

Markus Schumann
2

There are several solutions. If the format is this simple, you can just read the input line by line. If a line starts with a key, split it to get the value. If it doesn't, the value is the line itself. This can be done with the STL very easily and quite elegantly.

If the grammar is more complex, and since you added boost to the tags, you could consider Boost Spirit to parse it and get the values from it.
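A minimal sketch of that line-by-line approach, using only the standard library. The `Message` struct and the `parseMessage` name are made up for illustration, and it assumes `\n` line endings and a blank line separating the fields from the content, as described in the question's comments:

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical container for the parsed result; the names are illustrative.
struct Message {
    std::map<std::string, std::string> headers;
    std::string body;
};

// Reads "Key: value" lines until the first empty line, then treats the
// remainder of the input as the body.
Message parseMessage(const std::string& input)
{
    Message msg;
    std::istringstream in(input);
    std::string line;

    while (std::getline(in, line) && !line.empty()) {
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos)
            continue; // not a "Key: value" line; this sketch just skips it

        // Skip blanks between the colon and the value.
        const std::string::size_type valueStart =
            line.find_first_not_of(" \t", colon + 1);
        msg.headers[line.substr(0, colon)] =
            valueStart == std::string::npos ? std::string()
                                            : line.substr(valueStart);
    }

    // Everything after the blank separator line is the body.
    std::ostringstream rest;
    rest << in.rdbuf();
    msg.body = rest.str();
    return msg;
}
```

Calling `parseMessage` on the string from the question should leave the four fields in `headers` and `hello world` in `body`. Optional fields need no special handling because the map only contains what was actually present.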

Baptiste Wicht
2

The simplest solution, I think, is to use regular expressions. C++11 has a standard regex library (std::regex), and Boost provides one as well (Boost.Regex).
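As a rough sketch of the regex route with C++11 `std::regex`: the function name `extractHeaders` and the pattern are my own assumptions, not something prescribed by the format, and the code assumes the fields end at the first blank line (`\n\n`):

```cpp
#include <map>
#include <regex>
#include <string>

// Collects every "Name: value" line that appears before the first blank line.
std::map<std::string, std::string> extractHeaders(const std::string& input)
{
    // Name (no colon or line break), colon, optional blanks, value up to
    // the end of the line. The CR/LF characters are embedded literally.
    static const std::regex headerRe("([^:\r\n]+):[ \t]*([^\r\n]*)");

    // Restrict the search to the header part; if there is no blank line,
    // find() returns npos and substr() keeps the whole input.
    const std::string head = input.substr(0, input.find("\n\n"));

    std::map<std::string, std::string> headers;
    for (std::sregex_iterator it(head.begin(), head.end(), headerRe), end;
         it != end; ++it) {
        headers[(*it)[1].str()] = (*it)[2].str();
    }
    return headers;
}
```

The content itself is simply whatever follows the blank line, so it can be taken with `substr` rather than with a regex.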

Artem Sobolev
1

You can use string::find to locate the space after each field name, then copy from that position until you reach a '\n'.
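Something along these lines, as a sketch only. `findField` is a hypothetical helper name, and it assumes each field looks like `Name: value` and starts at the beginning of a line:

```cpp
#include <string>

// Returns the value of the field `name`, or an empty string if it is absent.
std::string findField(const std::string& input, const std::string& name)
{
    const std::string key = name + ": ";

    // Find the key, but only accept a hit at the start of a line.
    std::string::size_type pos = input.find(key);
    while (pos != std::string::npos && pos != 0 && input[pos - 1] != '\n')
        pos = input.find(key, pos + 1);
    if (pos == std::string::npos)
        return std::string();

    // Copy from just after ": " up to (not including) the next '\n'.
    const std::string::size_type start = pos + key.size();
    const std::string::size_type stop = input.find('\n', start);
    return input.substr(start, stop == std::string::npos ? std::string::npos
                                                         : stop - start);
}
```

Note that this cannot tell a missing field from an empty one, and it would also match a line inside the content that happens to look like a field.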

rubbyrubber
1

If you want to write the code to parse it yourself, start by looking at the HTTP spec for this. This will give you the grammar:

    generic-message = start-line
                      *(message-header CRLF)
                      CRLF
                      [ message-body ]
    start-line      = Request-Line | Status-Line

So the first thing I would do is use split() on CRLF to break it into the composite lines. Then you can iterate through the resulting vector. Until you get to an element that is a blank CRLF, you are parsing a header, so you split on the first ':' to get the key and value.

Once you hit the empty element, you are parsing the response body.

Warning: having done this myself in the past, I can tell you that not all web servers are consistent about line endings (you may find only a CR or only an LF in places), and not all browsers or other layers of abstraction are consistent in what they pass to you. So you may find extra CRLFs where you wouldn't expect them, or missing CRLFs where you would. Good luck.
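One way to cope with the inconsistent line endings is to normalize them before splitting. This is only a sketch: `normalizeLineEndings` is a made-up name, and it assumes it is acceptable to rewrite line endings inside the body as well (so it is not suitable for binary content):

```cpp
#include <string>

// Rewrites CRLF and lone CR as a single LF so later code only has to
// deal with one kind of line ending.
std::string normalizeLineEndings(const std::string& input)
{
    std::string out;
    out.reserve(input.size());

    for (std::string::size_type i = 0; i < input.size(); ++i) {
        if (input[i] == '\r') {
            out += '\n';
            if (i + 1 < input.size() && input[i + 1] == '\n')
                ++i; // CRLF counts as one line ending
        } else {
            out += input[i];
        }
    }
    return out;
}
```

After this step, splitting on `'\n'` and treating the first empty line as the header/body separator works the same regardless of what the sender used.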

Community
i_am_jorf
0

If you are ready to unroll your loop manually, you can use std::istringstream and normal overloads of the extraction operator (with proper manipulators such as get_time() for working with dates) to extract your data in a simple way.

Another possibility is to use std::regex to match all the patterns like <string>:<string> and iterate over all matches (the egrep grammar seems promising if you have several lines to process).

Or if you want to do it the hard way, and your string has a specific syntax, you can use Boost.Spirit to easily define a grammar and generate a parser.

Andy Prowl
0

If you have access to C++11 you could use std::regex (http://en.cppreference.com/w/cpp/regex).

    #include <iostream>
    #include <regex>
    #include <string>

    int main()
    {
        const std::string input = "Content-Type: text/plain";
        const std::regex contentTypeRegex("Content-Type: (.+)");

        std::smatch match;

        // regex_match succeeds only if the whole input matches the pattern
        if (std::regex_match(input, match, contentTypeRegex)) {
            const std::string contentType = match[1].str(); // first capture group
            std::cout << contentType << '\n';
        }
        // else: not found
    }

Compiling/running version here: http://ideone.com/QTJrue

This regex is a very simplified case, but the same principle applies to multiple fields.

Robert Prior