C++ extract data from string

Question

What's an elegant way to extract data from a string (perhaps using a Boost library)?

Content-Type: text/plain
Content-Length: 15
Content-Date: 2/5/2013
Content-Request: Save

hello world

Let's say I have the above string and want to extract all the fields, including the hello world text. Can someone point me in the right direction?

You're looking for some sort of *parser*. Can you describe the expected format of the string? Will it always be 6 lines? Those four field names? 5th line always empty? — Drew Dormann, Feb 05 '13 at 19:45
More info about the format: some of these fields are optional, so they might be omitted. What is assumed, though, is that every field is on its own line, and that after all the fields there is an empty line, followed by the actual content. — marcoo, Feb 05 '13 at 19:48

sehe · Answer 1 · 2013-02-05T20:03:52.563

Try one of these:

http://pocoproject.org/ (comes with HTTPServer and Client implementations)
http://cpp-netlib.github.com/ (comes with request/response handling)
Boost Spirit demo: http://liveworkspace.org/code/3K5TzT

With Spirit you'd have to specify a simple grammar (or a complex one, if you wanted to 'catch' all the subtleties of HTTP):

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

#include <boost/fusion/adapted.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>

typedef std::map<std::string, std::string> Headers;
typedef std::pair<std::string, std::string> Header;
struct Request { Headers headers; std::vector<char> content; };

BOOST_FUSION_ADAPT_STRUCT(Request, (Headers, headers)(std::vector<char>, content))

namespace qi    = boost::spirit::qi;
namespace karma = boost::spirit::karma;

template <typename It, typename Skipper = qi::blank_type>
    struct parser : qi::grammar<It, Request(), Skipper>
{
    parser() : parser::base_type(start)
    {
        using namespace qi;

        header = +~char_(":\n") > ": " > *(char_ - eol);
        start = header % eol >> eol >> eol >> *char_;
    }

  private:
    qi::rule<It, Header(),  Skipper> header;
    qi::rule<It, Request(), Skipper> start;
};

bool doParse(const std::string& input)
{
    auto f(begin(input)), l(end(input));

    parser<decltype(f), qi::blank_type> p;
    Request data;

    try
    {
        bool ok = qi::phrase_parse(f,l,p,qi::blank,data);
        if (ok)   
        {
            std::cout << "parse success\n";
            std::cout << "data: " << karma::format_delimited(karma::auto_, ' ', data) << "\n";
        }
        else      std::cerr << "parse failed: '" << std::string(f,l) << "'\n";

        if (f!=l) std::cerr << "trailing unparsed: '" << std::string(f,l) << "'\n";
        return ok;
    } catch(const qi::expectation_failure<decltype(f)>& e)
    {
        std::string frag(e.first, e.last);
        std::cerr << e.what() << "'" << frag << "'\n";
    }

    return false;
}

int main()
{
    const std::string input = 
        "Content-Type: text/plain\n"
        "Content-Length: 15\n"
        "Content-Date: 2/5/2013\n"
        "Content-Request: Save\n"
        "\n"
        "hello world";

    bool ok = doParse(input);

    return ok? 0 : 255;
}

I'm fairly certain those will be \r\n's at the end of the line in real data. — i_am_jorf, Feb 05 '13 at 23:39

score 4 · Accepted Answer · answered Feb 05 '13 at 19:49

Here is a pretty compact one written in C: https://github.com/openwebos/nodejs/blob/master/deps/http_parser/http_parser.c

Markus Schumann

score 2 · Answer 3 · answered Feb 05 '13 at 19:46

There are several solutions. If the format is this simple, you can read the input line by line. If a line starts with a key, split it to get the value. If it doesn't, the value is the line itself. This can be done with the STL very easily and quite elegantly.

If the grammar is more complex, and since you added boost to the tags, you could consider Boost Spirit to parse it and get the values from it.
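
The line-by-line idea can be sketched with std::getline. This is only an illustration under a few assumptions: lines end in a plain '\n' (no '\r'), the input follows the layout described in the question (header lines, one empty line, then the content), and the `Message` struct and `parseMessage` function are names made up for this example.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical container for the parsed result (not from the original post).
struct Message {
    std::map<std::string, std::string> headers;
    std::string body;
};

// Reads "Key: value" lines until the first empty line,
// then treats everything that is left as the body.
Message parseMessage(const std::string& input) {
    Message msg;
    std::istringstream in(input);
    std::string line;

    while (std::getline(in, line) && !line.empty()) {
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;  // not a "Key: value" line, skip it

        // Drop the spaces or tabs that follow the colon.
        std::string value = line.substr(colon + 1);
        const std::string::size_type start = value.find_first_not_of(" \t");
        value = (start == std::string::npos) ? std::string() : value.substr(start);

        msg.headers[line.substr(0, colon)] = value;
    }

    // Everything after the blank line is the body; its line breaks are kept.
    std::ostringstream rest;
    rest << in.rdbuf();
    msg.body = rest.str();
    return msg;
}
```

Calling `parseMessage` on the string from the question should give a map with four entries and a body of "hello world". Because the headers end up in a std::map, optional fields are simply absent from it.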

score 2 · Answer 4 · answered Feb 05 '13 at 19:47

The simplest solution, I think, is to use regular expressions. C++11 has standard regex support (std::regex), and Boost provides regex libraries as well.

Artem Sobolev

I find Boost.Xpressive to be very useful for this kind of thing. – Bob Murphy Feb 05 '13 at 19:56

score 1 · Answer 5 · answered Feb 05 '13 at 19:47

You can use string::find to locate the whitespace after each field name, then copy from that position until you find a '\n'.
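
A rough sketch of that idea, assuming '\n' line endings and a single space after the colon. The helper names `headerValue` and `messageBody` are invented for this example, and a missing field is reported as an empty string.

```cpp
#include <string>

// Hypothetical helper: returns the value of the header `name`,
// or an empty string if the header is not present.
std::string headerValue(const std::string& input, const std::string& name) {
    const std::string key = name + ": ";
    std::string::size_type pos = input.find(key);

    // Only accept a match that sits at the start of a line.
    while (pos != std::string::npos && pos != 0 && input[pos - 1] != '\n')
        pos = input.find(key, pos + 1);
    if (pos == std::string::npos) return std::string();

    // Copy from just after "Name: " up to the end of that line.
    const std::string::size_type start = pos + key.size();
    const std::string::size_type end = input.find('\n', start);
    return input.substr(start, end == std::string::npos ? std::string::npos : end - start);
}

// Hypothetical helper: the body is whatever follows the first empty line.
std::string messageBody(const std::string& input) {
    const std::string::size_type gap = input.find("\n\n");
    return gap == std::string::npos ? std::string() : input.substr(gap + 2);
}
```

This rescans the input for every field, which is fine for a handful of headers but wasteful for large messages.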

rubbyrubber

score 1 · Answer 6 · edited May 23 '17 at 12:08

If you want to write the code to parse it yourself, start by looking at the HTTP spec for this. This will give you the grammar:

    generic-message = start-line
                      *(message-header CRLF)
                      CRLF
                      [ message-body ]
    start-line      = Request-Line | Status-Line

So the first thing I would do is use split() on CRLF to break the message into its component lines. Then you can iterate through the resulting vector. Until you reach an empty element (the blank line), you are parsing a header, so you split on the first ':' to get the key and value.

Once you hit the empty element, you are parsing the response body.

Warning: having done this myself in the past, I can tell you that not all web servers are consistent about line endings (you may find only a CR or only an LF in places), and not all browsers or other layers of abstraction are consistent in what they pass to you. So you may find extra CRLFs where you wouldn't expect them, or missing CRLFs where you would. Good luck.
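
One way to cope with mixed line endings is to write the split step so that CRLF, a lone LF, and a lone CR all count as one line break. The sketch below does that and then applies the header/body rule described above. `splitLines`, `parseLines` and `ParsedMessage` are names invented for this example, and the body is rejoined with '\n', so its original line endings are not preserved.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Splits on CRLF, a bare LF, or a bare CR; empty lines are kept.
std::vector<std::string> splitLines(const std::string& text) {
    std::vector<std::string> lines;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '\r' || c == '\n') {
            // Treat "\r\n" as a single line break.
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') ++i;
            lines.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty()) lines.push_back(current);
    return lines;
}

// Hypothetical result type for this sketch.
struct ParsedMessage {
    std::vector<std::pair<std::string, std::string> > headers;
    std::string body;
};

ParsedMessage parseLines(const std::vector<std::string>& lines) {
    ParsedMessage out;
    std::size_t i = 0;

    // Header lines run until the first empty element.
    for (; i < lines.size() && !lines[i].empty(); ++i) {
        const std::string::size_type colon = lines[i].find(':');
        if (colon == std::string::npos) continue;  // no ':' on this line, skip it
        const std::string::size_type valueStart = lines[i].find_first_not_of(" \t", colon + 1);
        out.headers.push_back(std::make_pair(
            lines[i].substr(0, colon),
            valueStart == std::string::npos ? std::string() : lines[i].substr(valueStart)));
    }

    // Everything after the empty element is the body, rejoined with '\n'.
    for (std::size_t j = i + 1; j < lines.size(); ++j) {
        if (j > i + 1) out.body += '\n';
        out.body += lines[j];
    }
    return out;
}
```

Headers are stored in a vector of pairs rather than a map so that repeated header names and their order survive.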

score 0 · Answer 7 · answered Feb 05 '13 at 19:47

If you are ready to unroll your loop manually, you can use std::istringstream and normal overloads of the extraction operator (with proper manipulators such as get_time() for working with dates) to extract your data in a simple way.

Another possibility is to use std::regex to match all the patterns like <string>:<string> and iterate over all matches (the egrep grammar seems promising if you have several lines to process).

Or if you want to do it the hard way, and your string has a specific syntax, you can use Boost.Spirit to easily define a grammar and generate a parser.
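
As a rough illustration of the first option, the sketch below unrolls the extraction for the exact four fields from the question, in that order. It assumes all four are present (which the asker says is not guaranteed), that the date is month/day/year, and it reads the date with plain integer extraction rather than std::get_time to keep the example small. `SaveRequest` and `extract` are names invented here.

```cpp
#include <sstream>
#include <string>

// Hypothetical record for the four fields shown in the question.
struct SaveRequest {
    std::string contentType;
    int contentLength;
    int month, day, year;  // assumption: 2/5/2013 means month/day/year
    std::string request;
    std::string body;
};

// Returns true if all four fields could be extracted in the expected order.
bool extract(const std::string& input, SaveRequest& out) {
    std::istringstream in(input);
    std::string label;  // receives "Content-Type:" etc.; not validated in this sketch
    char slash1 = 0, slash2 = 0;

    in >> label >> out.contentType;    // Content-Type: text/plain
    in >> label >> out.contentLength;  // Content-Length: 15
    in >> label >> out.month >> slash1 >> out.day >> slash2 >> out.year;  // Content-Date: 2/5/2013
    in >> label >> out.request;        // Content-Request: Save

    const bool ok = !in.fail() && slash1 == '/' && slash2 == '/';
    // The rest of the input, with leading whitespace skipped, is the body.
    if (ok) std::getline(in >> std::ws, out.body, '\0');
    return ok;
}
```

Because operator>> skips whitespace, this does not care whether lines end in '\n' or "\r\n", but it also cannot tell a missing field from a reordered one unless the labels are checked.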

score 0 · Answer 8 · answered Feb 05 '13 at 20:02

If you have access to C++11 you could use std::regex (http://en.cppreference.com/w/cpp/regex).

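#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string input = "Content-Type: text/plain";
    // Capture everything after the "Content-Type: " prefix.
    std::regex contentTypeRegex("Content-Type: (.+)");

    std::smatch match;

    // regex_match succeeds only if the whole input matches the pattern.
    if (std::regex_match(input, match, contentTypeRegex)) {
        std::ssub_match contentTypeMatch = match[1];
        std::string contentType = contentTypeMatch.str();
        std::cout << contentType;
    }
    // else: not found
}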

Compiling/running version here: http://ideone.com/QTJrue

This regex is a very simplified case but it is the same principle for multiple fields.
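
For several fields at once, one possible extension is to iterate over every "Name: value" match with std::sregex_iterator. This is a sketch rather than a tested production parser: it assumes '\n' line endings, only looks at the text before the first empty line, and `parseHeaders` is a name chosen for this example.

```cpp
#include <map>
#include <regex>
#include <string>

// Collects every "Name: value" line that appears before the first empty line.
std::map<std::string, std::string> parseHeaders(const std::string& input) {
    // Restrict matching to the header section so the body is never scanned.
    // If there is no empty line, find() returns npos and the whole input is used.
    const std::string head = input.substr(0, input.find("\n\n"));

    // Group 1: the field name (no ':' or newline). Group 2: the rest of the line.
    static const std::regex field("([^:\\n]+): *([^\\n]*)");

    std::map<std::string, std::string> headers;
    for (std::sregex_iterator it(head.begin(), head.end(), field), end; it != end; ++it)
        headers[(*it)[1].str()] = (*it)[2].str();
    return headers;
}
```

The character classes exclude '\n' explicitly so that a match never runs across a line break.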
