
I am trying to extract an XML attribute value from a std::string that contains XML. I cannot use an XML parser or anything outside the standard library here. Note that I'm only looking for this one attribute, not really parsing the XML, so integrating a library/parser just for this extraction does not make sense.

A sample string:

<Params>
<Element Name="elem(1)"/>
<Some Value="10"/>
<Element Name="elem(2)" /> 
<Attr Value="40" />
</Params>

The strings I need to extract are specifically: elem(1) and elem(2)

To match, I'm using a start and an end delimiter: the start string is "<Element Name=\"" and the end string is "\"".

I put together this code after scouring through many SO posts:

#include <iostream>
#include <iterator>
#include <regex>
#include <string>

int main()
{
    const std::string s = "<Element Name=\"elem(1)\"/> <Some Value=\"10\" Unit=\"m\"/> <Element Name=\"elem(2)\"/> <Attr Value=\"40\" />";
    std::string start = "<Element Name=\"";
    std::string end = "\"";

    // Pattern: start delimiter, anything, end delimiter
    std::regex words_regex(start + "(.*)" + end);

    auto words_begin = std::sregex_iterator(s.begin(), s.end(), words_regex);
    auto words_end = std::sregex_iterator();

    std::cout << "Found "
              << std::distance(words_begin, words_end)
              << " words:\n";

    // Print the full text of every match
    for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
        std::smatch match = *i;
        std::string match_str = match.str();
        std::cout << match_str << '\n';
    }
}

The problem is that the match runs all the way to the last double quote in the string. I will handle collecting multiple sub-strings myself, but first I need to ensure the regex returns at least the first sub-string correctly.

I've seen many mentions of positive look-ahead with regex and am trying to understand it, but I haven't been able to get it to work with std::regex yet. Is it fully supported? (Compiling with Visual Studio 2015 and GCC 4.8.2.)

Other solutions are also welcome as long as they do not involve third party libraries and are achievable with std C++11 code.

JBL
  • You say it makes no sense to use an XML parser, but depending on the scope of your code it might make sense in the end. If you're just supposed to get a specific property from an XML document, using regex to do that makes *even less* sense than using a parser. If you can't use an XML parser, that's another story. But if you can, I would strongly suggest using one. I'd also suggest looking at that [famous answer](https://stackoverflow.com/a/1732454/1594913). For giggles, but also because it captures the issue. – JBL Feb 09 '18 at 10:21
  • @JBL This code is part of a module that does not involve XML parsing (which is done elsewhere) except for just this one bit. I may have worded it incorrectly by saying "does not make sense" when it should have been more along the lines of "unable to use a parser here". – Jagdish Rapata Feb 09 '18 at 10:26
  • Make your pattern non-greedy: change `.*` to `.*?`. See [this link](https://regex101.com/r/F2T6B9/1) – Shakiba Moshiri Feb 09 '18 at 10:31
  • And read my answer here: [std regex_search to match only current line](https://stackoverflow.com/questions/46087665/std-regex-search-to-match-only-current-line/46098368#46098368) – Shakiba Moshiri Feb 09 '18 at 10:34
  • @ShakibaMoshiri Thank you! I'm getting the results perfectly now! Can you post this as an answer? Also, I'm slowly coming to understand that regex is not really the answer to many string parsing problems. My belief (without any real-world tests) has been that std::regex will perform better than a crudely written for-loop with .find()'s and iterators. – Jagdish Rapata Feb 09 '18 at 10:54

1 Answer


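First, make your pattern non-greedy. With the greedy .*, the match extends to the last double quote in the input. Change .* to .*? so that it matches as little as possible. The capture group then becomes:

"(.*?)"

As for the std::regex library, see this link, which describes my experience with it:

[std regex_search to match only current line](https://stackoverflow.com/questions/46087665/std-regex-search-to-match-only-current-line/46098368#46098368)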

Shakiba Moshiri
  • Thank you! While I see that using regex to parse HTML or XML is highly despised in this community, this at least gives me an approach I can work with. I have also taken the advice seriously and am trying to achieve this using basic std::string functions. – Jagdish Rapata Feb 09 '18 at 11:07
  • @JagdishRapata. You are welcome :). For such a thing it is fine to use regex. Just be aware that C++ regex has a lot of bugs. – Shakiba Moshiri Feb 09 '18 at 11:09