
I am fairly new to regular expressions and am having difficulty using one to extract the data I am after. Specifically, I want to extract the date touched and the counter from the following:

<span style="color:blue;">&lt;query&gt;</span>
  <span style="color:blue;">&lt;pages&gt;</span>
    <span style="color:blue;">&lt;page pageid=&quot;3420&quot; ns=&quot;0&quot; title=&quot;Test&quot; touched=&quot;2011-07-08T11:00:58Z&quot; lastrevid=&quot;17889&quot; counter=&quot;9&quot; length=&quot;6269&quot; /&gt;</span>
    <span style="color:blue;">&lt;/pages&gt;</span>
  <span style="color:blue;">&lt;/query&gt;</span>
<span style="color:blue;">&lt;/api&gt;</span>

I am currently using vs2010. My current expression is:

std::tr1::regex rx("(?:.*touch.*;)?([0-9-]+?)(?:T.*count.*;)([0-9]+)(&.*)?");
std::tr1::regex_search(buffer, match, rx);

match[1] contains the following:

    2011-07-08T11:00:58Z&quot; lastrevid=&quot;17889&quot; counter=&quot;9&quot; length=&quot;6269&quot; /&gt;</span>
    <span style="color:blue;">&lt;/pages&gt;</span>
  <span style="color:blue;">&lt;/query&gt;</span>
<span style="color:blue;">&lt;/api&gt;</span>

match[2] contains the following:

6269&quot; /&gt;</span>
    <span style="color:blue;">&lt;/pages&gt;</span>
  <span style="color:blue;">&lt;/query&gt;</span>
<span style="color:blue;">&lt;/api&gt;</span>

I am looking for just "2011-07-08" in match[1] and just "9" in match[2]. The date format will never alter, but the counter will almost certainly be much larger.

Any help would be highly appreciated.

Artanthos
  • Extracting information out of XML documents is much easier (and more maintainable...) with XPath than with regular expressions. – Frerich Raabe Jul 27 '11 at 13:18
  • @Artanthos: Isn't this HTML? [Regardless, I believe this might apply.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – yarian Jul 27 '11 at 13:20
  • Yes, it is HTML. I am scraping usage data off a website and this is what I have available to work with. – Artanthos Jul 27 '11 at 15:25

2 Answers


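That's most likely because cmatch::operator[](int i) returns a sub_match, which is just a pair of pointers (first and second) into the source buffer. If you look at match[1].first on its own, for instance in the debugger or by printing it as a const char*, you see a string starting at the beginning of the match and ending at the end of the source string.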

Use sub_match::str(), i.e. match[1].str() and match[2].str().

Moreover, you'll need your expression to be more specific: .* is greedy, so it first consumes as much input as it can and then backtracks only as far as needed for the rest of the pattern to match.

Try std::tr1::regex rx("touched=&quot;([0-9-]+).+counter=&quot;([0-9]+)");.

You could even use non-greedy matchers (like +? and *?) to prevent excessive matching.

Raphaël Saint-Pierre

Try

std::tr1::regex rx("(?:.*touch.*;)?([0-9-]+)(?:T.*count.*;)([0-9]+)(&.*)?");

Removing the question mark after the + makes the quantifier greedy, so it will match as much as it can.

marc