
I have to parse specific HTML from a website. Here is part of it:

<div class="_ss">
    <div class="info">
        First info.
    </div>
    <div class="info">
        Second info.
    </div>
    <div class="info">
        Third info.
    </div>
</div>

I've defined a regular expression as follows:

QRegExp rx("<div class=\"info\">(.+)</div>");

It correctly matches all blocks, but the matched text includes all the subsequent blocks. For instance, for the second block it returns:

    <div class="info">
        Second info.
    </div>
    <div class="info">
        Third info.
    </div>
</div>

I thought I could just add ? to my regex to get the intended result:

QRegExp rx("<div class=\"info\">(.+?)</div>");

However, using this regex results in no match at all.

Iuliu
Efog
  • 1
    I've browsed the regex docs of Qt. Jumping to the [quantifiers section](http://qt.developpez.com/doc/4.7/QRegExp/#quantifiers), it seems there's no way to make your quantifier lazy/ungreedy, unlike in Perl-style regexes where you might add `?` after your quantifier. Reading the note in the quantifiers section, it seems you will need to use [`setMinimal()`](http://qt.developpez.com/doc/4.7/qregexp/#setminimal) – HamZa Nov 17 '14 at 17:58
  • 2
    Thanks! Can you write it in answers? I will accept it. – Efog Nov 17 '14 at 18:01
  • 2
    @Efog See this funny post about parsing html with regex - http://stackoverflow.com/a/1732454/492336 – sashoalm Nov 17 '14 at 18:19
  • I've tried QDomDocument, but it didn't even give me any `div`s using `elementsByTagName('div');` – Efog Nov 17 '14 at 18:22
  • Show how you use QDomDocument. – Pavel Strakhov Nov 17 '14 at 21:39
  • `QDomDocument d;` `d.parse(my_html);` `QDomNodeList divs = d.elementsByTagName('div');` `cout << divs.count(); //0` – Efog Nov 17 '14 at 23:01

1 Answer

According to the quantifiers section of the Qt regex docs, QRegExp has no way to make an individual quantifier lazy (ungreedy), unlike Perl-style regexes, where you can add ? after the quantifier. As the note in that section says, you need to call setMinimal() instead, which switches every quantifier in the pattern to non-greedy matching.

Here's a code sample:

#include <QRegExp>
#include <QString>
#include <QStringList>
#include <iostream>

QString str = "<div class=\"_ss\">\
        <div class=\"info\">\
            First info.\
        </div>\
        <div class=\"info\">\
            Second info.\
        </div>\
        <div class=\"info\">\
            Third info.\
        </div>\
    </div>"; // Some input

QStringList list;
int pos = 0;

QRegExp rx("<div class=\"info\">(.+)</div>");
rx.setMinimal(true); // Make our regex lazy/ungreedy

// Looping through our matches
while((pos = rx.indexIn(str, pos)) != -1){
    list << rx.cap(1); // Add group 1 to our list
    pos += rx.matchedLength();
}

// Looping and printing (uses its own index instead of reusing pos)
for(int i = 0; i < list.size(); ++i){
    std::cout << list.at(i).toStdString() << std::endl;
}

Note: The captured text includes the whitespace around each block's content, so you might need to trim the results, for example with QString::trimmed().
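
For comparison, the per-quantifier ? syntax tried in the question does work in engines that support lazy quantifiers. The following is only a minimal sketch that swaps QRegExp for the C++ standard library's std::regex (default ECMAScript grammar); the helper name extractInfoBlocks is hypothetical and not part of Qt or the code above. It also trims the surrounding whitespace from each capture:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (an illustration, not a Qt API): returns the trimmed
// text of every <div class="info"> block found in html.
std::vector<std::string> extractInfoBlocks(const std::string& html) {
    // [\s\S] matches any character, including newlines; +? makes the
    // quantifier lazy, so each match stops at the first closing </div>.
    static const std::regex rx(R"(<div class="info">([\s\S]+?)</div>)");

    std::vector<std::string> result;
    for (std::sregex_iterator it(html.begin(), html.end(), rx), end; it != end; ++it) {
        std::string text = (*it)[1].str();
        // Trim leading and trailing whitespace from the captured group.
        const char* ws = " \t\r\n";
        const auto first = text.find_first_not_of(ws);
        const auto last = text.find_last_not_of(ws);
        result.push_back(first == std::string::npos
                             ? std::string()
                             : text.substr(first, last - first + 1));
    }
    return result;
}
```

Called on the HTML from the question, this should return the three strings "First info.", "Second info." and "Third info.". Like the QRegExp version, it assumes the info blocks contain no nested div elements.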

HamZa