Extract href value from html string using QRegExp

Question

I am downloading a web page and I am trying to extract some values from it.

The places of the page that I am interested in are of this type:

<a data-track=\"something\" href=\"someurl\" title=\"Heaven\"><img src=\"somesource.jpg\" /></a>

and I need to extract the href (someurl) value. Note that there are multiple entries like the one above in the HTML string that I have and thus I will use a list to store all the URLs that I extract from the string.

This is what I've tried so far:

QString html_str=myfile();
QRegExp regex("<a data-track\\=\"something\" href\\=\".*(?=\" title)");
if(regex.indexIn(html_str) != -1){
    QStringList list;
    QString str;
    list = regex.capturedTexts();
    foreach(str,list)
        qDebug() << str.remove("<a data-track=\"something\" href=\"");
}

With the above code I get only one match (list.count() == 1). It contains the whole HTML string from the first occurrence of someurl to the end of the file, with every occurrence of <a data-track="something" href=" removed.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Moshe Shaham, Feb 22 '13 at 20:41

score 0 · Answer 1 · answered Feb 16 '14 at 15:05


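I'd do it like this (make sure you double-check your regex):

// [^\"]* stops at the href's closing quote; a greedy .* would run on to the last "\" title" in the string
QRegExp regex("<a data-track=\"something\" href=\"[^\"]*(?=\" title)");

// cap() is a member of QRegExp, not QString; cap(0) is the whole matched text
if (regex.indexIn(html_str) != -1)
    qDebug() << regex.cap(0).remove("<a data-track=\"something\" href=\"");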
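To illustrate why the pattern deserves a second look: `.*` is greedy, so when the string holds several links it extends to the last `" title`, which is consistent with the symptom described in the question. The sketch below contrasts that with a negated character class `[^"]*` inside a capture group. It uses std::regex rather than QRegExp purely so that it compiles without Qt; the helper `firstHref` and the constants `kGreedy` and `kBounded` are hypothetical names made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text captured by group 1 of `pattern`
// for the first match in `html`, or an empty string when nothing matches.
std::string firstHref(const std::string &html, const std::string &pattern)
{
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(html, m, re))
        return m[1].str();   // group 1 = text between the href quotes
    return std::string();
}

// Greedy: ".*" extends to the LAST occurrence of `" title` in the input.
const std::string kGreedy =
    "<a data-track=\"something\" href=\"(.*)\" title";

// Negated class: stops at the first closing quote of the href value.
const std::string kBounded =
    "<a data-track=\"something\" href=\"([^\"]*)\" title";
```

With QRegExp the bounded pattern should behave the same way, with the URL available as regex.cap(1); calling regex.setMinimal(true) before matching is another way to make `.*` non-greedy.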

answered Feb 16 '14 at 15:05

Niklas


hugle · Answer 2 · 2014-02-21T05:34:09.007


You can use a while loop that advances the search position through the HTML string, so that each call to indexIn() finds the next match:

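int pos = 0;    // start searching at the beginning
// The parentheses matter: without them pos would be assigned the result of the != comparison
while ((pos = regex.indexIn(htmlContent, pos)) != -1) {
    // capturedTexts()[0] is the whole match; any capture groups follow
    QStringList list = regex.capturedTexts();
    foreach (const QString &url, list) {
        // do something
    }
    pos += regex.matchedLength();    // continue just past this match
}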
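The loop above moves `pos` past each match by hand. As a Qt-free sketch of the same idea, std::sregex_iterator performs that advance internally and a capture group isolates the URL. The function name `extractHrefs` is hypothetical, and the pattern assumes the attribute order shown in the question (data-track, then href, then title).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the href of every <a> tag in which
// data-track="something" directly precedes href, as in the question.
std::vector<std::string> extractHrefs(const std::string &html)
{
    static const std::regex re(
        "<a data-track=\"something\" href=\"([^\"]*)\" title");
    std::vector<std::string> urls;
    // Each iterator step resumes the search just past the previous match.
    for (std::sregex_iterator it(html.begin(), html.end(), re), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());   // group 1 = the URL
    }
    return urls;
}
```

In the QRegExp version, the equivalent step would be to append regex.cap(1) to a QStringList inside the while loop instead of iterating over capturedTexts().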

edited Feb 21 '14 at 05:34

answered Feb 21 '14 at 05:23

hugle

