I am downloading a web page and I am trying to extract some values from it.

The parts of the page that I am interested in look like this:

<a data-track=\"something\" href=\"someurl\" title=\"Heaven\"><img src=\"somesource.jpg\" /></a>

and I need to extract the href (someurl) value. Note that there are multiple entries like the one above in the HTML string that I have and thus I will use a list to store all the URLs that I extract from the string.

This is what I've tried so far:

QString html_str=myfile();
QRegExp regex("<a data-track\\=\"something\" href\\=\".*(?=\" title)");
if(regex.indexIn(html_str) != -1){
    QStringList list;
    QString str;
    list = regex.capturedTexts();
    foreach(str,list)
        qDebug() << str.remove("<a data-track=\"something\" href=\"");
}

With the above code I get only one match (list.count() == 1). It contains everything from the first occurrence of someurl to the end of the file, with every <a data-track="something" href=" prefix removed.

Alan Moore
hytromo

2 Answers

I'd do it like this (make sure you double-check your regex):

QRegExp regex("<a data-track=\"something\" href=\".*(?=\" title)");
regex.setMinimal(true);    // make .* non-greedy so the match stops at the first " title

if (regex.indexIn(html_str) != -1)
    qDebug() << regex.cap(0).remove("<a data-track=\"something\" href=\"");    // cap(0) is the whole match
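
To see why the regex deserves that double check, here is a minimal sketch in standard C++ (`std::regex` instead of `QRegExp`, only so that it compiles without Qt). The helper `firstCapture` and the constants `kGreedy` and `kBounded` are hypothetical names made up for this illustration, and the sketch assumes the page uses plain double quotes around attribute values.

```cpp
#include <regex>
#include <string>

// Greedy: .* runs as far as it can and only backtracks to the LAST `" title`.
const std::string kGreedy = R"re(<a data-track="something" href="(.*)" title)re";

// Bounded: [^"]* cannot cross a double quote, so it stops at the end of the href value.
const std::string kBounded = R"re(<a data-track="something" href="([^"]*)" title)re";

// Hypothetical helper: returns capture group 1 of the first match of `pattern`
// in `html`, or an empty string when there is no match.
std::string firstCapture(const std::string& html, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(html, m, re) ? m[1].str() : std::string();
}
```

On a string containing two such links, `firstCapture(html, kGreedy)` returns everything from the first URL up to the end of the second one, while `firstCapture(html, kBounded)` returns just the first URL.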
Niklas

You can use a while loop that advances the search position through html_str:

int pos = 0;    // start searching at the beginning
while ((pos = regex.indexIn(html_str, pos)) != -1) {    // parentheses matter: assign first, then compare
    QStringList list = regex.capturedTexts();    // index 0 is the whole match, followed by any capture groups
    foreach (QString url, list) {
        // do something
    }
    pos += regex.matchedLength();    // continue after the current match
}
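
The same "find, then continue after the match" idea can be sketched in standard C++ with `std::sregex_iterator`, which advances past each match for you. This is an illustration under assumptions, not the Qt API: `extractHrefs` is a hypothetical helper name, the pattern uses a capture group with `[^"]*` rather than the lookahead from the question, and the HTML is assumed to use plain double quotes.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the href value of every
// <a data-track="something" href="..." title=...> link found in `html`.
std::vector<std::string> extractHrefs(const std::string& html) {
    // Group 1 captures the href value; [^"]* keeps each match inside one attribute.
    static const std::regex re(R"re(<a data-track="something" href="([^"]*)" title)re");

    std::vector<std::string> urls;
    // A default-constructed iterator marks the end of the match sequence.
    for (std::sregex_iterator it(html.begin(), html.end(), re), end; it != end; ++it) {
        urls.push_back((*it)[1].str());
    }
    return urls;
}
```

In Qt the returned vector would correspond to the QStringList the question wants to fill, with one entry per link.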
hugle