My app downloads the source of an HTML web page and then tries to extract the table rows (tr). My code:

QStringList linesPage1 = page1.split(QRegularExpression("<tr.*>"));

But when I do this:

qDebug() << linesPage1;

I get this:

("<table width=\"1085\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">", "")

When I try this code, it finds 31 occurrences:

qDebug() << page1.count(QRegularExpression("<tr.*>"));

I don't understand why it counts 31 occurrences but, on the other hand, doesn't split the string.

Alan Moore
ceriums
  • Please note that the part you are splitting on will get **removed** from the string! Could you post what the string looks like before splitting it? – Felix Nov 19 '15 at 17:47
  • The string is too big to be pasted here. But it is a classic HTML table. – ceriums Nov 19 '15 at 18:22

1 Answer

The problem is your regular expression. It matches a string that starts with <tr and ends with >, and because .* is greedy, it looks for the longest such string. In your case, the match starts at the first <tr and runs to the end of the document (because HTML ends with a >), assuming there are no line breaks in between, since . does not match a newline by default.

To avoid this, use: <tr[^>]*>. This way it will only match the <tr ...> tag itself, because any character except > is allowed in between.
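
To see the difference between the two patterns outside of Qt, here is a minimal sketch. It uses std::regex from the C++ standard library instead of QRegularExpression so that it builds without Qt; the greedy behaviour of .* is the same in both. The helper names `countMatches` and `splitOn` and the sample input mentioned below are made up for illustration, and `countMatches` counts non-overlapping matches only.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: number of non-overlapping matches of `pattern` in `text`.
std::size_t countMatches(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    const std::sregex_iterator begin(text.begin(), text.end(), re);
    const std::sregex_iterator end;
    return static_cast<std::size_t>(std::distance(begin, end));
}

// Hypothetical helper: split `text` on every match of `pattern`.
// The matched separators themselves are dropped from the result.
std::vector<std::string> splitOn(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    // Submatch index -1 selects the pieces *between* the matches.
    std::sregex_token_iterator it(text.begin(), text.end(), re, -1);
    const std::sregex_token_iterator end;
    return std::vector<std::string>(it, end);
}
```

For a single-line input such as `<table><tr class="a"><td>1</td></tr><tr><td>2</td></tr></table>`, `countMatches` finds one match for `<tr.*>` (it swallows everything from the first `<tr` to the final `>`) and two matches for `<tr[^>]*>`, so `splitOn` with the second pattern returns three pieces. One caveat: this sketch does not emit a trailing empty piece the way the QString::split output in the question does.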

Try using websites like https://regex101.com/#pcre to validate and test your regular expressions!

Felix
  • [Don't parse HTML with regex!](https://stackoverflow.com/a/1732454/399908) Simple counter-example: `hello world` -> first part would be `0) alert(i);">hello world` – Martin Hennings May 30 '18 at 12:55