Context The case is screen scraping web content with QuotaXML SDK 1.6, so that the data can finally be displayed on the dashboard and on the iPhone. The QuotaXML tool offers only regex for extracting table data. It parses HTML tables using a three-step approach:

  1. First it identifies the table, for example using "(?si)<table.*?>(.*?)</table>"
  2. Second, within this parsed table it identifies rows, like "(?si)<tr.*?>(.*?)</tr>"
  3. Third, within this row scope, individual cells are identified like "(?si)<tr.*?>(.*?)</tr>"
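To make the three-step approach concrete, here is a minimal Python sketch of the same table, row, cell nesting. It is not QuotaXML code: the function name `parse_tables` and the sample markup are made up for illustration, and the cell pattern uses `<td>`, which is an assumption about what the third step is meant to match.

```python
import re

# Table and row patterns are the ones quoted above.
# The <td> cell pattern is an assumption made for this sketch.
TABLE_RE = re.compile(r'(?si)<table.*?>(.*?)</table>')
ROW_RE = re.compile(r'(?si)<tr.*?>(.*?)</tr>')
CELL_RE = re.compile(r'(?si)<td.*?>(.*?)</td>')


def parse_tables(html):
    """Return a list of tables, each a list of rows, each a list of cell strings."""
    return [
        [CELL_RE.findall(row) for row in ROW_RE.findall(table)]
        for table in TABLE_RE.findall(html)
    ]


# Hypothetical sample: one data row and one full-width colspan row.
sample = (
    '<table><tr><td>05-01-2011</td><td>0123456789</td></tr>'
    '<tr><td colspan="2"></td></tr></table>'
)
tables = parse_tables(sample)
print(tables)  # [[['05-01-2011', '0123456789'], ['']]]
```

Note that the unfiltered row pattern happily returns the colspan row as a row with one empty cell, which is exactly the kind of row that has to be excluded.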

The problem The source HTML contains some rows that are not relevant data, like lines or images that span the full table width using a colspan. Or tables contain data cells that are not relevant to the data lines needed, like call detail records that also contain calls to freephone numbers, which are not subtracted from the minutes in your plan; in this case 0800 and 00800 numbers. In other words, (.*?) must not match ' colspan="', nor '>0800', nor '>00800'.

In code:

exclude:<tr><td colspan="2"></td></tr>
include:<tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>
exclude:<tr><td>05-01-2011</td><td>08004913</td></tr>
include:<tr><td>05-01-2011</td><td>0123456789</td></tr>

Homework done Even my first (start simple) attempts, which only try to exclude colspan, all fail:

  1. (?si)<tr.*?>(?!colspan)(.*?)</tr>
  2. (?si)<tr.*?>(.*?)(?!colspan)</tr>
  3. (?si)<tr.*?>.*?[^colspan].*?</tr>
  4. (?si)<tr(\s[^>]*)?>.*?(?!colspan).*?</tr>
  5. (?si)<tr(\s[^>]*)?>.*?(!colspan).*?</tr>
  6. (?si)<tr(\s[^>]*)?>(.*?)(?!colspan)</tr>
  7. (?si)<tr.*?>^(?!.*?colspan=").*?</tr>
  8. (?si)<tr.*?>(.(?<!colspan))*?</tr>
  9. (?si)<tr.*?>(?!.*colspan).*</tr>

Attempt 7 is based on the question "How to negate specific word in regex?", which seems related, but those suggestions don't result in a match at all. The explanation of positive and negative lookarounds at http://www.regular-expressions.info/lookaround.html did not give me the clue either.

How should I correctly write this regex?

Pro Backup
  • Just **don't** parse HTML with regexp: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Ondrej Tucny Mar 02 '11 at 22:59
  • I am aware that I shouldn't parse HTML with a regex. As explained in the question, the tool does not give other options than using regex. – Pro Backup Mar 02 '11 at 23:34

2 Answers


The first problem you're having is that your original expressions are very fragile. The ".*?>" is intended to match everything up to the earliest ">", but it will actually extend to a later ">" if the rest of the expression fails and the engine backtracks.

Use a construct like "[^>]*>" instead.
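A small Python sketch of that backtracking effect, using one of the rows from the question. The lookahead `(?!<td colspan)` is only a stand-in chosen for this illustration; the point is what happens to the opening-tag part of the pattern when the lookahead fails.

```python
import re

row = '<tr><td colspan="2"></td></tr>'

# Fragile: when the lookahead fails right after "<tr>", the engine backtracks
# and lets ".*?" swallow '><td colspan="2"', so the "opening tag" now ends at
# the ">" of the <td> and the lookahead is tested after it instead.
fragile = re.search(r'(?si)<tr.*?>(?!<td colspan)(.*?)</tr>', row)

# Strict: "[^>]*" cannot cross a ">", so the tag must end at the first ">"
# and the failed lookahead really rejects the row.
strict = re.search(r'(?si)<tr[^>]*>(?!<td colspan)(.*?)</tr>', row)

print(fragile.group(1))  # </td>
print(strict)            # None
```

The fragile pattern still "matches" the row that was supposed to be excluded, it just captures the wrong part of it.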

The second problem is that you're misinterpreting the meaning of the negative lookahead: it's not checking that the given pattern does not occur ahead of its position -- it's looking ahead of its position to check that the pattern does not occur AT THAT POSITION.
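To illustrate the difference in Python (a sketch on a single row from the question, not the final pattern): a bare `(?!colspan)` only looks at the characters starting exactly where the lookahead sits, while `(?!.*colspan)` lets the lookahead scan forward from that position.

```python
import re

row = '<tr><td colspan="2"></td></tr>'

# (?!colspan) only inspects the text starting right after "<tr>", which is
# "<td col...", not "colspan", so the assertion succeeds and the row matches.
at_position = re.search(r'(?si)<tr[^>]*>(?!colspan)(.*?)</tr>', row)

# (?!.*colspan) allows any characters before "colspan" inside the lookahead,
# so it finds the word further along and rejects the row.
anywhere = re.search(r'(?si)<tr[^>]*>(?!.*colspan)(.*?)</tr>', row)

print(at_position.group(1))  # <td colspan="2"></td>
print(anywhere)              # None
```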

With these two changes, your first attempt was very close to solving your test cases:

$ pcretest
PCRE version 7.8 2008-09-05

  re> /(?si)<tr[^>]*>(?!.*(colspan|>0?0800))(.*?)<\/tr>/
data> <tr><td colspan="2"></td></tr>
No match
data> <tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>
 0: <tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>
 1: <unset>
 2: <td><strong>Date</strong></td><td><strong>Time</strong></td>
data> <tr><td>05-01-2011</td><td>08004913</td></tr>
No match
data> <tr><td>05-01-2011</td><td>0123456789</td></tr>
 0: <tr><td>05-01-2011</td><td>0123456789</td></tr>
 1: <unset>
 2: <td>05-01-2011</td><td>0123456789</td>

Note this will still fail to solve the whole problem, because the presence of a "colspan" or an 0800 number anywhere later in the string, even in a following row, will block the match. You need further test cases, such as:

$ pcretest
PCRE version 7.8 2008-09-05

  re> /(?si)<tr[^>]*>(?!.*(colspan|>0?0800))(.*?)<\/tr>/
data> <tr><td>05-01-2011</td><td>0123456789</td></tr><tr><td colspan="2"></td></tr>
No match

So you need to ensure that the negative lookahead never crosses the next </tr>:

$ pcretest
PCRE version 7.8 2008-09-05

  re> /(?si)<tr[^>]*>(?!((?!<\/tr).)*(colspan|>0?0800))(.*?)<\/tr>/
data> <tr><td colspan="2"></td></tr>
No match
data> <tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>
 0: <tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>
 1: <unset>
 2: <unset>
 3: <td><strong>Date</strong></td><td><strong>Time</strong></td>
data> <tr><td>05-01-2011</td><td>08004913</td></tr>
No match
data> <tr><td>05-01-2011</td><td>0123456789</td></tr>
 0: <tr><td>05-01-2011</td><td>0123456789</td></tr>
 1: <unset>
 2: <unset>
 3: <td>05-01-2011</td><td>0123456789</td>
data> <tr><td>05-01-2011</td><td>0123456789</td></tr><tr><td colspan="2"></td></tr>
 0: <tr><td>05-01-2011</td><td>0123456789</td></tr>
 1: <unset>
 2: <unset>
 3: <td>05-01-2011</td><td>0123456789</td>
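For readers who want to try this outside pcretest, here is a rough Python equivalent of the last pattern, applied to all the test rows in one string. It is a sketch under a few assumptions: the inner groups are made non-capturing so the row content lands in group 1 rather than group 3, the "/" is left unescaped because Python has no pattern delimiters, and the names `ROW_RE` and `table` are made up for the example.

```python
import re

ROW_RE = re.compile(
    r'(?si)<tr[^>]*>'                        # opening tag, cannot cross a ">"
    r'(?!(?:(?!</tr).)*(?:colspan|>0?0800))' # no excluded text before the next </tr
    r'(.*?)</tr>'                            # row content up to the closing tag
)

# The four rows from the question plus a trailing colspan row.
table = (
    '<tr><td colspan="2"></td></tr>'
    '<tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>'
    '<tr><td>05-01-2011</td><td>08004913</td></tr>'
    '<tr><td>05-01-2011</td><td>0123456789</td></tr>'
    '<tr><td colspan="2"></td></tr>'
)

rows = [m.group(1) for m in ROW_RE.finditer(table)]
for r in rows:
    print(r)
# <td><strong>Date</strong></td><td><strong>Time</strong></td>
# <td>05-01-2011</td><td>0123456789</td>
```

The inner `(?:(?!</tr).)*` consumes one character at a time only while the text at that point does not start with `</tr`, which is what keeps the lookahead inside the current row.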

At which point one may wonder whether RegExps are the right tool for this particular problem :-)

jsalvata
  • Perfect, thanks for the rapid and fitting answer, even including step by step explanation. – Pro Backup Mar 03 '11 at 00:09
  • With your help I could also build the inverse regex, selecting each <tr>...</tr> block that contains a specific string. In this example the match for ">Sms" is done with `(?si)<tr[^>]*>(?:((?!<\/tr).)*>Sms<\/td>)(.*?)` – Pro Backup Mar 03 '11 at 15:56

I do not know if this is why it is failing, but individual cells are td, not tr. This should work:

(?si)<td(?!colspan)>(.*)</td>
RedSoxFan
  • This is not why it is failing. The intention is to select all text between the start `<tr>` and end `</tr>` as long as that innerhtml does not contain specific string patterns. And it doesn't matter where exactly (to simplify). – Pro Backup Mar 02 '11 at 23:44
  • 'Pro Backup', I'll have to remember that name –  Mar 03 '11 at 01:20
  • After reading, I thought you wanted the information from the individual cells, not the rows. If it is in a row, then no, it doesn't matter; if it is in a cell, it does. – RedSoxFan Mar 03 '11 at 21:26