Context
I am screen scraping web content with QuotaXML SDK 1.6 so that the data can be displayed on the dashboard and on the iPhone.
The only mechanism QuotaXML offers for extracting table data is regex.
QuotaXML parses HTML tables using a three-step approach:
1. First it identifies the table, for example using "(?si)<table.*?>(.*?)</table>"
2. Second, within this parsed table it identifies rows, like "(?si)<tr.*?>(.*?)</tr>"
3. Third, within this row scope, individual cells are identified, like "(?si)<td.*?>(.*?)</td>"
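To make the three steps concrete, here is a minimal sketch in Python rather than in QuotaXML itself. The function name `parse_tables` and the sample HTML are hypothetical, and the cell pattern assumes plain <td> cells; how QuotaXML applies the patterns internally may differ.

```python
import re

# Patterns following the three steps described above.
# The <td> cell pattern is an assumption made for this sketch.
TABLE_RE = re.compile(r"(?si)<table.*?>(.*?)</table>")
ROW_RE = re.compile(r"(?si)<tr.*?>(.*?)</tr>")
CELL_RE = re.compile(r"(?si)<td.*?>(.*?)</td>")


def parse_tables(html):
    """Return a list of tables, each a list of rows, each a list of cell strings."""
    tables = []
    for table_body in TABLE_RE.findall(html):      # step 1: find each table
        rows = []
        for row_body in ROW_RE.findall(table_body):  # step 2: rows within the table
            rows.append(CELL_RE.findall(row_body))   # step 3: cells within the row
        tables.append(rows)
    return tables


sample = (
    "<table>"
    "<tr><td>05-01-2011</td><td>0123456789</td></tr>"
    "<tr><td>05-01-2011</td><td>08004913</td></tr>"
    "</table>"
)
print(parse_tables(sample))
```

Each step only sees the text captured by the previous one, which is why the row pattern is the only place where unwanted rows can be filtered out.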
The problem
The source HTML contains some rows that are not relevant data, such as lines or images that span the full table width using a colspan.
Some tables also contain data cells that are not relevant to the data lines needed, such as call detail records that include calls to freephone numbers (here 0800 and 00800 numbers), which are not subtracted from the minutes in your plan.
In other words, (.*?) must not match ' colspan="', '>0800' or '>00800'.
In code:
exclude:<tr><td colspan="2"></td></tr>
include:<tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>
exclude:<tr><td>05-01-2011</td><td>08004913</td></tr>
include:<tr><td>05-01-2011</td><td>0123456789</td></tr>
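The requirement can also be written as an executable check. The sketch below uses Python's `re` module and a hypothetical row pattern in which the capture is built one character at a time, with a negative lookahead before each character; it assumes the target engine supports negative lookahead the way Python does, which has not been verified for QuotaXML.

```python
import re

# Hypothetical row pattern for this sketch: the captured body may not run
# past "</tr>" and may not pass over any of the three unwanted substrings.
ROW_RE = re.compile(
    r'(?si)<tr\b[^>]*>((?:(?!</tr>| colspan="|>0800|>00800).)*)</tr>'
)

# The four example rows from above, concatenated as they would appear in a table.
html = (
    '<tr><td colspan="2"></td></tr>'
    "<tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>"
    "<tr><td>05-01-2011</td><td>08004913</td></tr>"
    "<tr><td>05-01-2011</td><td>0123456789</td></tr>"
)

kept = ROW_RE.findall(html)
for row in kept:
    print(row)
```

With these four rows, only the header row and the 0123456789 row are captured. Because the lookahead is re-checked at every character, an unwanted substring anywhere in the row stops the match before it can reach that row's </tr>.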
Homework done
Even my first, deliberately simple attempts to exclude only colspan all fail:
(?si)<tr.*?>(?!colspan)(.*?)</tr>
(?si)<tr.*?>(.*?)(?!colspan)</tr>
(?si)<tr.*?>.*?[^colspan].*?</tr>
(?si)<tr(\s[^>]*)?>.*?(?!colspan).*?</tr>
(?si)<tr(\s[^>]*)?>.*?(!colspan).*?</tr>
(?si)<tr(\s[^>]*)?>(.*?)(?!colspan)</tr>
(?si)<tr.*?>^(?!.*?colspan=").*?</tr>
The question "How to negate specific word in regex?" seems related, though these suggestions don't result in a match at all:
(?si)<tr.*?>(.(?<!colspan))*?</tr>
(?si)<tr.*?>(?!.*colspan).*</tr>
The explanation of positive and negative lookarounds at http://www.regular-expressions.info/lookaround.html did not give me the clue either.
How should I correctly write this regex?