I'm using a regex in Python to grab data from HTML lines like this one:
<td xyz="123"><a href="blah.html">This is a line</a></td>
The problem is that in the td above, both the xyz="123" attribute and the <a href> link are optional, so they do not appear in every table cell. So I can have rows like this:
<tr><td>New line</td></tr>
<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>
I wrote a regex like this:
<tr><td x?y?z?=?"?(\d\d\d)?"?>?<?a?.*?>?(.*?)?<?/?a?>?</td></tr>
I basically want to capture the "123" data (if present) and the "CaptureThis" data from all tds in each tr.
This regex is not working: it skips the lines without the "xyz" data.
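A minimal script to reproduce this, assuming the two sample rows above are the whole input. The names pattern, rows and results are only for illustration, and I'm assuming re.search is applied to one row at a time:

```python
import re

# The pattern from above, unchanged.
pattern = re.compile(
    r'<tr><td x?y?z?=?"?(\d\d\d)?"?>?<?a?.*?>?(.*?)?<?/?a?>?</td></tr>'
)

# The two sample rows from above.
rows = [
    '<tr><td>New line</td></tr>',
    '<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>',
]

# One search per row; an entry is None when the pattern does not match at all.
results = [pattern.search(row) for row in rows]

for row, match in zip(rows, results):
    print(row, '->', match.groups() if match else None)
```

The row without xyz gives no match at all, which may be related to the literal space after "<td" in the pattern: that space is required, while the plain row has "<td>" with nothing in between.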
I know regex is not the right tool for this, but I was wondering if it could be done with regex alone.