Python Regex parsing with HTML inside HTML cells

Question

Edit: I noticed that this has been downvoted as a duplicate. It is not one, because the duplicate's solution uses BeautifulSoup for parsing. I understand that BeautifulSoup is a better solution to this problem, but for the sake of learning, I have been trying to use regex.

I'm a novice with regex and am working on a Python-based regex parser for HTML tables. So far, I have managed to write patterns that correctly parse normal rows, cells, and headers, but I am looking to modify my regex to accommodate HTML within cells and headers. Essentially, I want to leave HTML code that sits inside a larger cell unevaluated, doing something like this:

import re
found = re.findall(isHeader, "<th>Student</th> Name</th>")
# desired result: found == ["Student</th> Name"]

After doing some research, I am trying to approach the problem using a look-ahead:

isHeader = r'<th\s*>([\S\s]*?)</th\s*>(?!(?:</th\s*>))'

This Regex is an attempt at isolating a string that begins with "<th>", and ends with "</th>", provided there are no more "</th>"s in that same pattern before the next pattern begins. The pattern successfully isolates "proper" headers (with no </th>s in the header itself), but fails to parse "improper" headers correctly, stopping the string at the first </th> found.
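The behaviour described above can be reproduced with a short script. This is a minimal sketch: the pattern is copied unchanged from the question, while the names `is_header_lazy`, `sample`, and `lazy_result` are only illustrative.

```python
import re

# Pattern from the question, unchanged (note the lazy quantifier *?).
is_header_lazy = r'<th\s*>([\S\s]*?)</th\s*>(?!(?:</th\s*>))'

sample = "<th>Student</th> Name</th>"

# The lazy group stops at the first "</th>". The negative look-ahead only
# inspects the text immediately after that closing tag, which is " Name",
# so the assertion passes and the short match is accepted.
lazy_result = re.findall(is_header_lazy, sample)
print(lazy_result)  # ['Student']
```

The look-ahead would only reject a match when a second `</th>` follows the first one directly, with nothing in between.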

I'm assuming my look-ahead has been implemented incorrectly. Any advice would be greatly appreciated.

Thank you!

score 2 · Accepted Answer · answered Nov 27 '17 at 04:54

How about something like this:

(?<=<th>).*(?=<\/th>)

Demo: https://regex101.com/r/HiL3Zi/1
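For reference, here is roughly how that pattern behaves in Python's `re` module. It is a sketch rather than a full solution: the sample strings and variable names are assumptions made for illustration, and the pattern is the one above.

```python
import re

# Look-behind / look-ahead pattern from the answer above.
pattern = r'(?<=<th>).*(?=<\/th>)'

sample = "<th>Student</th> Name</th>"
single_line = re.findall(pattern, sample)
print(single_line)  # ['Student</th> Name']

# "." does not match a newline by default, so a header that spans
# several lines needs re.DOTALL to be captured in full.
multiline = "<th>Student</th>\nName</th>"
without_dotall = re.findall(pattern, multiline)
with_dotall = re.findall(pattern, multiline, flags=re.DOTALL)
print(without_dotall)  # ['Student']
print(with_dotall)     # ['Student</th>\nName']
```

Because `.*` is greedy, the match runs to the last `</th>` it can reach, which is what keeps the inner `</th>` inside the result.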

Henry

Thanks for that, Henry. It works perfectly, except that it needs to follow the HTML convention of allowing unlimited whitespace after the "th" (i.e. "<th >"). I can easily account for that in the second group of your regex, but can't in the first because Python requires a fixed-width look-behind. Any idea how to get around that issue? Thanks again! – Ben Nov 27 '17 at 06:17
Analyzing yours, I'm thinking you might just need to make the regex "greedy" instead of "lazy": `([\S\s]*)<\/th\s*>(?!(?:<\/th\s*>))` Demo: https://regex101.com/r/HiL3Zi/2 – Henry Nov 27 '17 at 13:21
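
A possible way to combine the two comments above is sketched below. It assumes the opening tag is matched normally and the content is captured in a group, which avoids the look-behind altogether. The name `HEADER_GREEDY` and the sample strings are hypothetical, not taken from the thread.

```python
import re

# Python's built-in re module rejects variable-width look-behind,
# so (?<=<th\s*>) fails at compile time.
try:
    re.compile(r'(?<=<th\s*>).*(?=</th\s*>)')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)  # False

# Workaround sketch: consume the opening tag and capture the content.
# The quantifier is greedy, as suggested in the comment above.
HEADER_GREEDY = re.compile(r'<th\s*>([\S\s]*)</th\s*>(?!</th\s*>)')

one_header = HEADER_GREEDY.findall("<th  >Student</th> Name</th >")
print(one_header)  # ['Student</th> Name']

# Caveat: a greedy match also runs across separate headers in one string.
two_headers = HEADER_GREEDY.findall("<th>A</th><th>B</th>")
print(two_headers)  # ['A</th><th>B']
```

The last call shows the trade-off of the greedy approach: it keeps inner `</th>` text, but it cannot tell two adjacent headers apart, so it probably only fits when each header is matched against its own string.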
