Edit: I noticed that this has been downvoted as a duplicate. It is not one, because the duplicate's solution relies on BeautifulSoup for parsing. I understand that BeautifulSoup is a better solution to this problem, but for the sake of learning, I have been trying to use Regex.
I'm a novice with Regex and am working on a Python-based Regex parser for HTML tables. So far, I have managed to write patterns that correctly parse normal rows, cells, and headers, but I am looking to modify my Regex to accommodate HTML within cells and headers. Essentially, I want to leave HTML code that sits inside a larger cell unevaluated, doing something like this:
found = re.findall(isHeader, "<th>Student</th> Name</th>")
# desired result: found == ["Student</th> Name"]
After doing some research, I am trying to approach the problem using a look-ahead:
isHeader = r'<th\s*>([\S\s]*?)</th\s*>(?!(?:</th\s*>))'
This Regex is an attempt at isolating a string that begins with "<th>" and ends with "</th>", provided there are no more "</th>"s in that same pattern before the next pattern begins. The pattern successfully isolates "proper" headers (with no </th>s in the header itself), but fails to parse "improper" headers correctly, stopping the string at the first </th> found.
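The behaviour described above can be reproduced with a short script. This is a minimal sketch: the pattern is copied unchanged from the question, while the variable names `proper`, `improper`, and `adjacent` and the third test string are illustrative additions.

```python
import re

# Pattern from the question, unchanged.
isHeader = r'<th\s*>([\S\s]*?)</th\s*>(?!(?:</th\s*>))'

proper = "<th>Student</th>"              # no stray </th> inside the header
improper = "<th>Student</th> Name</th>"  # stray </th> followed by more text
adjacent = "<th>Student</th></th>"       # illustrative: stray </th> directly before the real one

proper_found = re.findall(isHeader, proper)
improper_found = re.findall(isHeader, improper)
adjacent_found = re.findall(isHeader, adjacent)

print(proper_found)    # ['Student']
print(improper_found)  # ['Student']  (stops at the first </th>)
print(adjacent_found)  # ['Student</th>']
```

The third case suggests what the negative look-ahead actually tests: it only rejects a closing tag when another `</th>` follows it immediately. In the `improper` string the first `</th>` is followed by " Name", so the look-ahead passes and the lazy match ends there.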
I'm assuming my look-ahead has been incorrectly implemented. Any advice would be greatly appreciated.
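One possible direction, offered as a sketch rather than a confirmed fix, is to swap the negative look-ahead for a positive one that accepts a closing `</th>` only when it is followed by something that can legitimately come after a header. The example below assumes that means the next `<th>`, a closing `</tr>`, or the end of the input. The name `isHeaderAlt` and that set of terminators are assumptions, and real-world HTML may need more cases.

```python
import re

# Hypothetical variant: accept </th> only if what follows (after optional
# whitespace) is another <th>, a </tr>, or the end of the string.
isHeaderAlt = r'<th\s*>([\S\s]*?)</th\s*>(?=\s*(?:<th\b|</tr\b|\Z))'

single = "<th>Student</th> Name</th>"
row = "<tr><th>Student</th> Name</th><th>Grade</th></tr>"

single_found = re.findall(isHeaderAlt, single)
row_found = re.findall(isHeaderAlt, row)

print(single_found)  # ['Student</th> Name']
print(row_found)     # ['Student</th> Name', 'Grade']
```

Because the capture group is still lazy, the engine keeps extending it past each `</th>` whose look-ahead fails, and stops at the first closing tag that is followed by one of the assumed terminators.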
Thank you!