I am trying to scrape a table like this:
<table><tr>
<td width="100"><p><span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt;">My title example:</span></p></td>
<td width="440"><p><span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt;">My text example.</span></p></td>
</tr>
<tr>
<td width="100"><p>My second title:</p></td>
<td width="440"><p>My <span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt; text-decoration: underline;">second</span> text example.</p></td>
</tr></table>
I want the output as a simple list of dictionaries like this:
[
{"title": "My title example", "text": "My text example"},
{"title": "My other example", "text": "My <u>second</u> text example"},
{"title": "My title example", "text": "My new example"},
]
But I need to sanitize the markup and convert the underlined sections to <u> tags. This is the code that I have so far:
from bs4 import BeautifulSoup
import re

# `html` is assumed to be the BeautifulSoup object parsed from the page source.
data = []
# Find the rows in the table
for table_row in html.select("table tr"):
    cells = table_row.findAll('td')
    if len(cells) > 0:
        row_title = cells[0].text.strip()
        paragraphs = []
        # Find all spans in a row
        for run in cells[1].findAll('span'):
            print(run)
            if "text-decoration: underline" in str(run):
                paragraphs.append("{0}{1}{2}".format("<u>", run.text, "</u>"))
            else:
                paragraphs.append(run.text)
        # Build up a sanitized string with all the runs.
        row_text = "".join(paragraphs)
        row = {"title": row_title, "text": row_text}
        data.append(row)
print(data)
The issue: as you may have noticed, this scrapes the row whose text is entirely inside spans (the first example) perfectly, but it fails on the second one, where it only picks up the underlined parts because the rest of the text is not inside span tags. So instead of trying to find spans, I was thinking I could strip all the tags and use a regex to replace only the ones I need, something like this:
# Find all runs in a row
for paragraph in cells[1].findAll('p'):
    # re.sub returns a new string, so keep the result
    text = re.sub('<.*?>', '', str(paragraph))
And that would create text with no tags, but also without underline formatting, and that's where I am stuck.
I don't know how to add such a condition to the regex. Any help is welcome.
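For illustration, one possible way to express that condition is a two-pass substitution: first rewrite the underlined spans as <u>...</u>, then strip every remaining tag except <u> and </u>. This is only a sketch. The names UNDERLINE_SPAN, OTHER_TAGS and strip_tags_keep_underline are hypothetical, and it assumes spans are not nested and that style attributes use double quotes, as in the sample table.

```python
import re

# Matches a <span> whose style mentions "text-decoration: underline" and
# captures its inner text. Assumes non-nested spans and double-quoted styles.
UNDERLINE_SPAN = re.compile(
    r'<span\b[^>]*\bstyle="[^"]*text-decoration:\s*underline[^"]*"[^>]*>(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)
# Matches any tag except <u> and </u>.
OTHER_TAGS = re.compile(r'<(?!/?u>)[^>]*>')


def strip_tags_keep_underline(fragment):
    """Return the fragment's text, keeping underlined spans as <u>...</u>."""
    # Pass 1: turn underlined spans into <u> tags.
    marked = UNDERLINE_SPAN.sub(r'<u>\1</u>', fragment)
    # Pass 2: drop every other tag, leaving the <u> markers in place.
    return OTHER_TAGS.sub('', marked).strip()


cell = (
    "<p>My <span style=\" font-family:'MS Shell Dlg 2'; font-size:8.25pt; "
    "text-decoration: underline;\">second</span> text example.</p>"
)
print(strip_tags_keep_underline(cell))
```

Note that any <u> tag already present in the input would also survive the second pass, and HTML entities are left as they are.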
Expected output: remove all tags from each paragraph, but replace spans where text-decoration: underline is found with <u></u> tags.
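To make that expected behaviour concrete, here is a rough sketch that walks the markup with a parser instead of a regex. It deliberately swaps in the standard library's html.parser in place of BeautifulSoup so that it runs without extra dependencies. The class UnderlineKeeper and the function cell_to_text are hypothetical names, and the underline check is a simple substring test on the style attribute.

```python
from html.parser import HTMLParser


class UnderlineKeeper(HTMLParser):
    """Collects text, wrapping the text of underlined spans in <u>...</u>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.span_stack = []  # one bool per open <span>: was it underlined?

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            # Ignore spacing and case so "text-decoration: underline" matches.
            underlined = "text-decoration:underline" in style.replace(" ", "").lower()
            self.span_stack.append(underlined)
            if underlined:
                self.parts.append("<u>")

    def handle_endtag(self, tag):
        # Close the <u> only if the matching <span> opened one.
        if tag == "span" and self.span_stack and self.span_stack.pop():
            self.parts.append("</u>")

    def handle_data(self, data):
        self.parts.append(data)


def cell_to_text(fragment):
    """Return the text of an HTML fragment with underlines kept as <u> tags."""
    parser = UnderlineKeeper()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return "".join(parser.parts).strip()


cell = (
    "<p>My <span style=\" font-family:'MS Shell Dlg 2'; font-size:8.25pt; "
    "text-decoration: underline;\">second</span> text example.</p>"
)
print(cell_to_text(cell))
```

Because every other tag is simply ignored, text outside spans is kept, which is the case the span-only loop misses.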