I am trying to scrape a table like this:

<table><tr>
<td width="100"><p><span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt;">My title example:</span></p></td>
<td width="440"><p><span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt;">My text example.</span></p></td>
</tr>
<tr>
<td width="100"><p>My second title:</p></td>
<td width="440"><p>My <span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt; text-decoration: underline;">second</span> text example.</p></td>
</tr></table>

To show the output in a simple list of dictionaries like this:

[
{"title": "My title example", "text": "My text example"},
{"title": "My other example", "text": "My <u>second</u> text example"},
{"title": "My title example", "text": "My new example"},
]

But I need to sanitize the markup and turn the underlined sections into <u> tags. This is the code that I have so far:

from bs4 import BeautifulSoup
import re
# `html` is assumed to be the already-parsed BeautifulSoup document
data = []
# Find the rows in the table
for table_row in html.select("table tr"):
    cells = table_row.findAll('td')
    if len(cells) > 0:
        row_title = cells[0].text.strip()
        paragraphs = []
        # Find all spans in a row
        for run in cells[1].findAll('span'):
            print(run)
            if "text-decoration: underline" in str(run):
                paragraphs.append("{0}{1}{2}".format("<u>", run.text, "</u>"))
            else:
                paragraphs.append(run.text)
        # Build up a sanitized string with all the runs.
        row_text = "".join(paragraphs)
        row = {"title": row_title, "text": row_text}
        data.append(row)
print(data)

The issue: As you may have noticed, it scrapes the row whose text is entirely inside spans perfectly (the first example), but it fails on the second one, where it only picks up the underlined part because the rest of the text is not inside span tags. So I was thinking that instead of trying to find spans, I would just strip all the tags with a regex and keep only the ones I need, something like this:

# Find all runs in a row
for paragraph in cells[1].findAll('p'):
    re.sub('<.*?>', '', str(paragraph))

That produces text with no tags, but it also loses the underline formatting, and that's where I am stuck.

I don't know how to add such a condition to the regex. Any help is welcome.

Expected output: Remove all tags from paragraph but replace spans where text-decoration: underline is found with <u></u> tags.

Saelyth

2 Answers

One idea would be to use .replace_with() to replace the "underline" span elements with the u elements and then use .encode_contents() to get the inner HTML of the "text" cells:

result = []
for row in soup.select("table tr"):
    title_cell, data_cell = row('td')[:2]

    for span in data_cell('span'):
        if 'underline' in span.get('style', ''):
            u = soup.new_tag("u")
            u.string = span.get_text()
            span.replace_with(u)
        else:
            # replacing the "span" element with its contents
            span.unwrap()

    # replacing the "p" element with its contents
    data_cell.p.unwrap()

    result.append({
        "title": title_cell.get_text(strip=True),
        # encode_contents() returns bytes, so decode it to get a plain string
        "text": data_cell.encode_contents().decode("utf-8")
    })
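
For trying this out end to end, here is a minimal self-contained sketch run against a trimmed copy of the sample table. The helper name table_to_rows, the SAMPLE_HTML constant, and the choice of the "html.parser" backend are assumptions made for illustration, not part of the original answer. It also skips rows with fewer than two cells and unwraps every p element rather than only the first one:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (font styles shortened).
SAMPLE_HTML = """
<table><tr>
<td width="100"><p><span style="font-size:8.25pt;">My title example:</span></p></td>
<td width="440"><p><span style="font-size:8.25pt;">My text example.</span></p></td>
</tr>
<tr>
<td width="100"><p>My second title:</p></td>
<td width="440"><p>My <span style="font-size:8.25pt; text-decoration: underline;">second</span> text example.</p></td>
</tr></table>
"""


def table_to_rows(html):
    """Hypothetical helper: return one dict per table row."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for row in soup.select("table tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # skip header or malformed rows
        title_cell, data_cell = cells[:2]

        for span in data_cell.find_all("span"):
            if "underline" in span.get("style", ""):
                # swap the underlined span for a real <u> element
                u = soup.new_tag("u")
                u.string = span.get_text()
                span.replace_with(u)
            else:
                # keep the text, drop the span wrapper
                span.unwrap()

        # drop every <p> wrapper, not only the first one
        for p in data_cell.find_all("p"):
            p.unwrap()

        result.append({
            "title": title_cell.get_text(strip=True),
            "text": data_cell.decode_contents().strip(),
        })
    return result


rows = table_to_rows(SAMPLE_HTML)
print(rows)
```

Here decode_contents() is used as the string-returning counterpart of encode_contents(), so no byte decoding step is needed.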
alecxe

When you find a <span> tag whose style contains text-decoration: underline, you can change its text to include the <u>...</u> tags using span.string = '<u>{}</u>'.format(span.text). After modifying the text, you can remove the <span> tag itself using unwrap().

result = []
for row in soup.select('table tr'):
    columns = row.find_all('td')
    title = columns[0]
    txt = columns[1]
    # `s` is None for spans that have no style attribute, so guard against that
    for span in txt.find_all('span', style=lambda s: s and 'text-decoration: underline' in s):
        span.string = '<u>{}</u>'.format(span.text)
        span.unwrap()

    result.append({'title': title.text, 'text': txt.text})

print(result)
# [{'title': 'My title example:', 'text': 'My text example.'}, {'title': 'My second title:', 'text': 'My <u>second</u> text example.'}]

Note: This approach won't actually change the tag. It modifies the string and removes the tag.
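
To make the note above concrete, here is a small sketch under a few assumptions: the one-row HTML snippet, the is_underlined helper name, and the "html.parser" backend are all made up for illustration. It shows that the inserted <u> is only literal text: it comes out intact through .text, but is escaped if the cell is serialized back to markup.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table: one underlined span and one span without a style.
html = (
    '<table><tr><td>Title:</td>'
    '<td><p>My <span style="text-decoration: underline;">second</span> '
    '<span>plain</span> text.</p></td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")
cell = soup.find_all("td")[1]


def is_underlined(style):
    # The filter may be called with None for spans lacking a style attribute.
    return style is not None and "text-decoration: underline" in style


for span in cell.find_all("span", style=is_underlined):
    # The "<u>" here is plain text inside the tree, not a real tag.
    span.string = "<u>{}</u>".format(span.text)
    span.unwrap()

as_text = cell.text                  # literal "<u>" characters survive
as_markup = cell.decode_contents()   # the same characters get escaped

print(as_text)
print(as_markup)
```

So this approach fits when the result is read through .text, as in the loop above; if the cell is going to be serialized as HTML, creating a real u element with new_tag() is the safer route.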

Keyur Potdar
  • What if you have a lot of spans, and only some of them are underline, and some others aren't? I think your code would add to every span, instead of just the ones underlined. – Saelyth Apr 14 '18 at 17:05
  • Yes, you are slightly right. If there are multiple span tags, this will modify only the first span tag. I'll edit my answer to handle multiple span tags. – Keyur Potdar Apr 14 '18 at 17:06
  • Have a look at the edit. This will change all the span tags which are underlined and do nothing to the others. – Keyur Potdar Apr 14 '18 at 17:11
  • I'm using it to parse a RichText editor with the only possibility to add underlined text, so yeah. There will be a lot of "runs" in a big text, different size of text, fonts, etc. The parser should just get the text and the underline, no matter what everything else is there. There could be several spans of different types on the columns. – Saelyth Apr 14 '18 at 17:12
  • I am getting `too many values to unpack` at `title, txt = row.find_all('td')` (maybe because the table actually has more columns, even though I only need to parse the first 2). – Saelyth Apr 14 '18 at 17:34
  • Yes, just use indices to assign them if there are more columns. I followed your example where there were only 2 columns. I'll make the edit. – Keyur Potdar Apr 14 '18 at 17:36
  • It works. I will accept your answer because your help throughout all my comments has been invaluable. Nice job! Also, keeping the structure of the code this way makes it easier for me to edit in the future. – Saelyth Apr 14 '18 at 17:53