
There are many answers on how to convert HTML to text using BeautifulSoup (for example, https://stackoverflow.com/a/24618186/3946214).

There are also many answers on how to extract links from HTML using BeautifulSoup.

What I need is a way to turn HTML into a text-only version while keeping each link inline, next to the text it belongs to. For example, if I had some HTML that looked like this:

<div>Click <a href="www.google.com">Here</a> to receive a quote</div>

It would be nice to convert this to "Click Here (www.google.com) to receive a quote."

The use case is that I need to convert HTML emails into a text-only version, and it would be nice to keep the links where they are semantically located in the HTML instead of collecting them at the bottom. This exact syntax isn't required. I'd appreciate any guidance on how to do this. Thank you!

Andrej Kesely
Arya

2 Answers

import html2text

data = """
<div>Click <a href="www.google.com">Here</a> to receive a quote</div>
"""


print(html2text.html2text(data))

Output:

Click [Here](www.google.com) to receive a quote
  • Thank you, this library is great but unfortunately its license is not compatible with the company I work for :( – Arya Dec 10 '19 at 00:34
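If a third-party dependency is the obstacle, the same inline-link idea can be sketched with only the standard library's html.parser module. This is a minimal sketch, not a drop-in replacement for html2text: the names LinkPreservingTextExtractor and html_to_text_with_links are made up for illustration, and it assumes simple markup (it does not skip script/style content or normalize whitespace).

```python
from html.parser import HTMLParser


class LinkPreservingTextExtractor(HTMLParser):
    """Hypothetical helper: collects text and appends each link's href
    in parentheses right after the link text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments in document order
        self.href_stack = []  # hrefs of the <a> tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be missing
            self.href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append(" ({})".format(href))

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def html_to_text_with_links(html):
    parser = LinkPreservingTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(html_to_text_with_links(
    '<div>Click <a href="www.google.com">Here</a> to receive a quote</div>'
))
# Click Here (www.google.com) to receive a quote
```

Real email HTML would likely need more handling on top of this, for example dropping style blocks and turning block-level tags into line breaks.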

If you want a BeautifulSoup solution, you can start with this example (it will probably need more tuning for real-world data):

data = '<div>Click <a href="www.google.com">Here</a> to receive a quote.</div>'

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

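# append the href, in parentheses, to the end of each link's text
for a in soup.select('a[href]'):
    # Tag.append() attaches the string to the tree; mutating a.contents directly bypasses that bookkeeping
    a.append(' ({})'.format(a['href']))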

# unwrap() all tags
for tag in soup.select('*'):
    tag.unwrap()

print(soup)

Prints:

Click Here (www.google.com) to receive a quote.
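As one possible direction for the "more tuning" mentioned above, the same approach can be wrapped in a function that returns a plain string via get_text(), which gives decoded text (an entity such as &amp;amp; comes back as a plain &). The function name soup_to_text_with_links is hypothetical, and skipping links whose visible text already equals the href is an assumption about what reads well in an email, not something the question asks for.

```python
from bs4 import BeautifulSoup


def soup_to_text_with_links(html):
    """Hypothetical helper: plain text with each href kept inline after its link text."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select("a[href]"):
        href = a["href"]
        label = a.get_text().strip()
        # Assumption: if the link text already is the URL, repeating it only adds noise
        if href and label != href:
            a.append(" ({})".format(href))
    # get_text() concatenates all text nodes and drops the tags
    return soup.get_text()


print(soup_to_text_with_links(
    '<div>Click <a href="www.google.com">Here</a> to receive a quote.</div>'
))
# Click Here (www.google.com) to receive a quote.
```

Because get_text() already ignores the tags, the separate unwrap() pass is not needed in this variant.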
Andrej Kesely