79

I am trying to convert an HTML block to text using Python.

Input:

<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>

Desired output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa

Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

I tried the html2text module without much success:

#!/usr/bin/env python

import urllib2
import html2text
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())

txt = soup.find('div', {'class' : 'body'})

print(html2text.html2text(txt))

The txt object contains the HTML block above. I'd like to convert it to plain text and print it to the screen.

Aaron Bandelli
  • Do you have to use Python? `lynx -dump filename.html` will do this. http://lynx.browser.org/ Also, you could use an XPath expression and http://www.w3.org/Tools/HTML-XML-utils/. – Dave Jarvis Feb 04 '13 at 20:01

18 Answers

135

soup.get_text() outputs what you want:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

Output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

To keep newlines:

print(soup.get_text('\n'))

To be identical to your example, you can replace a newline with two newlines:

soup.get_text().replace('\n','\n\n')
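
For example, a minimal sketch applying this to the div from the question (it assumes the markup is already in a string named html, as above):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
div = soup.find('div', {'class': 'body'})
print(div.get_text().replace('\n', '\n\n'))
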
root
36

It's possible using the standard Python html.parser module:

from html.parser import HTMLParser

class HTMLFilter(HTMLParser):
    text = ""
    def handle_data(self, data):
        self.text += data

f = HTMLFilter()
f.feed(data)  # data is the HTML string to convert
print(f.text)
FrBrGeorge
  • This answer works great with no 3rd-party package dependencies! My PyCharm editor hinted that I need to use the ABC mixin to get rid of the "all abstract methods need to be implemented" error. https://gist.github.com/ye/050e898fbacdede5a6155da5b3db078d – Devy Nov 11 '19 at 17:36
  • Note that initializing the `text` class attribute and assigning to the `self.text` instance attribute is un-Pythonic, but it does work here due to the reassignment. If one were to use a mutable list instead (`pieces = []` and `self.pieces.append(data)`), all instances of the class would share the same list object. – akaihola May 15 '20 at 12:59
  • Great answer! Note that `html` is not available in the Python 2 standard library, so this solution only works for Python 3. – David Ross Oct 13 '21 at 12:13
  • I get some JavaScript in the returned text, but standard-library-only is still nice – ndemou Jun 25 '22 at 09:58
  • I'm getting "TypeError: can only concatenate str (not "AttribAccessDict") to str" with this now – yuletide Dec 21 '22 at 07:14
  • This doesn't work if there is an invalid closing HTML tag, though BeautifulSoup handles that case well. – discover Jun 30 '23 at 10:52
8

The main problem is keeping some basic formatting. Here is my own minimal approach that keeps newlines and bullets. I'm sure it isn't the solution for everything you might want to keep, but it's a starting point:

from bs4 import BeautifulSoup

def parse_html(html):
    elem = BeautifulSoup(html, features="html.parser")
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name in ['br',  'p', 'h1', 'h2', 'h3', 'h4','tr', 'th']:
            text += '\n'
        elif e.name == 'li':
            text += '\n- '
    return text


The above adds a newline for 'br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr', and 'th' elements, and a newline plus "- " in front of the text for 'li' elements.
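
A hypothetical call right after the function above (the input string is made up for illustration; the output starts with a newline contributed by the <p> tag):

print(parse_html('<p>First paragraph</p><ul><li>one</li><li>two</li></ul>'))
# First paragraph
# - one
# - two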

Andreas
7

You can use a regular expression, but it's not recommended. The following code removes all the HTML tags in your data, giving you the text:

import re

data = """<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>"""

data = re.sub(r'<.*?>', '', data)

print(data)

Output

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
ATOzTOA
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ;-) – Dave Jarvis Feb 04 '13 at 20:04
  • @DaveJarvis Lol... here the OP doesn't want to do anything with HTML as such, he just wants the HTML plucked out altogether. – ATOzTOA Feb 04 '13 at 20:06
  • Still, http://stackoverflow.com/a/1732454/517371 is very much relevant. There are more things wrong with `/<.*?>/` than I could possibly enumerate here in 600 characters. – Tobia Apr 02 '14 at 15:58
5

The '\n' places a newline between the paragraphs.

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "html.parser")
print(soup.get_text('\n'))
t-8ch
5

I liked @FrBrGeorge's no dependency answer so much that I expanded it to only extract the body tag and added a convenience method so that HTML to text is a single line:

from abc import ABC
from html.parser import HTMLParser


class HTMLFilter(HTMLParser, ABC):
    """
    A simple no dependency HTML -> TEXT converter.
    Usage:
          str_output = HTMLFilter.convert_html_to_text(html_input)
    """
    def __init__(self, *args, **kwargs):
        self.text = ''
        self.in_body = False
        super().__init__(*args, **kwargs)

    def handle_starttag(self, tag: str, attrs):
        if tag.lower() == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag.lower() == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.text += data

    @classmethod
    def convert_html_to_text(cls, html: str) -> str:
        f = cls()
        f.feed(html)
        return f.text.strip()           

See the docstring for usage.

This converts all of the text inside the body, which in theory could include style and script tags. Further filtering could be achieved by extending the pattern shown for body -- i.e. setting instance variables in_style or in_script.
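
A hedged sketch of that extension (the subclass and the in_script flag are not part of the original answer; they just follow the pattern described above):

class HTMLFilterNoScript(HTMLFilter):
    """Like HTMLFilter, but also drops text found inside <script> and <style> tags."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.in_script = False

    def handle_starttag(self, tag: str, attrs):
        super().handle_starttag(tag, attrs)
        if tag.lower() in ("script", "style"):
            self.in_script = True

    def handle_endtag(self, tag):
        super().handle_endtag(tag)
        if tag.lower() in ("script", "style"):
            self.in_script = False

    def handle_data(self, data):
        # only keep text that is inside <body> and outside <script>/<style>
        if self.in_body and not self.in_script:
            self.text += data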

Mark Chackerian
3

There are some nice things here, and I might as well throw in my solution:

from html.parser import HTMLParser

# Monkey-patch handle_data so every HTMLParser instance collects its text
def _handle_data(self, data):
    self.text += data + '\n'

HTMLParser.handle_data = _handle_data

def get_html_text(html: str):
    parser = HTMLParser()
    parser.text = ''
    parser.feed(html)

    return parser.text.strip()
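
A hypothetical call (the input string is made up; every data chunk the parser sees ends up on its own line):

print(get_html_text('<p>Hello <b>world</b></p>'))
# Hello
# world
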
dermasmid
3

There is a library called inscriptis. It is really simple and light, and it can get its input from a file or directly from a URL:

from inscriptis import get_text
text = get_text(html)
print(text)

The output is:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa

Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
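
A hedged sketch of feeding it a page fetched over HTTP; the URL is a placeholder, and only the standard library is used for the download:

from urllib.request import urlopen
from inscriptis import get_text

html = urlopen('http://example.com/page.html').read().decode('utf-8')
print(get_text(html))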

chicchera
2

gazpacho might be a good choice for this!

Input:

from gazpacho import Soup

html = """\
<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>
"""

text = Soup(html).strip(whitespace=False)  # to keep "\n" characters intact
print(text)

Output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
emehex
1

I was in need of a way of doing this on a client's system without having to download additional libraries. I never found a good solution, so I created my own. Feel free to use this if you like.

import urllib.request

def html2text(strText):
    str1 = strText
    int2 = str1.lower().find("<body")
    if int2>0:
       str1 = str1[int2:]
    int2 = str1.lower().find("</body>")
    if int2>0:
       str1 = str1[:int2]
    list1 = ['<br>',  '<tr',  '<td', '</p>', 'span>', 'li>', '</h', 'div>' ]
    list2 = [chr(13), chr(13), chr(9), chr(13), chr(13),  chr(13), chr(13), chr(13)]
    bolFlag1 = True
    bolFlag2 = True
    strReturn = ""
    for int1 in range(len(str1)):
      str2 = str1[int1]
      for int2 in range(len(list1)):
        if str1[int1:int1+len(list1[int2])].lower() == list1[int2]:
           strReturn = strReturn + list2[int2]
      if str1[int1:int1+7].lower() == '<script' or str1[int1:int1+9].lower() == '<noscript':
         bolFlag1 = False
      if str1[int1:int1+6].lower() == '<style':
         bolFlag1 = False
      if str1[int1:int1+7].lower() == '</style':
         bolFlag1 = True
      if str1[int1:int1+9].lower() == '</script>' or str1[int1:int1+11].lower() == '</noscript>':
         bolFlag1 = True
      if str2 == '<':
         bolFlag2 = False
      if bolFlag1 and bolFlag2 and (ord(str2) != 10) :
        strReturn = strReturn + str2
      if str2 == '>':
         bolFlag2 = True
      if bolFlag1 and bolFlag2:
        strReturn = strReturn.replace(chr(32)+chr(13), chr(13))
        strReturn = strReturn.replace(chr(9)+chr(13), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(32), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(9), chr(13))
        strReturn = strReturn.replace(chr(13)+chr(13), chr(13))
    strReturn = strReturn.replace(chr(13), '\n')
    return strReturn


url = "http://www.theguardian.com/world/2014/sep/25/us-air-strikes-islamic-state-oil-isis"    
html = urllib.urlopen(url).read()    
print html2text(html)
1

It's possible to use BeautifulSoup to remove unwanted scripts and similar, though you may need to experiment with a few different sites to make sure you've covered the different types of things you wish to exclude. Try this:

from requests import get
from bs4 import BeautifulSoup as BS
response = get('http://news.bbc.co.uk/2/hi/health/2284783.stm')
soup = BS(response.content, "html.parser")
for child in soup.body.children:
    if child.name == 'script':
        child.decompose()
print(soup.body.get_text())
Sarah Messer
1

I personally like the gazpacho solution by emehex, but it only uses a regular expression to filter out the tags, no more magic than that. This means that the solution keeps the text inside <style> and <script> tags.

So I would rather implement a simple solution based on regular expressions myself and use the standard Python 3.4+ library to unescape HTML entities:

import re
from html import unescape

def html_to_text(html):

    # use non-greedy for remove scripts and styles
    text = re.sub("<script.*?</script>", "", html, flags=re.DOTALL)
    text = re.sub("<style.*?</style>", "", text, flags=re.DOTALL)

    # remove other tags
    text = re.sub("<[^>]+>", " ", text)

    # strip whitespace
    text = " ".join(text.split())

    # unescape html entities
    text = unescape(text)

    return text

Of course, this is not as error-proof as BeautifulSoup or other parser-based solutions. But you don't need any 3rd-party package.
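
A hypothetical call (the input string is made up for illustration):

print(html_to_text('<p>Hello &amp; <script>var x = 1;</script>goodbye</p>'))
# Hello & goodbye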

quick
1

I don't know who wrote this library, but bless his/her heart.

Otobong Jerome
1

An updated answer based on Andreas' answer.

from bs4 import BeautifulSoup

def parse_html(html):
    elem = BeautifulSoup(html, features="html.parser")
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.get_text().strip()
        elif e.name in ['span']:
            text += ' '
        elif e.name in ['br',  'p', 'h1', 'h2', 'h3', 'h4', 'tr', 'th', 'div']:
            text += '\n'
        elif e.name == 'li':
            text += '\n- '
    return text

Why? Some XML code was still leaking in, spans were stripped out and didn't have enough spacing, and divs sometimes require more line breaks. Everything else is the same.

0

A two-step lxml-based approach with markup sanitizing before converting to plain text.

The script accepts either a path to an HTML file or piped stdin.

It will remove script blocks and any other possibly undesired text. You can configure the lxml Cleaner instance to suit your needs.

#!/usr/bin/env python3

import sys
from lxml import html
from lxml.html import tostring
from lxml.html.clean import Cleaner


def sanitize(dirty_html):
    cleaner = Cleaner(page_structure=True,
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src','color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div')
                  )

    return cleaner.clean_html(dirty_html)


if len(sys.argv) > 1:
  fin = open(sys.argv[1], encoding='utf-8')
else:
  fin = sys.stdin

source = fin.read()
source = sanitize(source)
source = source.replace('<br>', '\n')

tree = html.fromstring(source)
plain = tostring(tree, method='text', encoding='utf-8')

print(plain.decode('utf-8'))
ccpizza
0

I encountered the same problem using Scrapy. You may try adding this to settings.py:

#settings.py
FEED_EXPORT_ENCODING = 'utf-8'
Jaypee Tan
0
from lxml import html as html_module


def html_2_text(html_content):
    tree = html_module.fromstring(html_content)
    # text_list = tree.xpath('//text()')
    # text_list = tree.xpath('//text()[not(ancestor::script)]')
    text_list = tree.xpath('//text()[not(ancestor::script) and normalize-space()]')
    text_list = [text.strip() for text in text_list]
    return "\n".join(text for text in text_list if text!="")
-1
from html.parser import HTMLParser

class HTMLFilter(HTMLParser):
    text = ''
    def handle_data(self, data):
        self.text += f'{data}\n'

def html2text(html):
    filter = HTMLFilter()
    filter.feed(html)

    return filter.text

content = html2text(content_temp)  # content_temp holds the HTML string to convert