Extracting text from HTML file using Python

Question

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.

For quite a while, people seem to be finding my NLTK answer (quite recent) to be extremely useful so, you might want to consider changing the accepted answer. Thanks! — Shatu, Oct 21 '13 at 18:49
I never thought I'd come across a question asked by the author of my favorite blog! The Endeavor! — Ryan G, Apr 30 '14 at 18:27
@Shatu Now that your solution has become no longer valid, you may want to delete your comment. Thanks! ;) — Sнаđошƒаӽ, Apr 05 '16 at 05:38

score 240 · Answer 1 · edited Mar 10 '21 at 19:27

The best piece of code I found for extracting text without picking up JavaScript or other unwanted content:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4
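
Several comments below suggest that `get_text` can do most of the whitespace cleanup itself. Here is a minimal sketch of that variant, assuming the page is already in memory as a string. The `html_to_text` name and the `sample` markup are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop script and style elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # separator joins the text nodes; strip=True trims each node and
    # skips nodes that contain only whitespace.
    return soup.get_text(separator="\n", strip=True)


# Hypothetical sample markup, not taken from the BBC page above.
sample = (
    "<html><head><title>Demo</title>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>Fish &amp; chips</p></body></html>"
)
print(html_to_text(sample))
```

Note that every text node ends up on its own line, so text inside inline tags such as `<b>` is split from its neighbours; whether that is acceptable depends on the page.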

edited Mar 10 '21 at 19:27 by MattDMo
answered Jul 07 '14 at 19:18 by PeYoTlL

3

What if we want to select a particular line, say line #3? – hepidad Aug 26 '14 at 19:19
4

The killing scripts bit, saviour!! – Nanda Nov 17 '14 at 02:12
5

After going through a lot of stackoverflow answers, I feel like this is the best option for me. One problem I encountered is that lines were added together in some cases. I was able to overcome it by adding a separator in get_text function: `text = soup.get_text(separator=' ')` – Joswin K J Sep 02 '15 at 09:54
7

Instead of `soup.get_text()` I used `soup.body.get_text()`, so that I don't get any text from the `<head>` element, such as the title. – Sjoerd Jan 15 '16 at 13:50
I needed soup.getText() – gogasca Jun 16 '16 at 18:20
How to extract the ** **,**<** symbols in the content – Ashok kumar Ganesan Dec 09 '16 at 07:44
10

For Python 3, `from urllib.request import urlopen` – Jacob Kalakal Joseph May 19 '17 at 07:48
This works great! Is there an easy way to extract all the links from the HTML as well, and keep them fairly in line with the corresponding text? – Arya Nov 22 '19 at 22:10
Perfect except for it doesn't break lines at `<br>` – VBobCat Jan 08 '20 at 21:18
1

Actually you can achieve the same clean result without these manual loops just using two additional standard parameters: `soup.get_text(separator='\n', strip=True)` – DemX86 Jun 16 '20 at 12:42
this seems to be painfully slow, is there any way to do this faster? – kodlan Jun 30 '20 at 14:50
For faster processing I ended up using selectolax lib. It's pretty limited and produced output with additional spaces which I had to remove manually. But it seems to be working much much faster. – kodlan Jun 30 '20 at 16:08
I get the following error when using your code @PeYoTIL: `Traceback (most recent call last): File "c:\Users\easy\Desktop\GreenMail\Main.py", line 15, in soup = BeautifulSoup(html, features="html.parser") File "C:\Users\easy\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\bs4\__init__.py", line 311, in __init__ markup = markup.read() io.UnsupportedOperation: not readable` – Samuel Nihoul Feb 03 '22 at 12:14
How can I add `BeautifulSoup` in Python 3.10 – Hungnn Feb 14 '23 at 14:39

score 173 · Accepted Answer · edited Jun 30 '18 at 20:54

html2text is a Python program that does a pretty good job at this.

edited Jun 30 '18 at 20:54 by Alireza Savand
answered Nov 30 '08 at 03:23 by RexE

7

but it's GPL 3.0, which means it may be incompatible – frog32 Nov 07 '12 at 10:35
173

Amazing! Its author is Aaron Swartz (RIP). – Atul Arvind Aug 10 '13 at 07:42
2

Did anyone find any alternatives to html2text because of GPL 3.0? – jontsai Sep 05 '14 at 01:21
1

GPL not as bad as people want it to be. Aaron knew best. – Stephan Kristyn Oct 13 '14 at 10:59
2

I tried both html2text and nltk but they didn't work for me. I ended up going with Beautiful Soup 4, which works beautifully (no pun intended). – Ryan Shea Apr 30 '15 at 18:58
I'm looking for a module for this. Is that what html2text is? – Ecko Feb 01 '16 at 18:31
1

This does not seem to work any more, any updates or suggestions? – David Andrei Ned Dec 15 '16 at 10:53
2

I know that's not (AT ALL) the place, but i follow the link to Aaron's blog and github profile and projects, and found myself very disturbed by the fact there is no mention of his death and it's of course frozen in 2012, as if time stopped or he took a very long vacation. Very disturbing. – julienfr112 Sep 08 '17 at 23:18

Shatu · Answer 3 · 2016-10-22T15:27:39.203

106

NOTE: NLTK no longer supports the clean_html function

Original answer below, and an alternative in the comments sections.

Use NLTK

I wasted 4-5 hours fixing issues with html2text. Luckily, I came across NLTK.
It works magically.

import nltk   
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"    
html = urlopen(url).read()    
raw = nltk.clean_html(html)  
print(raw)

edited Oct 22 '16 at 15:27
answered Nov 20 '11 at 12:34 by Shatu

1

It just removes HTML markup and does not process any tags (such as `<p>` and `<br/>`) or entities. – utapyngo Dec 11 '11 at 12:04
8

sometimes that is enough :) – Sharmila Jan 12 '12 at 10:42
8

I want to up vote this a thousand times. I was stuck in regex hell, but lo, now I see the wisdom of NLTK. – BenDundee Feb 22 '13 at 17:30
27

Apparently, clean_html is not supported anymore: https://github.com/nltk/nltk/commit/39a303e5ddc4cdb1a0b00a3be426239b1c24c8bb – alexanderlukanin13 Aug 22 '13 at 05:51
5

importing a heavy library like nltk for such a simple task would be too much – richie Oct 22 '13 at 09:38
62

@alexanderlukanin13 From the source: `raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")` – Chris Arena Apr 06 '14 at 06:34
@ChrisArena Yes good call, I switched to BeautifulSoup because of this. – Ryan Shea Apr 21 '15 at 20:08
clean_html() and clean_url() were cute functions in NLTK that were dropped since BeautifulSoup does a better job at parsing markup language, see https://github.com/nltk/nltk/commit/39a303e5ddc4cdb1a0b00a3be426239b1c24c8bb Here's BeautifulSoup's documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ – mikelowry Jun 07 '20 at 18:16

score 55 · Answer 4 · answered Oct 21 '10 at 13:14

Found myself facing just the same problem today. I wrote a very simple HTML parser to strip incoming content of all markups, returning the remaining text with only a minimum of formatting.

from HTMLParser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()
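
For readers on Python 3, where the module is `html.parser` rather than `HTMLParser`, a port might look roughly like the sketch below. It is an adaptation, not the answer's original code: the class is renamed `DeHTMLParser`, the error handling is left out, and it relies on `convert_charrefs=True` so that entities are decoded before `handle_data()` sees them.

```python
from html.parser import HTMLParser
import re


class DeHTMLParser(HTMLParser):
    """Collect text, turning <p> and <br> into line breaks."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such
        # as &amp; or &#39; before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self._parts = []

    def handle_data(self, data):
        # Collapse internal whitespace runs and skip empty chunks.
        text = re.sub(r"\s+", " ", data.strip())
        if text:
            self._parts.append(text + " ")

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts.append("\n\n")
        elif tag == "br":
            self._parts.append("\n")

    def text(self):
        return "".join(self._parts).strip()


def dehtml(markup):
    parser = DeHTMLParser()
    parser.feed(markup)
    parser.close()
    return parser.text()


# Made-up input to show entity decoding and the <br> line break.
print(dehtml("<p>Tom &amp; Jerry<br>It&#39;s a test</p>"))
```

Script and style contents are still passed through as text here, just as in the Python 2 version above.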

This seems to be the most straightforward way of doing this in Python (2.7) using only the default modules. Which is really silly, as this is such a commonly needed thing and there's no good reason why there isn't a parser for this in the default HTMLParser module. — Ingmar Hupp, Aug 17 '11 at 22:35
I don't think this will convert HTML entities into Unicode, right? For example, `&amp;` won't be converted into `&`, right? — speedplane, Nov 30 '12 at 08:14

Floyd · Answer 5 · 2021-03-01T11:52:59.077

22

I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ' '.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

edited Mar 01 '21 at 11:52
answered Oct 06 '16 at 15:08 by Floyd

2

To avoid a warning, specify a parser for BeautifulSoup to use: `text = ''.join(BeautifulSoup(some_html_string, "lxml").findAll(text=True))` – Floyd Oct 06 '16 at 15:14
1

You can use the stripped_strings generator to avoid excessive white-space - i.e. `clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)` – Fraser Apr 08 '18 at 04:53
1

I would recommend `' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)` with at least one space, otherwise a string such as `Please click <a>text</a> to continue` is rendered as `Please clicktextto continue` – am70 Feb 28 '21 at 21:03

score 14 · Answer 6 · answered May 07 '13 at 16:04

Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-html inverse converter.

"""
HTML <-> text conversions.
"""
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
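
The code above targets Python 2. In Python 3 the standard `html` module covers both directions of the entity conversion, so a rough equivalent of the charref/entity handling and of the escaping part of `text_to_html` could be sketched as follows. The sample strings are made up, and the URL-linking step is not reproduced.

```python
import html

# Decode named entities and numeric character references in one call.
decoded = html.unescape("Caf&#233; &amp; bar, it&#39;s &lt;open&gt;")
print(decoded)

# The inverse direction: escape the characters that are special in HTML.
# quote=True also escapes double and single quotes.
encoded = html.escape("<a href='x'>Tom & Jerry</a>", quote=True)
print(encoded)
```

As far as I know, `HTMLParseError` no longer exists in current Python 3 releases, so the `try`/`except` around `parser.feed()` would need to change in a port.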

python 3 version: https://gist.github.com/Crazometer/af441bc7dc7353d41390a59f20f07b51 — Crazometer, Sep 15 '16 at 10:00
In get_text, ''.join should be ' '.join. There should be an empty space, otherwise some of the texts will join together. — Obinna Nnenanya, Jan 20 '19 at 21:59
Also, this will not catch ALL texts, except you include other text container tags like H1, H2 ...., span, etc. I had to tweak it for a better coverage. — Obinna Nnenanya, Jan 21 '19 at 11:36

score 11 · Answer 7 · answered Feb 18 '18 at 13:36

I know there's plenty of answers here already but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web and this library has done an excellent job of achieving this so far in my tests. It ignores the text found in menu items and side bars as well as any JavaScript that appears on the page as the OP requests.

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

score 8 · Answer 8 · answered Sep 23 '09 at 03:21

You can use html2text method in the stripogram library also.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram run sudo easy_install stripogram

answered Sep 23 '09 at 03:21 by GeekTantra

24

This module, according to [its pypi page](http://pypi.python.org/pypi/stripogram), is deprecated: "Unless you have some historical reason for using this package, I'd advise against it!" – intuited Jul 24 '10 at 19:02

score 6 · Answer 9 · answered Nov 29 '12 at 19:28

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide what tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

answered Nov 29 '12 at 19:28 by Nuncjo

PyNEwbie · Answer 10 · 2018-08-16T13:18:06.083

6

PyParsing does a great job. The PyParsing wiki was killed, so here is another location with examples of the use of PyParsing (example link). One reason for investing a little time in pyparsing is that its author has also written a very brief, very well organized, and inexpensive O'Reilly Short Cut manual.

Having said that, I use BeautifulSoup a lot and it is not that hard to deal with the entities issues; you can convert them before you run BeautifulSoup.

Good luck

edited Aug 16 '18 at 13:18
answered Nov 30 '08 at 15:46 by PyNEwbie

1

The link is dead or soured. – Aug 11 '18 at 12:42

score 6 · Answer 11 · answered Aug 30 '16 at 11:21

If you need more speed and less accuracy, then you could use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

answered Aug 30 '16 at 11:21 by Hodza

score 5 · Answer 12 · answered May 18 '12 at 10:02

This isn't exactly a Python solution, but it will convert the text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like:

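import subprocess
import tempfile

# Write the HTML to a temporary file, then let Links render it as text.
with tempfile.NamedTemporaryFile('w', suffix='.html') as f:
    f.write(html_source)
    f.flush()
    proc = subprocess.run(['links', '-dump', f.name],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL)
text = proc.stdout.decode()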

score 4 · Answer 13 · answered Nov 25 '15 at 13:05

I recommend a Python package called goose-extractor. Goose will try to extract the following information:

Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags

More: https://pypi.python.org/pypi/goose-extractor/

score 4 · Answer 14 · answered Jan 16 '17 at 14:10

Has anyone tried bleach.clean(html,tags=[],strip=True) with bleach? It's working for me.

answered Jan 16 '17 at 14:10 by rox

1

Seems to work for me too, but they don't recommend using it for this purpose: "This function is a security-focused function whose sole purpose is to remove malicious content from a string such that it can be displayed as content in a web page." -> https://bleach.readthedocs.io/en/latest/clean.html#bleach.clean – Loktopus Jul 25 '18 at 20:03

score 4 · Answer 15 · answered Apr 05 '17 at 07:16

Install html2text using

pip install html2text

then,

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

score 4 · Answer 16 · answered Apr 06 '18 at 03:14

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

The results are really good

score 4 · Answer 17 · edited Jul 19 '15 at 00:57

Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console:

from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)

NB: HTMLError and HTMLParserError should both read HTMLParseError. This works, but does a bad job of maintaining line breaks. — Dave Knight, Apr 08 '14 at 08:09

score 3 · Answer 18 · answered Nov 30 '12 at 08:23

Beautiful soup does convert html entities. It's probably your best bet considering HTML is often buggy and filled with unicode and html encoding issues. This is the code I use to convert html to raw text:

import copy
import re

import BeautifulSoup  # the BeautifulSoup 3 package (Python 2)
def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

score 3 · Answer 19 · answered Dec 11 '15 at 04:11

Another non-python solution: Libre Office:

soffice --headless --invisible --convert-to txt input1.html

The reason I prefer this one over other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, Libre Office can be used to convert from all sorts of formats...

kodlan · Answer 20 · 2020-06-30T16:34:58.593

I had a similar question and actually used one of the BeautifulSoup answers. The problem was that it was really slow. I ended up using a library called selectolax. It's pretty limited, but it works for this task. The only issue was that I had to remove unnecessary whitespace manually. But it seems to be much faster than the BeautifulSoup solution.

from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='')
    text = " ".join(text.split())  # collapse every run of whitespace into a single space
    return text

score 2 · Answer 21 · answered Aug 08 '14 at 02:29

Another option is to run the html through a text based web browser and dump it. For example (using Lynx):

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.
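
If you would rather have the text as a Python string than in an output file, something along these lines might work. It is only a sketch: the `lynx_dump` name is made up, and it assumes `lynx` is on the PATH (it returns `None` otherwise).

```python
import shutil
import subprocess


def lynx_dump(path):
    """Return Lynx's plain-text rendering of an HTML file, or None if Lynx is missing."""
    if shutil.which("lynx") is None:
        return None
    result = subprocess.run(
        ["lynx", "-dump", path],
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `lynx_dump('html_to_convert.html')` should then return the same text that the shell command above writes to converted_html.txt.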

score 2 · Answer 22 · answered Dec 06 '16 at 15:06

@PeYoTIL's answer using BeautifulSoup and eliminating style and script content didn't work for me. I tried it using decompose instead of extract but it still didn't work. So I created my own which also formats the text using the <p> tags and replaces <a> tags with the href link. Also copes with links inside text. Available at this gist with a test doc embedded.

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

Thanks, this answer is underrated. For those of us who want to have a clean text representation that behaves more like a browser (ignoring newlines, and only taking paragraphs and line breaks into consideration), BeautifulSoup's `get_text` simply doesn't cut it. — jrial, Apr 17 '18 at 15:03
@jrial glad you found it useful, also thanks for the contrib. For anyone else, the gist linked has been enhanced quite a bit. What the OP seems to allude to is a tool which renders html to text, much like a text based browser like lynx. That's what this solution attempts. What most people are contributing are just text extractors. — racitup, Apr 18 '18 at 21:37
Completely underrated indeed, wow, thank you! Will check the gist too. — rimkashox, Jul 29 '21 at 09:43

score 2 · Answer 23 · answered May 07 '18 at 11:07

I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, hence the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run / deploy in a Docker container, and from there can be accessed via Python bindings.

Uri Goren · Answer 24 · 2019-01-21T19:46:24.607

While a lot of people mentioned using regex to strip HTML tags, there are a lot of downsides.

for example:

<p>hello&nbsp;world</p>I love you

Should be parsed to:

hello world
I love you

Here's a snippet I came up with; you can customize it to your specific needs, and it works like a charm:

import re
import html
def html2text(htm):
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE)
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret

score 1 · Answer 25 · edited May 16 '17 at 20:25

In Python 3.x you can do this very easily by importing the 'imaplib' and 'email' packages. This is an older post, but maybe my answer can help newcomers to it.

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1]) 
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():       
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

Now you can print body variable and it will be in plaintext format :) If it is good enough for you then it would be nice to select it as accepted answer.

This shows you how to extract a `text/plain` part from an email if somebody else put one there. It doesn't do anything to convert the HTML into plaintext, and does nothing remotely useful if you are trying to convert HTML from, say, a web site. — tripleee, Nov 27 '17 at 13:23

score 1 · Answer 26 · answered Jun 02 '16 at 15:04

in a simple way

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds every part of html_text that starts with '<' and ends with '>' and replaces each match with an empty string.
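
To see where this approach runs into trouble, here is the same substitution applied to a made-up snippet (not from any real page) that contains an entity and an inline script with a `<` in it:

```python
import re

# Hypothetical input: an entity plus an inline script containing "<".
html_text = "<p>Fish &amp; chips</p><script>if (a < b) { go(); }</script>"

text_filtered = re.sub(r'<(.*?)>', '', html_text)
print(repr(text_filtered))
```

This prints `'Fish &amp; chipsif (a '`: the entity is left undecoded and part of the script leaks into the text, which is the kind of problem the parser-based answers avoid.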

answered Jun 02 '16 at 15:04 by David Fraga

troymyname00 · Answer 27 · 2017-10-25T00:14:19.493

Here's the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

I hope that helps.

score 1 · Answer 28 · answered Apr 13 '18 at 11:03

You can extract only the text from HTML with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)

Mike Q · Answer 29 · 2019-08-28T18:29:11.763

Another example using BeautifulSoup4 in Python 2.7.9+

includes:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract() 
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

Explained:

Read in the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break the text into lines and remove leading and trailing space on each, then break multi-headlines into a line each with chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then, using text = '\n'.join, drop blank lines, and finally return the result encoded as UTF-8.

Notes:

On some systems this will fail for https:// connections because of an SSL issue; you can turn off certificate verification to work around that. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/
Python < 2.7.9 may have some issues running this.
text.encode('utf-8') can leave odd encoding artifacts; you may want to just return str(text) instead.
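
The whitespace cleanup described above can be tried in isolation. Below is a small sketch with the three cleanup lines pulled into a helper; the `tidy` name and the `raw` string are invented here and stand in for whatever `soup.get_text()` returns.

```python
def tidy(text):
    """Strip each line, split on double spaces, and drop empty pieces."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up input standing in for the result of soup.get_text().
raw = "  Health  \n\n   Story one    Story two \n"
print(tidy(raw))
```

This prints Health, Story one and Story two on separate lines: the blank line disappears and the two headlines separated by several spaces are split apart.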

Cam · Answer 30 · 2023-02-08T16:13:12.393

Answer using Pandas to get table data from HTML.

If you want to extract table data quickly from HTML, you can use the read_html function (docs are here). Before using this function you should read the gotchas/issues surrounding the BeautifulSoup4/html5lib/lxml HTML parsing libraries.

import pandas as pd

http = r'https://www.ibm.com/docs/en/cmofz/10.1.0?topic=SSQHWE_10.1.0/com.ibm.ondemand.mp.doc/arsa0257.htm'
table = pd.read_html(http)
df = table[0]
df

There are a number of options that can be played with; see here and here.

score 1 · Answer 31 · answered Sep 18 '22 at 20:12

If you want to automatically extract text passages from a webpage there are some python packages available such as Trafilatura. As part of its benchmarking several python packages have been compared:

https://github.com/adbar/trafilatura#evaluation-and-alternatives

html_text https://github.com/TeamHG-Memex/html-text
inscriptis https://github.com/weblyzard/inscriptis
newspaper3k
justext
boilerpy3 https://github.com/jmriebold/BoilerPy3
baseline
goose3 https://github.com/goose3/goose3
readability-lxml https://github.com/predatell/python-readability-lxml
news-please https://github.com/fhamborg/news-please
readabilipy https://github.com/alan-turing-institute/ReadabiliPy
trafilatura

score 0 · Answer 32 · answered Aug 07 '16 at 17:27

I do it something like this:

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text

Waqar Detho

I am using python 3.4 and this code is working fine for me. – Waqar Detho Oct 18 '16 at 20:54
text would have html tags in it – Ivelin Nov 29 '16 at 12:26
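
As the comment above points out, res.text still contains the markup. One hedged way to finish the job with only the standard library is an html.parser.HTMLParser subclass. The class and function names below are illustrative, not part of the original answer, and the result is cruder than what a browser would render (it knows nothing about layout or hidden elements).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs is True by default, so entities such as &#39; are decoded
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only non-blank text that is outside script/style elements
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


print(strip_tags("<p>It&#39;s <b>here</b></p><script>var x = 1;</script>"))
```

With the snippet above, strip_tags(res.text) would be applied to the downloaded page.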

score 0 · Answer 33 · edited May 15 '18 at 22:54

The LibreOffice writer comment has merit since the application can employ python macros. It seems to offer multiple benefits both for answering this question and furthering the macro base of LibreOffice. If this resolution is a one-off implementation, rather than to be used as part of a greater production program, opening the HTML in writer and saving the page as text would seem to resolve the issues discussed here.

score 0 · Answer 34 · answered Jul 06 '18 at 11:36

The Perl way (sorry mom, I'll never do it in production).

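import re

def html2text(html):
    # replace anything that looks like a tag with a space
    res = re.sub('<.*?>', ' ', html, flags=re.DOTALL)
    # collapse runs of newlines and drop carriage returns
    res = re.sub('\n+', '\n', res)
    res = re.sub('\r+', '', res)
    # collapse runs of tabs and spaces into a single space
    res = re.sub('[\t ]+', ' ', res)
    # collapse repeated "newline + space" pairs
    res = re.sub('(\n )+', '\n ', res)
    return res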

brunql

This is bad practice for so many reasons, for example ` ` – Uri Goren Jan 21 '19 at 11:11
Yes! It's true! Don't do it anywhere! – brunql Jan 22 '19 at 12:50
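
To make the objection concrete, here is a small sketch of inputs on which the tag-stripping regex goes wrong. These cases are illustrative and are not necessarily the example the commenter had in mind; the function name strip_tags_with_regex is made up for this sketch.

```python
import re


def strip_tags_with_regex(html):
    # same first substitution as in html2text() above
    return re.sub('<.*?>', ' ', html, flags=re.DOTALL)


# A ">" inside an attribute value ends the match too early, leaving debris behind
print(repr(strip_tags_with_regex('<a title=">">link</a>')))

# Script bodies are not tags, so the JavaScript source survives
print(repr(strip_tags_with_regex('<script>var x = 1;</script>hi')))

# Entities are left undecoded
print(repr(strip_tags_with_regex('<p>It&#39;s</p>')))
```

These are the same failure modes the question wants to avoid: stray markup, JavaScript source in the output, and uninterpreted HTML entities.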

Haider · Answer 35 · 2021-07-28T01:47:31.987

None of the methods here worked well on some websites. Paragraphs generated by JS code were resistant to all of the above. Here is what eventually worked for me, inspired by this answer and this.

The idea is to load the page in a webdriver and scroll to the end of the page so that the JS generates/loads the rest of the page. Then send keystrokes to select all and copy the whole page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import pyperclip
import time

driver = webdriver.Chrome()
driver.get("https://www.lazada.com.ph/products/nike-womens-revolution-5-running-shoes-black-i1262506154-s4552606107.html?spm=a2o4l.seller.list.3.6f5d7b6cHO8G2Y&mp=1&freeshipping=1")

# Scroll to the end of the page repeatedly until the page height stops growing,
# so that all JavaScript code gets a chance to load its content
scroll_js = "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;"
page_height = driver.execute_script(scroll_js)
while True:
    last_height = page_height
    time.sleep(1)
    page_height = driver.execute_script(scroll_js)
    if last_height == page_height:
        break

# Select all and copy from the web page
# (find_element_by_tag_name is no longer available in recent Selenium 4 releases)
element = driver.find_element(By.TAG_NAME, 'body')
element.send_keys(Keys.CONTROL, 'a')
element.send_keys(Keys.CONTROL, 'c')
alltext = pyperclip.paste()
alltext = alltext.replace("\n", " ").replace("\r", " ")  # flatten line breaks in the copied text
print(alltext)

It is slow, but nothing else worked.

UPDATE: A better method is to load the source of the page after scrolling to the end of the page, using the inscriptis library:

from inscriptis import get_text
text = get_text(driver.page_source)

This still does not work with a headless driver (the page somehow detects that it is not being shown in a real browser, so scrolling to the end does not make the JS code load its content), but at least we no longer need the crazy copy/paste, which prevents running multiple scripts on a machine with a shared clipboard.

Alan Hamlett · Answer 36 · 2023-07-18T17:25:59.737

I like using pyquery to solve this:

from pyquery import PyQuery as pq


def html_to_text(html):
    """Return a list of the visible utf8 text for some HTML string."""

    if not html:
        return []

    if not isinstance(html, pq):
        html = pq(html)

    skip = ['style', 'title', 'noscript', 'head', 'meta']

    text = []

    try:
        if html.tag and html.tag.lower() in skip:
            return []
    except AttributeError:
        pass

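    try:
        # Parse the inline style attribute into a dict; split on the first ":" only
        # so that values such as url(http://...) do not break the parsing
        style = dict([y.strip() for y in x.strip().split(":", 1)] for x in html.attr.style.split(";") if x.strip())
        if style["display"].lower() == "none":
            return []
    except (AttributeError, KeyError, ValueError):
        pass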

    for el in html:
        try:
            if not el.tag or el.tag.lower() in skip:
                continue
        except AttributeError:
            continue

        for child in el.getchildren():
            text.extend(html_to_text(child))

        if not el.text:
            continue

        text.append(el.text)

    return text


print(" ".join(html_to_text("<p>test</p>")))
