Python, get text value of an html document

Question

My question is very simple: I have a string that contains HTML tags and I just want to get the actual text value from that string. Example:

html string:

<strong><p> hello </p><p> world </p></strong>

text value: hello world

Is there a function that can do that?

score 3 · Accepted Answer · edited May 23 '17 at 12:21

You can use BeautifulSoup's get_text() method:

from bs4 import BeautifulSoup


text = "<strong><p> hello </p><p> world </p></strong>"

# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup(text, "html.parser")
print(soup.get_text())  # prints " hello  world "
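Note that the output above still carries the whitespace from the markup, while the question asks for exactly "hello world". One way to tidy it up is to collapse runs of whitespace afterwards. The sketch below hard-codes the string from the example so it runs without bs4 installed; collapse_whitespace is a made-up helper name, not part of any library.

```python
def collapse_whitespace(s):
    # str.split() with no arguments splits on any run of whitespace
    # and drops the empty pieces at both ends.
    return " ".join(s.split())


raw = " hello  world "  # the string get_text() returned in the example
print(collapse_whitespace(raw))  # prints "hello world"
```

get_text() also accepts separator and strip arguments, so soup.get_text(" ", strip=True) should give a similar result in one call.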

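Or, with NLTK 2.x, you can use nltk.clean_html() (it was removed in NLTK 3.0, where calling it raises NotImplementedError):

import nltk


text = "<strong><p> hello </p><p> world </p></strong>"
print(nltk.clean_html(text))  # NLTK 2.x only; prints "hello world"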

Another option is html2text, but it behaves a bit differently: it converts the HTML to Markdown rather than stripping it, so e.g. strong is rendered with asterisks.

Also see relevant thread: Extracting text from HTML file using Python
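If installing a third-party package is not an option, the standard library's html.parser module can do a similar job. This is only a minimal sketch: TextExtractor and html_to_text are names invented for the example, and it assumes that joining text nodes with a single space is acceptable.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join the text nodes with spaces, then collapse whitespace runs.
    return " ".join(" ".join(parser.chunks).split())


text = "<strong><p> hello </p><p> world </p></strong>"
print(html_to_text(text))  # prints "hello world"
```

Two caveats: the contents of script and style elements are treated as text too, and inline tags inside a word (he<b>ll</b>o) end up separated by spaces.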

Hope that helps.

answered Aug 27 '13 at 19:00 by alecxe

Thanks, BeautifulSoup's function works well, but one more question: when I try to print the resulting text I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 47: ordinal not in range(128). P.S.: I'm working with French text that contains accents. – Rachid O Aug 27 '13 at 19:14

Never mind, I found the solution here: http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20 – Rachid O Aug 27 '13 at 19:29
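For readers who hit the same error: it appears when accented text has to be encoded as ASCII, and one common fix is to encode it explicitly as UTF-8. A small sketch, assuming Python 3 and using a made-up sample string:

```python
text = "caf\u00e9"  # hypothetical sample; U+00E9 is the character from the traceback

try:
    text.encode("ascii")  # ASCII has no representation for U+00E9
except UnicodeEncodeError as exc:
    print("ascii failed:", exc.reason)

encoded = text.encode("utf-8")  # UTF-8 can represent any Unicode character
print(encoded)  # prints b'caf\xc3\xa9'
```

In Python 3, print() normally handles this by itself when the terminal uses UTF-8; if it does not, setting the PYTHONIOENCODING environment variable to utf-8 is another option.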
