My question is very simple: I have a string that contains HTML tags, and I just want to get the actual text value from that string. For example:

HTML string:

<strong><p> hello </p><p> world </p></strong>

text value: hello world

Is there a function that can do that?

Rachid O

1 Answer

You can use BeautifulSoup's get_text() function:

from bs4 import BeautifulSoup


text = "<strong><p> hello </p><p> world </p></strong>"

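# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, "html.parser")
print(soup.get_text())  # prints " hello  world "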
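
The output above keeps the whitespace from the original markup. If the goal is exactly "hello world", one possible sketch is to collapse whitespace runs with plain string methods. The helper name collapse_whitespace is made up for illustration, and the string literal stands in for the get_text() result:

```python
def collapse_whitespace(text):
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace, so joining with one space normalizes it.
    return " ".join(text.split())


raw = " hello  world "  # stand-in for the soup.get_text() output shown above
clean = collapse_whitespace(raw)
print(clean)  # prints "hello world"
```

get_text() also accepts a separator and a strip flag, so soup.get_text(" ", strip=True) should give a similar result without the extra helper.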

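Or, you can use nltk (note that clean_html() only works in NLTK 2.x; in NLTK 3.0 and later it raises NotImplementedError and points you to BeautifulSoup's get_text() instead):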

import nltk


text = "<strong><p> hello </p><p> world </p></strong>"
print(nltk.clean_html(text))  # prints "hello world"

Another option is to use html2text, but it behaves a bit differently: it converts the HTML to Markdown-style text, so e.g. strong is rendered with asterisks.

Also see the relevant thread: Extracting text from HTML file using Python
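
If installing a third-party package is not an option, the same idea can be sketched with only the standard library (Python 3). This is a rough illustration rather than a replacement for BeautifulSoup; the names TextExtractor and strip_tags are made up here:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join chunks with a space so adjacent blocks do not run together, then
    # collapse whitespace. Caveats: text split by inline tags also gets a
    # space, and the contents of <script>/<style> are not filtered out.
    return " ".join(" ".join(parser.chunks).split())


result = strip_tags("<strong><p> hello </p><p> world </p></strong>")
print(result)  # prints "hello world"
```

For messy real-world pages, a dedicated parser such as BeautifulSoup is likely to be more robust than this sketch.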

Hope that helps.

alecxe
  • Thanks, BeautifulSoup's function works well, but one more question: when I try to print the resulting text, I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 47: ordinal not in range(128). PS: I'm working with French text that contains accents. – Rachid O Aug 27 '13 at 19:14
  • Never mind, I found the solution here: http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20 – Rachid O Aug 27 '13 at 19:29
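
For readers who hit the same UnicodeEncodeError: it typically means the text was implicitly encoded as ASCII, which cannot represent accented characters. A common fix is to encode explicitly as UTF-8 before writing the text out. The snippet below is a minimal sketch with a made-up sample string, not the exact code from the linked thread:

```python
text = u"d\xe9j\xe0 vu"  # sample French text with accents (illustrative only)

# Encoding as ASCII fails for accented characters, which is the error above.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly as UTF-8 works for any Unicode text.
utf8_bytes = text.encode("utf-8")

print(ascii_ok)
print(repr(utf8_bytes))
```

In Python 2, printing text.encode("utf-8") avoids the implicit ASCII step; in Python 3, print(text) usually works directly as long as the terminal encoding supports the characters.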