How to get plain text from a string in Python?

Question

I have a string that contains many HTML tags, such as the following:
u'find /home/tiger/workspace  -name "[0-9]*" find /home/tiger/workspace  -name "[!0-9]*" find /home/tiger/workspace  -name "[^0-9]*" \u627e\u51fa\u6240\u6709\u5305\u542b\u6570\u5b57\u7684\u6587\u4ef6\uff0c\u4e0d\u5305\u542b\u6570\u5b57\u7684\u6587\u4ef6\u3002 tiger@debian:~$ find /home/tiger  -name "*[0-9]*"  >kan1 tiger@debian:~$ find /home/tiger  -name "[0-9]*"  >kan2 tiger@debian:~$ find /home/tiger  -name "*[0-9]"  >kan3 \u5305\u542b\u6570\u5b57\uff0c\u6570\u5b57\u5f00\u5934\uff0c\u6570\u5b57\u7ed3\u5c3e'

How can I strip the HTML tags and get only the plain text from the string?

possible duplicate of [Extracting text from HTML file using Python](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) — karthikr, Apr 07 '13 at 04:21

score 0 · Accepted Answer · edited May 23 '17 at 11:56

Use the html2text library:

>>> import html2text
>>> print html2text.html2text(s)
find /home/tiger/workspace&nbsp_place_holder; -name "[0-9]*"

find /home/tiger/workspace&nbsp_place_holder; -name "[!0-9]*"

find /home/tiger/workspace&nbsp_place_holder; -name "[^0-9]*"


找出所有包含数字的文件，不包含数字的文件。

tiger@debian:~$ find /home/tiger&nbsp_place_holder; -name
"*[0-9]*"&nbsp_place_holder; >kan1

tiger@debian:~$ find /home/tiger&nbsp_place_holder; -name
"[0-9]*"&nbsp_place_holder; >kan2

tiger@debian:~$ find /home/tiger&nbsp_place_holder; -name
"*[0-9]"&nbsp_place_holder; >kan3



包含数字，数字开头，数字结尾

See [Extracting text from HTML file using Python](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) for reference.
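
If installing a third-party package is not an option, a similar result can be sketched with the standard library alone. The snippet below is not what html2text does internally; it is a minimal alternative that assumes Python 3, and the names `TextExtractor` and `strip_tags` are made up for illustration. It simply collects the text nodes and discards the tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, ignores all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def strip_tags(html):
    """Return the text content of an HTML fragment (illustrative sketch)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(strip_tags('<p>find /home/tiger -name "[0-9]*"</p>'))
```

Unlike html2text, this sketch does not insert line breaks for block elements, and it keeps the contents of `<script>` and `<style>` elements as ordinary text, so it is best suited to simple fragments like the one in the question.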
