Extract text from HTML in Python

Question

Possible Duplicate:
Extracting text from HTML file using Python

What is the best way in Python to extract text from HTML pages in the same way that a browser does when you copy and paste?

Possible duplicate. I recommend this answer: http://stackoverflow.com/a/3987802/117092 – luc, Jan 13 '12 at 06:26

score 5 · Accepted Answer · answered Jan 13 '12 at 02:19


BeautifulSoup is a popular option for reading and parsing HTML pages.
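As a minimal sketch of what that could look like, assuming the `bs4` package (Beautiful Soup 4) is installed: the helper name `html_to_text` and the sample markup are made up for illustration, and the built-in `html.parser` backend is just one possible parser choice.

```python
from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents a browser does not render as page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator keeps adjacent text nodes from fusing together;
    # strip=True trims each node and skips whitespace-only ones.
    return soup.get_text(separator="\n", strip=True)


# Hypothetical sample page, for illustration only.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(sample))
```

Note that this puts every text node on its own line, so inline markup such as `<b>` also causes a line break. It does not reproduce a browser's layout.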


Makoto


Dang. What easy points, @Makoto! `:D` – yurisich Jan 13 '12 at 02:48

score 2 · Answer 2 · edited May 23 '17 at 12:24

The question that monkut references doesn't give any Python solution to the exact problem. While BeautifulSoup and lxml can both be used to parse HTML, there is still a big step from there to text that approximates the formatting embedded in the HTML.

To do this, I have resorted to non-Python solutions (which I've blogged about, but will resist linking here, as I'm not sure of the SO etiquette). If you are on a *nix system, you can install the html2text package from Germany. It can be installed easily on macOS with Homebrew ($ brew install html2text) or MacPorts ($ sudo port install html2text), and on other *nix systems through their package managers. It has a number of useful options, and I use it like this:

html2text -nobs -ascii -width 200 -style pretty -o filename.txt - < filename.html

You can also install a text-based browser (e.g. w3m) and use it to produce formatted text from HTML using the following command-line syntax: w3m filename.html -dump > file.txt

You can, of course, access these solutions from Python using the subprocess module or the popular envoy wrapper for subprocess.
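A rough sketch of the subprocess route, assuming w3m is installed and on the PATH. The function names `w3m_command` and `html_file_to_text` are hypothetical, and the `-dump` option is placed before the file name here rather than after it as in the command above.

```python
import shutil
import subprocess
from typing import List


def w3m_command(html_path: str) -> List[str]:
    """Build the argument list for dumping a rendered HTML file as text."""
    return ["w3m", "-dump", html_path]


def html_file_to_text(html_path: str) -> str:
    """Run w3m on html_path and return the formatted text it prints."""
    # Assumption: w3m is available; fail early with a clear message if not.
    if shutil.which("w3m") is None:
        raise RuntimeError("w3m is not installed or not on PATH")
    result = subprocess.run(
        w3m_command(html_path),
        capture_output=True,  # collect stdout instead of redirecting to a file
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling `html_file_to_text("filename.html")` would then return the same text that the shell command writes to file.txt. The html2text command could be wrapped in the same way.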

Even after all this effort, you may find that some important information (e.g. <u> tags) is not handled in the way you would like, but these are the best options I have found so far.
