
Possible Duplicate:
Extracting text from HTML file using Python

What is the best way in Python to extract text from HTML pages the same way a browser does when you copy and paste?

Mark Vital
Possible duplicate. I recommend this answer: http://stackoverflow.com/a/3987802/117092 – luc Jan 13 '12 at 06:26

2 Answers

BeautifulSoup is a popular option for reading and parsing HTML pages.

Makoto

The question that monkut references doesn't give any Python solution to the exact problem. While BeautifulSoup and lxml can both be used to parse HTML, there is still a big step from a parsed document to text that approximates the formatting implied by the HTML markup.
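To illustrate the gap, here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup or lxml. The names extract_text, TextExtractor, BLOCK_TAGS and SKIP_TAGS are made up for this example, and the choice of which tags start a new line is my assumption. It recovers line breaks around block-level elements but nothing like tables, indentation or underlining, which is why it falls well short of what a browser gives you on copy-paste.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as line-breaking; adjust to taste.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "ul", "ol", "table",
              "blockquote", "pre", "h1", "h2", "h3", "h4", "h5", "h6"}
# Content inside these tags is never shown by a browser, so drop it.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect visible text, inserting newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling extract_text on the contents of an HTML file returns one line of text per block element, with inline tags such as <b> or <u> silently dropped.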

To do this, I have resorted to non-Python solutions (which I've blogged about, but will resist linking here, as I'm not sure of the SO etiquette). If you are on a *nix system, you can install this html2text package from Germany. It can be installed easily on macOS with Homebrew ($ brew install html2text) or MacPorts ($ sudo port install html2text), and on other *nix systems through their package managers. It has a number of useful options, and I use it like this:

html2text -nobs -ascii -width 200 -style pretty -o filename.txt - < filename.html

You can also install a text-based browser (e.g. w3m) and use it to produce formatted text from HTML using the following command-line syntax: w3m filename.html -dump > file.txt

You can, of course, access these solutions from Python using the subprocess module or the popular envoy wrapper for subprocess.
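As a rough sketch of the subprocess route (not tested against every version of these tools), the snippet below feeds an HTML file to either converter on standard input and returns the text. The function names build_command and convert_with_tool are made up for this example. The html2text options are the ones from the command above and may differ between html2text versions; for w3m I am assuming the -T text/html and -cols options so that it reads HTML from stdin at a fixed width.

```python
import shutil
import subprocess


def build_command(tool, width=200):
    """Return the argument list for a converter that reads HTML on stdin."""
    if tool == "html2text":
        # Same options as the command line above; "-" means read stdin.
        return ["html2text", "-nobs", "-ascii", "-width", str(width),
                "-style", "pretty", "-"]
    if tool == "w3m":
        # -T text/html makes w3m treat stdin as HTML; -dump prints the text.
        return ["w3m", "-dump", "-T", "text/html", "-cols", str(width)]
    raise ValueError(f"unsupported tool: {tool!r}")


def convert_with_tool(html_path, tool="w3m", width=200):
    """Run the external converter on html_path and return its output as text."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not installed or not on PATH")
    with open(html_path, "rb") as html_file:
        result = subprocess.run(build_command(tool, width), stdin=html_file,
                                capture_output=True, check=True)
    # Assumption: the tool's output is UTF-8 compatible; bad bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

For example, convert_with_tool("filename.html", tool="html2text") should give roughly the same text as the shell command above writes to filename.txt.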

Even after all this effort, you may find that some important information (e.g. <u> tags) is not handled the way you would like, but these are the best options I have found so far.

Ari