Possible Duplicate:
Extracting text from HTML file using Python
What is the best way in Python to extract text from HTML pages in the same way that a browser does when you copy and paste?
The question that monkut references doesn't give any Python solution to the exact problem. While BeautifulSoup and lxml can both be used to parse HTML, there is still a big step from there to text that approximates the formatting embedded in the HTML.
To do this, I have resorted to non-Python solutions (which I've blogged about, but will resist linking here, as I'm not sure of the SO etiquette). If you are on a *nix system, you can install the html2text package from Germany. It can be installed easily on macOS with Homebrew ($ brew install html2text) or MacPorts ($ sudo port install html2text), and on other *nix systems through their package managers. It has a number of useful options, and I use it like this:
html2text -nobs -ascii -width 200 -style pretty -o filename.txt - < filename.html
You can also install a text-based browser (e.g. w3m) and use it to produce formatted text from HTML with the following command-line syntax:
w3m filename.html -dump > file.txt
You can, of course, access these solutions from Python using the subprocess module or the popular envoy wrapper for subprocess.
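As a rough sketch of the subprocess route, something like the following should work on Python 3.7+. The function names `build_command` and `html_to_text` are my own, not part of any library. The flags are copied from the command lines above, except that the file name is passed as an argument instead of being redirected through stdin. It assumes the `html2text` found on your PATH is the command-line tool described above, not a different program with the same name.

```python
import shutil
import subprocess


def build_command(tool, html_path):
    """Return the argument list for the chosen converter (hypothetical helper)."""
    if tool == "w3m":
        # Same as: w3m filename.html -dump
        return ["w3m", "-dump", html_path]
    if tool == "html2text":
        # Flags copied from the answer; the text goes to stdout because -o is omitted.
        return ["html2text", "-nobs", "-ascii", "-width", "200",
                "-style", "pretty", html_path]
    raise ValueError("unsupported tool: %r" % (tool,))


def html_to_text(html_path, tool="w3m"):
    """Run the external converter and return its output as a string."""
    if shutil.which(tool) is None:
        raise RuntimeError("%s is not installed or not on PATH" % tool)
    result = subprocess.run(
        build_command(tool, html_path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `html_to_text("filename.html")` would then return the formatted text, and `html_to_text("filename.html", tool="html2text")` switches converters. Passing the arguments as a list avoids going through a shell, so file names with spaces need no quoting.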
Even after all this effort, you may find that some important information (e.g. <u> tags) is not handled in a way you like, but those are the best current options that I have found.