html-to-text conversion using Python standard library only

Question

I'm looking for the best way to convert HTML to text, using only modules from the Python 2.7.x standard library. (I.e., no BeautifulSoup, etc.)

By HTML-to-text conversion I mean the moral equivalent of lynx -dump. In fact, just getting rid of HTML tags intelligently, and converting all HTML-entities to ASCII (or to UTF8-encoded unicode), would suffice.

No regex-based answers, please. (Regexes are not up to the task.)

Thanks!

vartec · Accepted Answer · score 5 · 2012-03-19T15:44:18.070

Python has shipped the HTMLParser module since 2.2. It's neither the most efficient nor the easiest to use, but it's there...
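
As a rough illustration of the HTMLParser approach, here is a minimal sketch of a text extractor. The class name TextExtractor, the helper html_to_text, and the choice of which tags count as block-level are my own assumptions, not part of the library. It is written for Python 3.4+; in Python 2.7 the import is `from HTMLParser import HTMLParser`, there is no convert_charrefs argument, and entities arrive through handle_entityref and handle_charref instead, so you would need to add those two handlers.

```python
from html.parser import HTMLParser  # Python 2.7: from HTMLParser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    # Hypothetical list of tags that should start a new line in the output
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes entities before handle_data sees them
        HTMLParser.__init__(self, convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Ignore text that sits inside <script> or <style>
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_text("<h1>Title</h1><p>Caf&eacute; &amp; bar</p>"))
```

This is nowhere near lynx -dump (no link lists, no table layout), but it strips tags and decodes entities without any regexes.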

And if you're dealing with proper XHTML (or you can pass it through Tidy), you can use the much better ElementTree:

import xml.etree.ElementTree as ET

# parse() requires well-formed XML, so this only works for valid XHTML
tree = ET.parse("your_document.xhtml")
# tostring() is a module-level function; method="text" drops all tags
your_string = ET.tostring(tree.getroot(), method="text", encoding="utf-8")
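
If the XHTML is already in a string rather than a file, a variation on the same idea is to walk the tree and pull the text out block by block, which keeps headings and paragraphs from running together. The sketch below assumes well-formed input; the names BLOCK_TAGS, local_name and xhtml_blocks are made up for illustration, and nested block elements would be reported twice.

```python
import xml.etree.ElementTree as ET

# Illustrative choice of elements to treat as separate blocks of text
BLOCK_TAGS = {"h1", "h2", "h3", "p", "li"}


def local_name(tag):
    # Strip a namespace prefix such as "{http://www.w3.org/1999/xhtml}p"
    return tag.rsplit("}", 1)[-1]


def xhtml_blocks(xhtml):
    root = ET.fromstring(xhtml)
    blocks = []
    for element in root.iter():
        if local_name(element.tag) in BLOCK_TAGS:
            # itertext() yields the element's text and that of its descendants
            text = "".join(element.itertext())
            blocks.append(" ".join(text.split()))
    return blocks


sample = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
          "<h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>")
print("\n".join(xhtml_blocks(sample)))
```

Both iter() and itertext() are available from Python 2.7 onwards, so this should also work on the version the question asks about.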

edited Mar 19 '12 at 15:44

answered Mar 19 '12 at 15:32

vartec


score 0 · Answer 2 · answered Jul 26 '19 at 15:58

I wrote a really simple Python script that extracts only headings and paragraphs from HTML files, without using any third-party libraries. Note: this script can only handle really simple HTML, where each heading or paragraph sits on a single line. It's written in Python 3.

#!/usr/bin/env python3
# Markers for the lines we want to keep
headings = "<h1>"
paragraphs = "<p>"

# Read the file line by line; the with-block closes it afterwards
with open('filename.html') as f:
    for line in f:
        # Print any line that contains a heading or paragraph tag
        if headings in line or paragraphs in line:
            print(line.rstrip())

You can still expand on this idea and make it extract more stuff from the HTML file.

Have you even tried this code? I think it's unfair to even call it code. — Samy, Oct 08 '21 at 19:03

score -1 · Answer 3 · edited May 23 '17 at 11:43


I would also suggest taking a look at html2text.
There is also another thread on this topic that is worth a look.

answered Mar 19 '12 at 21:05

kiran

I specifically asked for answers that require only modules in the standard Python distribution; html2text is not in the standard library. – kjo Mar 20 '12 at 00:10
