HTML indenter written in Python

Question

I am looking for a free (as in freedom) HTML indenter (or re-indenter) written in Python (module or command line). I don't need to filter HTML with a white list. I just want to indent (or re-indent) HTML source to make it more readable. For example, say I have the following code:

<ul><li>Item</li><li>Item
</li></ul>

the output could be something like:

<ul>
    <li>Item</li>
    <li>Item</li>
</ul>

Note: I am not looking for an interface to non-Python software (for example Tidy, which is written in C), but for a 100% Python script.

Thanks a lot.

score 8 · Answer 1 · answered Jun 25 '11 at 22:02

You can use the toprettyxml method of the built-in xml.dom.minidom module:

>>> from xml.dom import minidom
>>> x = minidom.parseString("<ul><li>Item</li><li>Item\n</li></ul>")
>>> print x.toprettyxml()
<?xml version="1.0" ?>
<ul>
    <li>
        Item
    </li>
    <li>
        Item
    </li>
</ul>

Elisha

How to remove that `<?xml version="1.0" ?>` line? – Arbaz Siddiqui Jan 14 '17 at 15:51
You can just remove the first line: `'\n'.join(x.toprettyxml().splitlines()[1:])` (not the best solution, but it will do the job) – Elisha Jan 15 '17 at 14:38
To remove the header, where `header='<?xml version="1.0" ?>'` (that's the current default XML header), this should do it: `re.sub(re.escape(header), '', xml, flags=re.IGNORECASE | re.MULTILINE).strip()` (`import re` beforehand). – PatrickT Jun 06 '20 at 02:08
To remove `<?xml version="1.0" ?>` I used `html = html[23:-1]`. It gets rid of the blank line at the end too. – Harley Feb 19 '22 at 08:12
To remove blank lines it is much safer to use `.strip()` – Elisha Feb 20 '22 at 10:25
Great solution to avoid additional libraries, but the XML parser is more fragile with dirty HTML, and you can get errors like `xml.parsers.expat.ExpatError: not well-formed (invalid token)` -- it may take a bit more effort to clean up the source, see https://stackoverflow.com/questions/48821725/xml-parsers-expat-expaterror-not-well-formed-invalid-token – Mark Chackerian Jun 24 '23 at 15:19
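
The header-stripping tricks in the comments above can be sidestepped: serializing the document's root element, rather than the document itself, emits no XML declaration. Here is a minimal Python 3 sketch, assuming well-formed input. The names `indent_markup` and `_strip_text_nodes` are hypothetical helpers, not part of minidom, and trimming text nodes is a choice made for this sketch (it alters whitespace, which matters inside elements such as `pre`). On recent Python 3 releases, elements that contain only text are kept on one line.

```python
from xml.dom import minidom


def _strip_text_nodes(node):
    """Trim whitespace in text nodes and drop the ones that become empty."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            child.data = child.data.strip()
            if not child.data:
                node.removeChild(child)
        else:
            _strip_text_nodes(child)


def indent_markup(markup, indent="    "):
    """Re-indent well-formed markup without emitting the XML declaration."""
    dom = minidom.parseString(markup)
    root = dom.documentElement
    _strip_text_nodes(root)
    # Serializing the root element (not the Document) skips the declaration.
    return root.toprettyxml(indent=indent).strip()


print(indent_markup("<ul><li>Item</li><li>Item\n</li></ul>"))
```

Because the parser is still expat, this sketch fails on HTML that is not well-formed, exactly as described in the comment above.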

score 7 · Answer 2 · edited Apr 19 '23 at 03:55

Using BeautifulSoup

There are a dozen ways to use the BeautifulSoup module and its prettify function. Here are some examples to get you started.

With a command line

$ python -m BeautifulSoup < somefile.html > prettyfile.html

Within VIM (manually)

You don't have to write the file back to disk if you don't want to, but I included that step so the result is identical to the command-line example.

$ vi somefile.html
:%!python -m BeautifulSoup
:w prettyfile.html

Within VIM (define key-mapping)

In ~/.vimrc define:

nmap =h :%!python -m BeautifulSoup<CR>

Then, when you open a file in vim and it needs beautification:

$ vi somefile.html
=h
:w prettyfile.html

Once again, saving the beautification is optional.

Python Shell

$ python
>>> from BeautifulSoup import BeautifulSoup as parse_html_string
>>> from os import path
>>> uglyfile = path.abspath('somefile.html')
>>> path.isfile(uglyfile)
True
>>> prettyfile = path.abspath(path.join('.', 'prettyfile.html'))
>>> path.exists(prettyfile)
>>> doc = None
>>> with open(uglyfile, 'r') as infile, open(prettyfile, 'w') as outfile:
...     # Assuming very simple case
...     htmldocstr = infile.read()
...     doc = parse_html_string(htmldocstr)
...     outfile.write(doc.prettify())

# That's it; you can manually manipulate the dom too though
>>> scripts = doc.findAll('script')
>>> meta = doc.findAll('meta')
>>> print doc.prettify()
[imagine beautiful html here]

>>> import jsbeautifier
>>> print jsbeautifier.beautify(scripts[0].string)
[imagine beautiful script here]
>>>

score 4 · Answer 3 · answered Jan 08 '16 at 10:07

There's also the html5print module. Key features from the description page:

Pretty print HTML as well as embedded CSS and JavaScript within it
Pretty print pure CSS and JavaScript
Try to fix fragmented HTML5
Try to fix HTML with broken unicode encoding
Try to guess encoding of the document, and in some cases manage to convert 8-bit byte code back into correct UTF-8 format
Support both Python 2 and 3

thdoan

This uses bs4 behind the scenes, see other answers for bs4 prettify. – Rod Maniego Apr 06 '22 at 16:14

score 3 · Accepted Answer · edited May 23 '17 at 10:30

BeautifulSoup has a function called prettify which does this. See this question

edited May 23 '17 at 10:30

Community

answered Jun 25 '11 at 21:40

Uku Loskit

**Except it doesn't.** It only gives 1 space per level of indentation, and that isn't parameterizable - the OP wanted 4 spaces per level. It also doesn't let you exempt particular tags or inline elements from indentation. It essentially has zero parameterizability. This is why you see so many questions asking for this, over a decade. – smci Dec 29 '18 at 09:34
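
One workaround for the one-space indentation described in the comment above is to post-process the prettified text and multiply each line's leading spaces. The sketch below is pure standard library. `widen_indent` is a hypothetical helper name, and the sample string is a hand-written stand-in for one-space-per-level output rather than real library output. Note that it also widens leading spaces inside whitespace-sensitive elements such as `pre`.

```python
import re


def widen_indent(pretty, width=4):
    """Multiply each line's run of leading spaces by `width`."""
    return re.sub(
        r"^ +",
        lambda match: match.group(0) * width,
        pretty,
        flags=re.MULTILINE,
    )


# Hand-written sample shaped like one-space-per-level output (an assumption).
one_space = "<ul>\n <li>\n  Item\n </li>\n</ul>"
print(widen_indent(one_space))
```

This only changes the indentation width; it does not address the other limitation raised in the comment, exempting particular tags from indentation.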

score 1 · Answer 5 · answered Oct 01 '20 at 20:05

Here's my pure Python solution:

from xml.dom.minidom import parseString as string_to_dom

def prettify(string, html=True):
    dom = string_to_dom(string)
    ugly = dom.toprettyxml(indent="  ")
    split = list(filter(lambda x: len(x.strip()), ugly.split('\n')))
    if html:
        split = split[1:]
    pretty = '\n'.join(split)
    return pretty

def pretty_print(html):
    print(prettify(html))

When used on your block of html:

html = """<ul><li>Item</li><li>Item</li></ul>"""
pretty_print(html)

I get:

<ul>
  <li>Item</li>
  <li>Item</li>
</ul>

emehex

I'm getting an "xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 110" when using the utf-8 string from lxml html/etree output. – Rod Maniego Apr 06 '22 at 16:11
@RodManiego Same, did you manage to fix it? – Woahthere May 12 '23 at 13:36
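
Errors like the one reported in these comments arise because minidom relies on the strict expat XML parser, which rejects markup that is not well-formed XML. Unclosed void tags such as `<br>` and HTML-only entities such as `&nbsp;` are two typical triggers, though the thread does not say which one applied in the commenters' case. The sketch below only detects the failure and returns the input unchanged; it does not repair the markup. `try_prettify` is a hypothetical helper name.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def try_prettify(markup, indent="  "):
    """Return (pretty_text, None) on success or (markup, error_message) on failure."""
    try:
        dom = parseString(markup)
    except ExpatError as exc:
        # minidom needs well-formed XML; typical HTML often is not.
        return markup, str(exc)
    lines = dom.documentElement.toprettyxml(indent=indent).splitlines()
    return "\n".join(line for line in lines if line.strip()), None


for sample in ("<ul><li>Item</li></ul>", "<p>one<br>two</p>", "<p>a&nbsp;b</p>"):
    pretty, error = try_prettify(sample)
    print(error or pretty)
```

For input that fails this check, a lenient HTML parser such as the BeautifulSoup approach from the other answers is the more robust route.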