14

I am looking for a free (as in freedom) HTML indenter (or re-indenter) written in Python (module or command line). I don't need to filter HTML with a white list. I just want to indent (or re-indent) HTML source to make it more readable. For example, say I have the following code:

<ul><li>Item</li><li>Item
</li></ul>

the output could be something like:

<ul>
    <li>Item</li>
    <li>Item</li>
</ul>

Note: I am not looking for an interface to non-Python software (for example Tidy, written in C), but for a 100% Python script.

Thanks a lot.

unutbu
jep

5 Answers

8

You can use the toprettyxml method from the standard-library module xml.dom.minidom (the input must be well-formed XML):

>>> from xml.dom import minidom
>>> x = minidom.parseString("<ul><li>Item</li><li>Item\n</li></ul>")
>>> print(x.toprettyxml())
<?xml version="1.0" ?>
<ul>
    <li>
        Item
    </li>
    <li>
        Item
    </li>
</ul>
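
On Python 3, `print` is a function, and recent versions of minidom keep an element that contains only text on a single line, so the output can differ from the transcript above. Here is a minimal sketch of the same approach; `indent_markup` is a hypothetical helper name, not part of minidom. It serializes `documentElement` instead of the whole document, which leaves out the XML declaration, and it assumes the input is well-formed XML (anything else raises `xml.parsers.expat.ExpatError`).

```python
from xml.dom import minidom


def indent_markup(markup, indent="    "):
    """Return an indented copy of well-formed markup, without the XML declaration."""
    document = minidom.parseString(markup)
    # Serializing the root element, not the document, skips '<?xml version="1.0" ?>'.
    pretty = document.documentElement.toprettyxml(indent=indent)
    # Drop whitespace-only lines, which some Python versions emit around text nodes.
    return "\n".join(line for line in pretty.splitlines() if line.strip())


print(indent_markup("<ul><li>Item</li><li>Item</li></ul>"))
```

Passing a different `indent` string (for example two spaces or a tab) changes the indentation width.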
Elisha
  • how to remove that `<?xml version="1.0" ?>` line? – Arbaz Siddiqui Jan 14 '17 at 15:51
  • you can just remove the first line `'\n'.join(x.toprettyxml().splitlines()[1:])` (not the best solution but will do the work) – Elisha Jan 15 '17 at 14:38
  • To remove the header, where `header='<?xml version="1.0" ?>'` (that's the current default xml header), this: `re.sub(re.escape(header), '', xml, flags=re.IGNORECASE | re.MULTILINE).strip()` should do it (`import re` beforehand). – PatrickT Jun 06 '20 at 02:08
  • To remove `<?xml version="1.0" ?>` I used this `html = html[23:-1]` It gets rid of blank line at end too. – Harley Feb 19 '22 at 08:12
  • To remove blank lines it is much safer to use `.strip()` – Elisha Feb 20 '22 at 10:25
  • Great solution to avoid additional libraries, but the XML parser is more fragile with dirty HTML, and you can get errors like `xml.parsers.expat.ExpatError: not well-formed (invalid token)` -- it may take a bit more effort to clean up the source, see https://stackoverflow.com/questions/48821725/xml-parsers-expat-expaterror-not-well-formed-invalid-token – Mark Chackerian Jun 24 '23 at 15:19
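
For HTML that is not well-formed XML, as the last comment describes, a tolerant parser is needed. The following is a rough sketch (not from the answer above) built on the standard library's `html.parser`; the names `Indenter` and `indent_html` are made up for illustration. It puts every tag and text run on its own line, collapses whitespace in text, re-escapes decoded text, and makes no attempt to preserve whitespace inside `<pre>` or to repair mis-nested tags.

```python
import html
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose content the parser hands over as raw, undecoded text.
RAW_TEXT_TAGS = {"script", "style"}


class Indenter(HTMLParser):
    """Re-emit each tag and text run on its own line, indented by nesting depth."""

    def __init__(self, indent="    "):
        super().__init__(convert_charrefs=True)
        self.indent_unit = indent
        self.depth = 0
        self.in_raw_text = False
        self.out_lines = []

    def emit(self, text):
        self.out_lines.append(self.indent_unit * self.depth + text)

    def handle_starttag(self, tag, attrs):
        self.emit(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1
            self.in_raw_text = tag in RAW_TEXT_TAGS

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: the depth does not change.
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.in_raw_text = False
        if tag not in VOID_TAGS:
            self.depth = max(self.depth - 1, 0)
        self.emit("</%s>" % tag)

    def handle_data(self, data):
        if self.in_raw_text:
            # Script and style bodies are not HTML: keep their lines, only re-indent.
            for line in data.splitlines():
                if line.strip():
                    self.emit(line.strip())
        elif data.strip():
            # The parser decoded character references, so escape them again.
            self.emit(html.escape(" ".join(data.split()), quote=False))

    def handle_comment(self, data):
        self.emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.emit("<!%s>" % decl)


def indent_html(markup, indent="    "):
    parser = Indenter(indent)
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.out_lines)


print(indent_html("<ul><li>Item</li><li>Item\n</li></ul>"))
```

Unlike minidom, this accepts unclosed tags such as `<br>`, at the cost of never checking that the tags actually match.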
7

Using BeautifulSoup

There are a dozen ways to use the BeautifulSoup module and its prettify function. Here are some examples to get you started.

With a command line

$ python -m BeautifulSoup < somefile.html > prettyfile.html

Within VIM (manually)

You don't have to write the file back to disk if you don't want to, but I included the step that would get the identical effect as the commandline example.

$ vi somefile.html
:%!python -m BeautifulSoup
:w prettyfile.html

Within VIM (define key-mapping)

In ~/.vimrc define:

nmap =h :%!python -m BeautifulSoup<CR>

Then, when you open a file in vim and it needs beautification:

$ vi somefile.html
=h
:w prettyfile.html

Once again, saving the beautification is optional.

Python Shell

$ python
>>> # With the current bs4 package, import with: from bs4 import BeautifulSoup
>>> from BeautifulSoup import BeautifulSoup as parse_html_string
>>> from os import path
>>> uglyfile = path.abspath('somefile.html')
>>> path.isfile(uglyfile)
True
>>> prettyfile = path.abspath(path.join('.', 'prettyfile.html'))
>>> path.exists(prettyfile)
False
>>> doc = None
>>> with open(uglyfile, 'r') as infile, open(prettyfile, 'w') as outfile:
...     # Assuming very simple case
...     htmldocstr = infile.read()
...     doc = parse_html_string(htmldocstr)
...     outfile.write(doc.prettify())
...
>>> # That's it; you can manually manipulate the DOM too though
>>> scripts = doc.findAll('script')
>>> meta = doc.findAll('meta')
>>> print(doc.prettify())
[imagine beautiful html here]

>>> import jsbeautifier
>>> for script in scripts:
...     if script.string:
...         print(jsbeautifier.beautify(script.string))
...
[imagine beautiful script here]
>>>
Plop
Guy Hoozdis
4

There's also the html5print module. Key features from the description page:

  • Pretty print HTML as well as embedded CSS and JavaScript within it
  • Pretty print pure CSS and JavaScript
  • Try to fix fragmented HTML5
  • Try to fix HTML with broken unicode encoding
  • Try to guess encoding of the document, and in some cases manage to convert 8-bit byte code back into correct UTF-8 format
  • Support both Python 2 and 3
thdoan
3

BeautifulSoup has a function called prettify which does this. See this question
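
As a minimal sketch, assuming the third-party `beautifulsoup4` package is installed (it is imported as `bs4`) and using the standard-library `html.parser` backend, a call to `prettify` could look like this. `prettify_html` is only an illustrative wrapper name.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def prettify_html(markup):
    """Parse markup leniently and return it with one tag or text run per line."""
    soup = BeautifulSoup(markup, "html.parser")
    return soup.prettify()


print(prettify_html("<ul><li>Item</li><li>Item\n</li></ul>"))
```

By default `prettify` indents by one space per nesting level and puts text on its own line, so the result is more spread out than the layout in the question.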

Community
Uku Loskit
1

Here's my pure Python solution:

from xml.dom.minidom import parseString as string_to_dom

def prettify(string, html=True):
    """Indent well-formed markup; with html=True, drop the XML declaration."""
    dom = string_to_dom(string)
    ugly = dom.toprettyxml(indent="  ")
    # Discard the whitespace-only lines that toprettyxml can emit
    lines = [line for line in ugly.split('\n') if line.strip()]
    if html:
        # The first line is the '<?xml version="1.0" ?>' declaration
        lines = lines[1:]
    return '\n'.join(lines)

def pretty_print(html):
    print(prettify(html))

When used on your block of html:

html = """<ul><li>Item</li><li>Item</li></ul>"""
pretty_print(html)

I get:

<ul>
  <li>Item</li>
  <li>Item</li>
</ul>
emehex