
Beautiful Soup spreads HTML out so that every element starts on a new line.

All the HTML minifiers I have found compress everything to a single line.

Is there, somewhere (and preferably in Python), a tool that will output normal HTML? That is, block elements would each start on a new line, but inline elements would not.

Beautiful Soup output:

<h2>
 headline
</h2>
<p>
  Blah blah
   <b>
     bold text
   </b>
  same paragraph blah
   <a href="">
     a link in the text
   </a>
</p>
<p>
 Another paragraph
</p>

Minified:

<h2>headline</h2><p> Blah blah <b> bold text </b> same paragraph blah <a href=""> a link in the text </a></p><p> Another paragraph</p>

What I want:

<h2>headline</h2>
<p> Blah blah <b> bold text </b> same paragraph blah <a href=""> a link in the text </a></p>
<p> Another paragraph</p>
har07
Adam Michael Wood

1 Answer


Here's a quick and dirty solution.

Make a regex from the opening tags of the block-level elements. Call str() on the BeautifulSoup tree, then use re.sub() to insert a \n in front of each block-level opening tag.

import re

blocktags = '''\
<address    <article    <aside
<blockquote
<canvas
<dd    <div    <dl
<fieldset    <figcaption    <figure    <footer    <form
<h1    <h2    <h3    <h4    <h5    <h6    <header    <hgroup    <hr
<li
<main
<nav    <noscript
<ol    <output
<p    <pre
<section
<table    <tfoot
<ul
<video'''.split()

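# The lookahead requires whitespace, '/' or '>' after the tag name, so that
# '<p' does not also match '<progress' and '<li' does not match '<link'.
pat = re.compile('(' + '|'.join(blocktags) + r')(?=[\s/>])')

# soup is an existing BeautifulSoup object.
blocked_str = pat.sub(r'\n\1', str(soup))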
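
To check the idea against the sample in the question, here is a minimal, standard-library-only sketch. It is applied to the minified string directly rather than to str(soup), uses a shortened tag list, and adds a lookahead after the tag name; the names BLOCK_TAGS, BLOCK_OPEN and break_before_blocks are hypothetical and not part of the answer above.

```python
import re

# Shortened list of block-level tag names (an assumption for this sketch;
# extend it with the full list from the answer as needed).
BLOCK_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6",
              "p", "pre", "div", "ul", "ol", "li", "table"]

# Match "<tag" only when followed by whitespace, "/" or ">",
# so "<p" does not also match "<pre" or "<progress".
BLOCK_OPEN = re.compile(r"<(?:" + "|".join(BLOCK_TAGS) + r")(?=[\s/>])")


def break_before_blocks(html: str) -> str:
    """Insert a newline before every block-level opening tag."""
    broken = BLOCK_OPEN.sub(lambda m: "\n" + m.group(0), html)
    # Drop the newline added before a block tag at the very start.
    return broken.lstrip("\n")


minified = ('<h2>headline</h2><p> Blah blah <b> bold text </b> same paragraph blah '
            '<a href=""> a link in the text </a></p><p> Another paragraph</p>')

print(break_before_blocks(minified))
```

On the question's minified sample this prints the three lines listed under the desired output. Like any regex over HTML, it will also fire on text that merely looks like a tag, for example inside comments or script bodies.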
RootTwo
  • So I combined your solution with htmlmin (https://pypi.python.org/pypi/htmlmin/) instead of BS4's str(). (I need BS4's normal output because of character encoding and some other things.) Works. – Adam Michael Wood Apr 07 '16 at 23:13
  • The evil lord is coming after you: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Rafael Sierra Jan 11 '18 at 18:54