I have a lot of HTML files that contain unwanted line feeds. These break things like inline JavaScript and formatting within the pages. I want to strip out every line feed that does not appear directly after an HTML tag, e.g. </div>. Does anyone know of a regex and/or program that can achieve this?

soulmerge
You might benefit from a minifier. See http://stackoverflow.com/questions/728260/html-minification/1102101. – David Andres Sep 16 '09 at 11:20
2 Answers
You may be able to use Notepad++'s search/replace function, with a regular expression to catch most of this.
Something like:
([^>])\n(.+)
Replaced with:
\1 \2
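
For anyone who would rather script this than use an editor, here is a minimal Python sketch of the same idea. It is an adaptation, not the exact pattern above: it uses lookarounds instead of capture groups, because in Python's `re` a pattern that consumes the following line with `(.+)` leaves every other break untouched in a single pass. The function name `join_broken_lines` is made up for illustration, and the sketch assumes a break counts as "after a tag" only when the character immediately before it is `>` (trailing spaces after a tag are not handled).

```python
import re

# A line break (CRLF, lone CR, or LF) that is preceded by a character other
# than ">" and followed by a non-empty line. Blank lines are left alone,
# much like the (.+) requirement in the editor pattern.
LINE_BREAK = re.compile(r"(?<=[^>\r\n])(?:\r\n|\r|\n)(?=[^\r\n])")


def join_broken_lines(html: str) -> str:
    """Replace stray line breaks with a single space."""
    return LINE_BREAK.sub(" ", html)


if __name__ == "__main__":
    sample = "</div>\n<script>var a =\r\n1;</script>"
    print(join_broken_lines(sample))
```

Breaks that directly follow `>` are kept, so the tag-per-line layout of the file survives while text and script lines that were split in the middle are rejoined.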

DisgruntledGoat
Depending on the format of the HTML file, you may need to use ([^>])\r\n(.+) or ([^>])\r(.+) instead. – Brian Sep 16 '09 at 13:07
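You can use a negative lookbehind to match only the line feeds that do not follow the tag:
<?php
$buffer = file_get_contents('test.html');
// Remove every CR or LF that does not directly follow </div>.
// The second lookbehind keeps the LF of a CRLF pair that follows </div>;
// without it, Windows line endings after the tag would be reduced to a bare CR.
$buffer = preg_replace('|(?<!</div>)(?<!</div>\r)[\r\n]|', '', $buffer);
file_put_contents('test.new.html', $buffer);
?>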

Lance Rushing
You may actually want something more like (?<![^>]+>)(\r?\n){2,}, i.e. any closing tag followed by more than one CRLF (where the CR is optional). – Neel Sep 29 '09 at 11:29
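
A variable-width lookbehind such as `(?<![^>]+>)` is rejected by some regex engines; Python's `re`, for example, only accepts fixed-width lookbehinds. One way around that is to skip the lookbehind and work line by line. The sketch below is one possible take, under stated assumptions: `strip_stray_breaks` and its `sep` parameter are hypothetical names, blank lines are dropped entirely, a line counts as ending in a tag when its last non-space character is `>`, and the output is normalized to `\n` line endings.

```python
def strip_stray_breaks(html: str, sep: str = " ") -> str:
    """Keep a line break only when the previous line ends with '>'.

    Other breaks are replaced by `sep`, and runs of blank lines are
    collapsed. Note that str.splitlines() treats CRLF, CR, LF and a few
    rarer separators (form feed, U+2028, ...) as line boundaries.
    """
    out = []
    for line in html.splitlines():
        if not line.strip():
            continue  # drop blank or whitespace-only lines
        if out and not out[-1].rstrip().endswith(">"):
            # The previous line did not end with a tag: glue this line on.
            out[-1] = out[-1] + sep + line
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "<div>\r\nfoo\r\nbar\r\n\r\n\r\n</div>\r\n<p>x</p>"
    print(strip_stray_breaks(sample))
```

Pass `sep=""` to remove the stray breaks without inserting a space, which is closer to what the PHP answer does.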