Remove unwanted line feeds from an HTML file

Question

I have a lot of HTML files that contain unwanted line feeds. These break things like inline JavaScript and formatting within the pages. I want a way to strip out every line feed that does not appear directly after an HTML tag, e.g. </div>. Does anyone know of a regex and/or program that can achieve this?

You might benefit from a minifier. See http://stackoverflow.com/questions/728260/html-minification/1102101. — David Andres, Sep 16 '09 at 11:20

score 1 · Answer 1 · answered Sep 16 '09 at 11:54

You may be able to use Notepad++'s search/replace function in regular expression mode to catch most of these.

Something like:

([^>])\n(.+)

Replaced with:

\1 \2
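Outside an editor, the same search/replace can be scripted. The following is a minimal Python sketch, not part of the original answer; the function name join_broken_lines is made up for illustration. Because each match consumes the line that follows the break, a single pass can leave some breaks behind, so the substitution is repeated until nothing changes.

```python
import re

# Same pattern as above: a character other than '>', a line feed,
# and the rest of the following line.
BROKEN_LINE = re.compile(r"([^>])\n(.+)")


def join_broken_lines(html: str) -> str:
    """Replace line feeds that do not directly follow '>' with a space."""
    while True:
        # Each match swallows the next line, so one pass can miss the break
        # at the end of that line; repeat until no replacement is made.
        html, count = BROKEN_LINE.subn(r"\1 \2", html)
        if count == 0:
            return html


if __name__ == "__main__":
    sample = "<div>\nvar x = 1;\nvar y = 2;\nvar z = 3;\n</div>\n<p>done</p>"
    print(join_broken_lines(sample))
```

Line feeds that come straight after a > are left alone, so the break after <div> and the one after </div> survive in the sample.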

answered Sep 16 '09 at 11:54

DisgruntledGoat

Depending on the line endings in the HTML file, you may need to use ([^>])\r\n(.+) (Windows) or ([^>])\r(.+) (old Mac) instead. – Brian Sep 16 '09 at 13:07

score 0 · Answer 2 · answered Sep 16 '09 at 18:23

You can use a negative lookbehind to match only the line feeds that are not preceded by a particular tag:

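<?php

$buffer = file_get_contents('test.html');

// normalise Windows (\r\n) and old Mac (\r) line endings to \n first;
// otherwise "</div>\r\n" would lose its \n but keep a stray \r
$buffer = str_replace(array("\r\n", "\r"), "\n", $buffer);

// remove all line feeds not directly preceded by </div>
$buffer = preg_replace('|(?<!</div>)\n|', '', $buffer);

file_put_contents('test.new.html', $buffer);
?>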

See: http://www.regular-expressions.info/lookaround.html
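The question asks about line feeds after any tag, not only </div>. As a rough Python sketch of the same lookbehind idea (a generalisation assumed here, not the answerer's code), the lookbehind can test for a bare > instead. Line endings are normalised to \n first so that Windows files behave the same way. The function name strip_line_feeds is hypothetical.

```python
import re


def strip_line_feeds(html: str) -> str:
    """Delete line feeds unless the previous character closes a tag."""
    # Normalise CRLF and lone CR to LF so one pattern covers all files.
    html = html.replace("\r\n", "\n").replace("\r", "\n")
    # (?<!>) is a fixed-width negative lookbehind: the line feed must not
    # come directly after '>'.
    return re.sub(r"(?<!>)\n", "", html)


if __name__ == "__main__":
    sample = "<div>\r\nalert('a');\r\nalert('b');\r\n</div>\r\n"
    print(repr(strip_line_feeds(sample)))
```

Two caveats with this sketch: the joined lines get no separator, so pass " " as the replacement if words must stay apart, and a > that ends a line of script (a comparison, say) is treated as if it closed a tag.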

answered Sep 16 '09 at 18:23

Lance Rushing

Or just use the RE: (?<!</div>)[\r\n] in your favorite editor. – Lance Rushing Sep 16 '09 at 18:25

You may actually want something more like (?<![^>]+>)(\r?\n){2,}, i.e. any closing tag followed by more than one CRLF (where the CR is optional). – Neel Sep 29 '09 at 11:29
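
Note that (?<![^>]+>) is a variable-length lookbehind, which Python's re module rejects with a "look-behind requires fixed-width pattern" error, and many other engines have similar limits. If the aim is what the comment describes, collapsing repeated line breaks that follow a tag, one way to sketch it in Python is to match the > itself rather than look behind for it. This is only one reading of the comment's intent, and collapse_breaks_after_tags is a made-up name.

```python
import re


def collapse_breaks_after_tags(html: str) -> str:
    """Reduce two or more line breaks directly after '>' to a single one."""
    # Match the '>' itself instead of using a lookbehind; (?:\r?\n){2,}
    # covers both LF and CRLF line endings.
    return re.sub(r">(?:\r?\n){2,}", ">\n", html)


if __name__ == "__main__":
    sample = "<p>one</p>\r\n\r\n\r\n<p>two</p>\n\n<p>three</p>\n"
    print(repr(collapse_breaks_after_tags(sample)))
```

The collapsed runs are written back as a single \n, so a CRLF file ends up with LF at those positions. A single line break after a tag is left untouched.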
