13

Does anyone know a good Java library (or single method) that can strip extra whitespace (line breaks, tabs, etc.) from an HTML file? Basically, the HTML file should be turned into a single line.

Thanks.

UPDATE: Looks like there is no library that does that so I created my own open source project for solving this task: http://code.google.com/p/htmlcompressor/

serg

5 Answers

25

Looks like there is no library that does that so I created my own open source project for solving this task, maybe someone will find it helpful: http://code.google.com/p/htmlcompressor/

serg
6

Personally, I just enabled HTTP compression in the server and I leave my HTML readable.

But for what you want, you could just use String.replaceAll() with a regex that matches what you have specified. Off the top of my head, something like:

small=large.replaceAll("\\s{2,}"," ");
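
A minimal, self-contained sketch of the one-liner above, to show what it does and does not remove. The class name `WhitespaceDemo`, the method name `collapseRuns`, and the sample markup are made up for illustration. Note that `\\s{2,}` only matches runs of two or more whitespace characters, so a lone line break is left alone.

```java
final class WhitespaceDemo {
    // Replaces every run of two or more whitespace characters with one space.
    // A single whitespace character (for example one '\n') is not touched.
    static String collapseRuns(String html) {
        return html.replaceAll("\\s{2,}", " ");
    }

    public static void main(String[] args) {
        // Hypothetical sample input, indented with spaces and a tab.
        String large = "<ul>\n    <li>one</li>\n\t<li>two</li>\n</ul>";
        String small = collapseRuns(large);
        System.out.println(small);
    }
}
```

With this input the indentation is collapsed, but the single line break before `</ul>` survives, so the result is still two lines. If the goal really is one line, `\\s+` would be the pattern to try instead.
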
Jacob van Lingen
Lawrence Dol
  • The only problem is that if you have a string that contains spaces, then those spaces will be erased as well. Also it will break a lot of HTML formatting, just for example "
    – Suroot Mar 06 '09 at 03:22
  • 1
    @Suroot no, it's fine. It replaces multiple spaces with just one. – sblundy Mar 06 '09 at 03:29
  • @sblundy but "Hello    World" will become "Hello World" which isn't what you want if "Hello    World" is what is supposed to be displayed. – TofuBeer Mar 06 '09 at 03:33
  • Well, that's some basic compression and that's what I am currently doing. It gets much deeper than that if you want to do it perfect and remove all possible characters (different rules apply for inside and outside of the tags). I think it is a common task and hope that someone already did it right. – serg Mar 06 '09 at 03:34
  • 1
    @Suroot Browsers convert multiple spaces to a single space. For example, your two "Hello Worlds" look the same. If you want multiple spaces, you need to use &nbsp;. – sblundy Mar 06 '09 at 03:39
  • 1
    Of course, if you rely on multiple spaces for formatting inside a <pre> tag, this will be fubared.
    – Evan Mar 06 '09 at 04:00
  • Which is why I don't like screwing with the HTML, preferring HTTP compression. – Lawrence Dol Mar 06 '09 at 06:07
  • For HTML, compressing any multispaces to one within a tag, including attribute values, should have no impact. The only corner case is <pre> and tag content whose CSS class has pre-format behavior.
    – Lawrence Dol Mar 06 '09 at 06:09
2

Be careful with that. Text inside pre and textarea elements will be damaged. In addition, every statement of inlined JavaScript inside script elements will have to end with a semicolon (;), because the line breaks that used to separate the statements are gone. Lastly, if you wrap inlined JavaScript in HTML comments (to avoid buggy behavior in some old browsers), collapsing it onto one line will end up commenting out the whole inlined script.
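
One way to sidestep these pitfalls is to collapse whitespace only outside the sensitive elements. The following is a rough sketch under stated assumptions, not a robust HTML parser: the class name `SafeCollapse` is hypothetical, and the regex assumes that pre, textarea, and script elements are well-formed, not nested, and have no `>` inside attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SafeCollapse {
    // Matches a whole <pre>, <textarea>, or <script> element, case-insensitively,
    // with '.' also matching line breaks (DOTALL).
    private static final Pattern PROTECTED = Pattern.compile(
            "<(pre|textarea|script)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collapses whitespace runs to one space everywhere except inside
    // the protected elements, which are copied through unchanged.
    static String collapse(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = PROTECTED.matcher(html);
        int last = 0;
        while (m.find()) {
            // Text before the protected element: safe to collapse.
            out.append(html.substring(last, m.start()).replaceAll("\\s+", " "));
            // The protected element itself: keep byte for byte.
            out.append(m.group());
            last = m.end();
        }
        // Remaining text after the last protected element.
        out.append(html.substring(last).replaceAll("\\s+", " "));
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>a\n\n  b</p>\n<pre>x\n  y</pre>\n<p>c</p>";
        System.out.println(collapse(html));
    }
}
```

Because script blocks are passed through untouched, the semicolon and HTML-comment problems do not arise here. The price is that those blocks are not shrunk at all.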

Why do you want to do that? If you want to decrease the download size of the HTML, then all you need is a GZIP filter.
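
To get a feel for how much GZIP saves compared with whitespace collapsing, the two can be measured side by side. This is only a sketch for experimenting, not a servlet filter: the class name `GzipSize`, the method name `gzippedSize`, and the generated table markup are assumptions made for illustration, and it uses only `java.util.zip` from the standard library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

final class GzipSize {
    // Returns the number of bytes the text occupies after GZIP compression.
    static int gzippedSize(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        // Hypothetical, heavily indented sample page.
        StringBuilder page = new StringBuilder("<table>\n");
        for (int i = 0; i < 200; i++) {
            page.append("    <tr>\n        <td>row ").append(i).append("</td>\n    </tr>\n");
        }
        page.append("</table>\n");

        String html = page.toString();
        String collapsed = html.replaceAll("\\s+", " ");

        System.out.println("readable:  " + html.length() + " chars, gzipped " + gzippedSize(html) + " bytes");
        System.out.println("collapsed: " + collapsed.length() + " chars, gzipped " + gzippedSize(collapsed) + " bytes");
    }
}
```

The numbers depend entirely on the page, so run it against your own HTML before deciding whether collapsing is worth an extra build step.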

Community
cherouvim
0

Assuming the goal is to make the HTML smaller to reduce the bytes sent over the network, why not have the HTTP server do the work? Read here.

Will this work? Not free unfortunately.

TofuBeer
  • Already using it. I still would like to have a compression though. – serg Mar 06 '09 at 03:38
  • Does it have to be Java? Does it have to be free? – TofuBeer Mar 06 '09 at 03:42
  • There's no point at all in whitespace collapsing your HTML if you are applying HTTP compression - the end result will be so close as to not matter for the size of data across the wire. WS collapsing just adds another pre-deployment step. – Lawrence Dol Mar 06 '09 at 06:12
-1
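String output = input.replaceAll("\\s+", " ");

will convert any run of whitespace into a single space. The backslash has to be doubled in a Java string literal, and because strings are immutable the result must be assigned.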

sblundy
cobbal