Hi, I have an HTML document like this:

<html>
   <head>
     <title>
          Some title
   </title>
</head>
<body>
    <div id="one">         some sample info </div>
</body>
</html>

How can I remove the whitespace in this HTML, except for the spaces within the text content and within the tags themselves, using a regex with preg_replace? I would like to get something like this:

<html><head><title>Some title</title></head><body><div id="one">some sample info</div></body></html>

Can anyone please help me with this? :)

Shades88

1 Answer

You can replace (?<=>)\s+(?=<)|(?<=>)\s+(?!=<)|(?!<=>)\s+(?=<) with empty strings.

Edit: There's a simpler form: replace (?<=>)\s+|\s+(?=<)

Simply put, this regex replaces a run of one or more whitespace characters if it has a > immediately to its left or a < immediately to its right.

It actually has two parts joined by OR (symbol: |), so either one may match:

  1. (?<=>)\s+ - this will match one or more whitespace characters (\s+ in the regex) if they are preceded by a > (in regex: (?<=>)).

  2. \s+(?=<) - this will match one or more whitespace characters if they are followed by a < (in regex: (?=<)).
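
As a rough illustration, here is a minimal sketch of how the simpler pattern could be passed to preg_replace on the HTML from the question. The variable names ($html, $pattern, $minified) and the choice of ~ as the delimiter are assumptions made for this example, not part of the original answer.

```php
<?php
// The HTML from the question, kept verbatim in a nowdoc so that
// all of the original whitespace is preserved in the string.
$html = <<<'HTML'
<html>
   <head>
     <title>
          Some title
   </title>
</head>
<body>
    <div id="one">         some sample info </div>
</body>
</html>
HTML;

// '~' is used as the delimiter so the pattern itself needs no escaping.
// Left alternative:  whitespace preceded by '>'.
// Right alternative: whitespace followed by '<'.
$pattern = '~(?<=>)\s+|\s+(?=<)~';

// Replace every match with the empty string.
$minified = preg_replace($pattern, '', $html);

echo $minified, PHP_EOL;
// <html><head><title>Some title</title></head><body><div id="one">some sample info</div></body></html>
```

Whitespace between words ("Some title") and inside the tag (div id="one") is left alone, because it has neither a > directly before it nor a < directly after it.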

Learn more about regex.

Sufian Latif
    This answer is completely unstable and relies on the notion that there are no lingering `>` or `<` symbols in any of the text nodes in the HTML document. I would not recommend this technique to anyone. This is just another case where using regex to do a DOM parser's job is inappropriate. Researchers, please be informed that regex is "DOM-ignorant" -- it doesn't know if it is matching the start/end of a tag or merely something that resembles the start/end of a tag. At the very least, this regex is too primitive to do a consistently good job. – mickmackusa Nov 04 '21 at 09:17
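
The concern in this comment can be reproduced with a small sketch. The input string below is a hypothetical example (it does not come from the question) in which > and < appear unescaped inside a text node; the pattern is the simpler one from the answer, and the variable names are assumptions for illustration.

```php
<?php
// The simpler pattern from the answer.
$pattern = '~(?<=>)\s+|\s+(?=<)~';

// Hypothetical input: '>' and '<' occur inside the text, not as tag delimiters.
$input = '<p>5 > 3 and 2 < 4</p>';

$result = preg_replace($pattern, '', $input);

echo $result, PHP_EOL;
// <p>5 >3 and 2< 4</p>
```

The regex cannot tell that these characters are part of the text, so it strips the space after the > and the space before the <, changing the visible content.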