1

Been trying to solve this for a while now.

I need a regex that strips the newlines, tabs, and spaces between HTML tags, as in the example below:

Source:

<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>

Wanted result:

<html><head><title>Some title</title></head></html>

Trimming the whitespace around "Some title" is optional. I'd be grateful for any help.

Tim Skauge
  • How do you know what white space to remove? Why are you removing the white space *around* "Some title", but not *in* it? What are your rules here? – Michael Myers Jun 02 '09 at 17:56
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Samantha Branham Jun 17 '10 at 06:29

9 Answers

20

If the HTML is well-formed XML (for example, strict XHTML), load it with an XML reader and write it back without formatting. That will preserve the whitespace inside text content, but not between tags.
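
The idea above can be sketched in Python with the standard-library `xml.etree.ElementTree` module. This is only an illustration under the assumption that the input is well-formed XML; the function name `strip_inter_tag_whitespace` and the `trim_text` flag are hypothetical, not part of the original answer. ElementTree keeps whitespace-only text nodes by default, so the sketch clears them explicitly.

```python
import xml.etree.ElementTree as ET


def strip_inter_tag_whitespace(markup: str, trim_text: bool = False) -> str:
    """Drop whitespace-only text between tags; optionally trim real text too."""
    root = ET.fromstring(markup)  # raises ParseError if the input is not well-formed
    for element in root.iter():
        # 'text' is the content right after the start tag,
        # 'tail' is the content right after the end tag.
        for attr in ("text", "tail"):
            value = getattr(element, attr)
            if value is None:
                continue
            if not value.strip():
                setattr(element, attr, None)           # whitespace only: remove it
            elif trim_text:
                setattr(element, attr, value.strip())  # optional trimming of real text
    return ET.tostring(root, encoding="unicode")


source = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""

print(strip_inter_tag_whitespace(source, trim_text=True))
# <html><head><title>Some title</title></head></html>
```

With `trim_text=False` the whitespace around "Some title" is kept. Trimming also removes spaces next to inline elements, which may not be wanted in running text.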

Welbog
1

\d does not match only [0-9] in Perl 5.8 and 5.10; it matches any Unicode character that has the digit property (including "\x{1815}" and "\x{FF15}"). If you mean [0-9], you must either write [0-9] or use the bytes pragma (but that turns all strings into 1-byte characters and is normally not what you want).
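
The same pitfall exists outside Perl. As a small illustration in Python (not Perl, so the details differ), `\d` in a `str` pattern also matches any Unicode decimal digit unless the `re.ASCII` flag is set. The helper name `is_digit` is made up for this sketch.

```python
import re


def is_digit(char: str, ascii_only: bool = False) -> bool:
    """Return True if the single character matches \\d under the chosen mode."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\d", char, flags=flags) is not None


fullwidth_five = "\uFF15"  # FULLWIDTH DIGIT FIVE
mongolian_five = "\u1815"  # MONGOLIAN DIGIT FIVE

print(is_digit(fullwidth_five))                           # True
print(is_digit(mongolian_five))                           # True
print(is_digit(fullwidth_five, ascii_only=True))          # False
print(re.fullmatch(r"[0-9]", fullwidth_five) is not None)  # False
```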

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
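
As a rough sketch of the parser-based approach, the following uses Python's standard-library `html.parser.HTMLParser` to re-emit the tags while dropping whitespace-only text. The class name `TagWhitespaceStripper` and the function `strip_whitespace` are hypothetical. The sketch makes simplifying assumptions: comments and doctype declarations are silently dropped, text is trimmed and re-escaped, and `<script>`, `<style>`, and `<pre>` content gets no special treatment.

```python
from html import escape
from html.parser import HTMLParser


class TagWhitespaceStripper(HTMLParser):
    """Collects tags verbatim and text with surrounding whitespace removed."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True, so text arrives unescaped
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared in the input.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are also copied verbatim.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip runs that are only spaces, tabs, or newlines
            self.parts.append(escape(text, quote=False))


def strip_whitespace(markup: str) -> str:
    parser = TagWhitespaceStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


source = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""

print(strip_whitespace(source))
# <html><head><title>Some title</title></head></html>
```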

You may find the HTMLAgilityPack answer helpful.

Community
Chas. Owens
0

A solution with XSLT would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" encoding="UTF-8" indent="no"/>

<!-- copy elements and attributes through unchanged -->
<xsl:template match="*|@*">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<!-- Option 1: trim leading and trailing whitespace from every text node -->
<xsl:template match="text()">
    <!-- remove from tag to content -->
    <xsl:variable name="trimmedHead" select="replace(.,'^\s+','')"/>
    <xsl:variable name="trimmed" select="replace($trimmedHead,'\s+$','')"/>
    <xsl:value-of select="$trimmed"/>
</xsl:template>

<!-- Option 2: drop only text nodes that consist entirely of whitespace -->
<xsl:template match="text()">
    <xsl:if test="not(matches(.,'^\s+$'))">
        <xsl:value-of select="."/>
    </xsl:if>
</xsl:template>

</xsl:stylesheet>

Keep only one of the two text() templates. The first trims leading and trailing whitespace even when the node contains text; the second removes a text node only when it consists entirely of whitespace or newlines.

Philipp
0
Regex.Replace(input, "<[^>]*>", String.Empty);
dankyy1
0

Try this:

s/[^\w\/\d<>]+//gs
Rostyslav Dzinko
user105033
0

s/>\s+</></gs
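
For comparison, here is the same substitution as a minimal Python sketch (the function name `collapse_between_tags` is made up for illustration). It only touches whitespace that sits directly between a `>` and a `<`, so the whitespace around "Some title" is left in place, which the question allows.

```python
import re


def collapse_between_tags(markup: str) -> str:
    """Remove whitespace runs that sit directly between '>' and '<'."""
    return re.sub(r">\s+<", "><", markup)


source = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""

result = collapse_between_tags(source)
print(result)
# The tags are joined, but the newlines and spaces around "Some title" remain.
```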

JSBձոգչ
0

s/\s*(<[^>]+>)\s*/$1/gs

or, in C#:

Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
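
A Python rendering of the same pattern, given only as a sketch (the name `trim_around_tags` is hypothetical), shows both the intended result and the kind of false positive raised in the comments on this answer: a bare `<` followed later by a `>` in ordinary text is treated as a tag.

```python
import re


def trim_around_tags(markup: str) -> str:
    """Remove whitespace on both sides of anything that looks like <...>."""
    return re.sub(r"\s*(<[^>]+>)\s*", r"\1", markup)


source = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""

print(trim_around_tags(source))
# <html><head><title>Some title</title></head></html>

# Plain text containing '<' and '>' is matched as if it were a tag:
print(trim_around_tags("if a < 3 and b > 4"))
# if a< 3 and b >4
```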

ʞɔıu
  • the first character cannot be a space, or a valid HTML string like "if a < 3 and b > 4" would be deleted with your expression – Yann Schwartz Jun 02 '09 at 18:45
  • And you don't match ending tags either. – Yann Schwartz Jun 02 '09 at 18:46
  • Your first point isn't wrong, though. That'll change "if a < 3 and b > 4" to "if a<3 and b>4", which is probably OK if that's script, but probably not desirable if it's, say, the text of an article about using whitespace for readability. – Robert Rossney Jun 02 '09 at 20:07
  • Yeah the <[^>]+> to match all html tag innards has a number of edge cases. There are more complete patterns that could be used instead of that subpattern, but this demonstrates the basic idea. – ʞɔıu Jun 02 '09 at 21:20
0

This removes the whitespace between tags and the space between the tags and the text.

s/(\s*(<))|((>)\s*)/\2\4/g
Bran Handley
-1

I wanted to preserve the newlines, since removing them was breaking my HTML, so I went with the following:

// requires: using System.Text.RegularExpressions;
private static string ProcessHTMLFile(string input)
{
    // Remove runs of double spaces (indentation); newlines are left alone.
    string opt = Regex.Replace(input, @"(  )+", "");
    // Remove tab characters.
    opt = Regex.Replace(opt, @"\t+", "");
    return opt;
}
John Conde
Shash