Using a regular expression to trim HTML

Question

Been trying to solve this for a while now.

I need a regex to strip the newlines, tabs and spaces between the HTML tags, as demonstrated in the example below:

Source:

<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>

Wanted result:

<html><head><title>Some title</title></head></html>

Trimming the whitespace before "Some title" is optional. I'd be grateful for any help.

How do you know what white space to remove? Why are you removing the white space *around* "Some title", but not *in* it? What are your rules here? — Michael Myers, Jun 02 '09 at 17:56
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Samantha Branham, Jun 17 '10 at 06:29

score 20 · Answer 1 · answered Jun 02 '09 at 17:58

If the HTML is strict, load it with an XML reader and write it back without formatting. That will preserve the whitespace within tags, but not between them.
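
As a rough sketch of this idea in Python, assuming the input is well-formed XHTML that the standard library's xml.etree.ElementTree can parse. The helper name strip_between_tags and its trim_text flag are made up for illustration. Whitespace-only text between elements is dropped; text that has content is left alone unless trim_text is set.

```python
import xml.etree.ElementTree as ET


def strip_between_tags(xhtml: str, trim_text: bool = False) -> str:
    """Hypothetical helper: re-serialize XHTML without inter-tag whitespace."""
    root = ET.fromstring(xhtml)  # raises ParseError if the markup is not well-formed
    for elem in root.iter():
        for attr in ("text", "tail"):
            value = getattr(elem, attr)
            if value is None:
                continue
            if not value.strip():
                # Whitespace-only text between tags: drop it.
                setattr(elem, attr, None)
            elif trim_text:
                # Optional: also trim the edges of real text content.
                setattr(elem, attr, value.strip())
    return ET.tostring(root, encoding="unicode")


source = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""

print(strip_between_tags(source, trim_text=True))
# <html><head><title>Some title</title></head></html>
```

With trim_text left at False, the whitespace around "Some title" stays in place and only the whitespace between tags is removed. Real-world HTML that is not well-formed XML will fail to parse here.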

answered Jun 02 '09 at 17:58

Welbog

Not to mention it doesn't reinvent the wheel. – Pesto Jun 02 '09 at 18:00
that might depend on the schema. Preservation of whitespace inside tags is a specific attribute in schema definitions. – Jherico Jun 02 '09 at 18:19
This. Trying to parse xml/html/other CFLs with a regular expression is impossible to do 100% correctly. – Samantha Branham Jun 17 '10 at 06:25

score 1 · Answer 2 · edited May 23 '17 at 10:33

\d does not match only [0-9] in Perl 5.8 and 5.10; it matches any Unicode character that has the digit property (including "\x{1815}" and "\x{FF15}"). If you mean [0-9], you must either write [0-9] or use the bytes pragma (but it turns all strings into sequences of 1-byte characters, which is normally not what you want).
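
The paragraph above is about Perl. As an analogy only, Python 3 behaves in a similar way for str patterns, which makes the point easy to check. The variable names below are illustrative.

```python
import re

fullwidth_five = "\uff15"  # FULLWIDTH DIGIT FIVE
mongolian_five = "\u1815"  # MONGOLIAN DIGIT FIVE

for ch in (fullwidth_five, mongolian_five, "5"):
    unicode_digit = re.fullmatch(r"\d", ch) is not None
    ascii_class = re.fullmatch(r"[0-9]", ch) is not None
    ascii_flag = re.fullmatch(r"\d", ch, flags=re.ASCII) is not None
    print(f"U+{ord(ch):04X}: \\d={unicode_digit} [0-9]={ascii_class} "
          f"\\d with re.ASCII={ascii_flag}")
```

In Python, the re.ASCII flag plays roughly the role of writing [0-9] explicitly; it says nothing about how the Perl bytes pragma behaves.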

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

You may find the HTMLAgilityPack answer helpful.
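
A minimal sketch of the parser-based route, using Python's standard html.parser module rather than the HTML Agility Pack mentioned above. The class and function names are hypothetical. It re-emits the markup as parsed and drops text runs that contain only whitespace; it does not special-case elements such as pre or textarea, where whitespace is significant.

```python
from html.parser import HTMLParser


class InterTagWhitespaceStripper(HTMLParser):
    """Hypothetical helper: re-emit markup, dropping whitespace-only text."""

    def __init__(self, trim_text: bool = False):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.trim_text = trim_text
        self.parts = []   # output fragments
        self._text = []   # text collected since the last tag

    def _flush_text(self):
        text = "".join(self._text)
        self._text.clear()
        if text.strip():  # skip whitespace-only runs between tags
            self.parts.append(text.strip() if self.trim_text else text)

    def handle_starttag(self, tag, attrs):
        self._flush_text()
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._flush_text()
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._flush_text()
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self._text.append(data)

    def handle_entityref(self, name):
        self._text.append(f"&{name};")

    def handle_charref(self, name):
        self._text.append(f"&#{name};")

    def handle_comment(self, data):
        self._flush_text()
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._flush_text()
        self.parts.append(f"<!{decl}>")

    def close(self):
        super().close()
        self._flush_text()


def strip_html(html: str, trim_text: bool = False) -> str:
    parser = InterTagWhitespaceStripper(trim_text)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


source = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""

print(strip_html(source, trim_text=True))
```

Unlike the XML-reader approach, this tolerates markup that is not well-formed XML, at the cost of writing the serialization by hand.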

score 0 · Answer 3 · answered Aug 29 '12 at 19:07

A solution with XSLT would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" encoding="UTF-8" indent="no"/>

<!-- copy elements and attributes as they are -->
<xsl:template match="*|@*">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<!-- trim whitespace from the content -->
<xsl:template match="text()">
    <!-- remove from tag to content -->
    <xsl:variable name="trimmedHead" select="replace(.,'^\s+','')"/>
    <xsl:variable name="trimmed" select="replace($trimmedHead,'\s+$','')"/>
    <xsl:value-of select="$trimmed"/>
</xsl:template>

</xsl:stylesheet>

Alternatively, replace the text() template above with this one:

<!-- do not trim where text content exists -->
<xsl:template match="text()">
    <xsl:if test="not(matches(.,'^\s+$'))">
        <xsl:value-of select="."/>
    </xsl:if>
</xsl:template>

Use only one of the two text() templates. The first trims leading and trailing whitespace even when the text node has content; the second removes only text nodes that consist entirely of whitespace (spaces, tabs or newlines).

score 0 · Answer 4 · answered Jun 17 '10 at 06:18

Regex.Replace(input, "<[^>]*>", String.Empty);
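
For comparison, a Python equivalent of the same pattern (a sketch; the sample string is made up) suggests that it removes the tags themselves and leaves the text, along with any whitespace, behind:

```python
import re

sample = "<p>Hello <b>world</b></p>"
without_tags = re.sub(r"<[^>]*>", "", sample)
print(without_tags)  # Hello world
```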

answered Jun 17 '10 at 06:18

dankyy1

score 0 · Answer 5 · edited Aug 29 '12 at 18:43

Try this:

s/[^\w\/\d<>]+//gs

edited Aug 29 '12 at 18:43

Rostyslav Dzinko

answered Jun 02 '09 at 17:56

user105033

score 0 · Answer 6 · answered Jun 02 '09 at 17:58

s/>\s+</></gs
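
A Python rendering of the same substitution, as a sketch run against the question's sample. It collapses whitespace only where one tag directly follows another, so the padding around "Some title" is left in place:

```python
import re

source = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""

# Equivalent of s/>\s+</></g: whitespace between '>' and '<' is removed.
collapsed = re.sub(r">\s+<", "><", source)
print(collapsed)
```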

answered Jun 02 '09 at 17:58

JSBձոգչ

score 0 · Accepted Answer · ʞɔıu · 2009-06-02T18:04:02.863

s/\s*(<[^>]+>)\s*/\1/gs

or, in C#:

Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
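
The same substitution as a Python sketch, checked against the sample from the question. The function name trim_around_tags is made up for illustration. No single-line flag is needed here because the pattern contains no dot.

```python
import re

# Optional whitespace, a tag-like run "<...>", optional whitespace.
TAG_WITH_PADDING = re.compile(r"\s*(<[^>]+>)\s*")


def trim_around_tags(html: str) -> str:
    """Hypothetical helper: keep each tag, drop the whitespace around it."""
    return TAG_WITH_PADDING.sub(r"\1", html)


source = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""

print(trim_around_tags(source))
# <html><head><title>Some title</title></head></html>

# Text that merely looks like a tag is affected too.
print(trim_around_tags("if a < 3 and b > 4"))
```

The second call shows why this is only safe on input where every "<" really starts a tag.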

edited Jun 02 '09 at 18:04

answered Jun 02 '09 at 17:58

ʞɔıu

the first character cannot be a space, or a valid HTML string like "if a < 3 and b > 4" would be deleted with your expression – Yann Schwartz Jun 02 '09 at 18:45
And you don't match ending tags either. – Yann Schwartz Jun 02 '09 at 18:46
Your first point isn't wrong, though. That'll change "if a < 3 and b > 4" to "if a<3 and b>4", which is probably OK if that's script, but probably not desirable if it's, say, the text of an article about using whitespace for readability. – Robert Rossney Jun 02 '09 at 20:07
Yeah the <[^>]+> to match all html tag innards has a number of edge cases. There are more complete patterns that could be used instead of that subpattern, but this demonstrates the basic idea. – ʞɔıu Jun 02 '09 at 21:20

score 0 · Answer 8 · answered Jun 02 '09 at 19:18

This removes the whitespace between tags and the space between the tags and the text.

s/(\s*(<))|((>)\s*)/\2\4/g

answered Jun 02 '09 at 19:18

Bran Handley

score -1 · Answer 9 · edited Dec 20 '12 at 03:41

I wanted to preserve the newlines, since removing them was messing up my HTML, so I went with the following:

// requires: using System.Text.RegularExpressions;
private static string ProcessHTMLFile(string input)
{
    // Remove every pair of consecutive spaces, then every tab; newlines are left intact.
    string opt = Regex.Replace(input, @"(  )+", "");
    opt = Regex.Replace(opt, @"\t+", "");
    return opt;
}

edited Dec 20 '12 at 03:41

John Conde

answered Jun 14 '10 at 05:00

Shash
