2

I want to replace one class name with another in an HTML string: class="abc" would become class="xyz". I tried regular expressions (I'm using C#) without success:

const string input = @"abc class=""abcd abc zabc ab c"" abc";

Regex regex = new Regex(string.Format(@"class="".*(?({0})).*""", "abc")); // change this line ?!!

string output = regex.Replace(input, "xyz");

Assert.AreEqual(@"abc class=""abcd xyz zabc ab c"" abc", output);

PS: if it matters: this isn't homework :p

Catalin DICU
  • what does `(?(abc))` do? – THX-1138 Oct 10 '11 at 16:06
  • not the best idea for many reasons, the two that apply best to this situation are the ugly expression syntax required to handle different types of quotes and spaces (the `class` attribute may be quoted using `"` _or_ `'` quotes and may or may not have spacing including tabs to mess with regular parsers) and the fact that the string `class='abc'` can appear in all sorts of contexts (plain text, etc) - I think your particular problem can be solved purely with regexes, but will either have false positives or negatives depending upon your exact requirements or take a LOT more work than you think. – Code Jockey Oct 10 '11 at 16:15
  • @user93422 it's supposed to match exactly the part I want to replace – Catalin DICU Oct 10 '11 at 16:22
  • I mean I don't think .NET's regex has a `(?())` construct. There is `(?(expression)yes|no)` conditional matching, and there is `(?<name>...)` named group capture, but no `(?(abc))`. I don't think that's the problem in this case, I am just curious whether it is an expression new to me. – THX-1138 Oct 10 '11 at 17:08

7 Answers

2

No wonder you had no success. Parsing HTML can't be done using regexes.

You should use a proper HTML parser like HTML Agility Pack.

svick
2

Parsing HTML with regular expressions tends to be a futile effort: because most browsers allow a fair amount of leeway for badly formed HTML, you aren't guaranteed consistently formed input that a regular expression could parse easily (as svick also points out).

You are better off using a formal HTML parser (I recommend the HTML Agility Pack): parse the document, change the attribute values, and then output the changed document if need be.
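Once a parser has handed you the value of a class attribute, the remaining step is a plain string operation. The helper below is only a sketch of that step: the names `ClassAttribute` and `ReplaceClass` are made up for illustration, and it assumes the attribute value has already been read through the parser (with HTML Agility Pack that would be something like `node.GetAttributeValue("class", "")`, written back with `node.SetAttributeValue`, which is not shown here).

```csharp
using System;
using System.Linq;

public static class ClassAttribute
{
    // Replaces every exact occurrence of oldName in a class attribute value.
    // Hypothetical helper: the caller is assumed to get classValue from a parser.
    public static string ReplaceClass(string classValue, string oldName, string newName)
    {
        // A null separator array makes Split break on any whitespace character.
        string[] tokens = classValue.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Compare whole tokens, so "abcd" and "zabc" are left untouched.
        return string.Join(" ", tokens.Select(t => t == oldName ? newName : t));
    }
}
```

For example, `ClassAttribute.ReplaceClass("abcd abc zabc ab c", "abc", "xyz")` returns `"abcd xyz zabc ab c"`. Note that runs of whitespace between class names are normalized to single spaces.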

casperOne
  • Even well-formed HTML can't be parsed using regular expressions. HTML isn't a regular language. – svick Oct 10 '11 at 17:58
1

Is it a real HTML string? I mean, are you sure you are dealing with well formed HTML? Could there be some error inside your string?

Based on your answer to the last question, you can choose how to solve your problem.

  • Yep (the string may contain errors): use HTML Agility Pack or something similar to parse your string correctly;
  • Nope (the string is well formed): consider using an XML parser (like the ones built into the .NET assemblies). Make sure, however, that it works well for you (remember XML is not HTML).
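For the XML parser route, a minimal sketch using `System.Xml.Linq` might look like the following. The class and method names (`XmlClassRename`, `Rename`) are invented for this example, and it assumes the input is well-formed XML with a single root element; `XDocument.Parse` throws on anything else, including typical HTML such as unclosed `<br>` tags.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlClassRename
{
    // Renames one class token in every class attribute of a well-formed XML string.
    public static string Rename(string xml, string oldName, string newName)
    {
        // Throws System.Xml.XmlException if the markup is not well-formed XML.
        XDocument doc = XDocument.Parse(xml);

        // Snapshot the attributes first, then rewrite each value.
        foreach (XAttribute attr in doc.Descendants().Attributes("class").ToList())
        {
            var tokens = attr.Value
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                .Select(t => t == oldName ? newName : t);
            attr.Value = string.Join(" ", tokens);
        }

        // DisableFormatting keeps the serializer from adding indentation.
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling `XmlClassRename.Rename("<div class=\"abcd abc zabc\"><p class=\"abc\">abc</p></div>", "abc", "xyz")` yields `<div class="abcd xyz zabc"><p class="xyz">abc</p></div>`: the text node `abc` is untouched because only attribute values are rewritten.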

Whatever you choose, please: NEVER use Regular Expressions to parse HTML.

as-cii
1

I've done a best effort attempt at answering this... a REGEX could be used similar to the following:

@"(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)(?<![\w-])abc(?![\w-])(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)"

broken down a little bit:

(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)  #Make sure it's inside a tag
(?<![\w-])abc(?![\w-])                                #just the class abc (not abcd, etc.)
(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)             #Make sure it's really INSIDE a tag

a little further:

(?<=                           #lookbehind
   <[\w-]+\s+                  # match tag name and whitespace
   ([\w-]+=""[^""]*""\s*)*     # match any attributes coming before the class attribute
   class=""[^""]*              # match the class attribute and any other classes before
)                              #end lookbehind
(?<![\w-])abc(?![\w-])         #"abc" at appropriate boundaries
(?=                            #lookahead
   [^""]*""                    # match any remaining classes in the declaration
   \s*([\w-]+=""[^""]*""\s*)*  # match any remaining attributes in the tag
   /?>                         # match the end of the tag
)                              #end lookahead

This will match the string abc inside any class attribute value that is inside a tag (not in text in between tags), and which might or might not have other attributes before or after it.
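To try the expression, it can be dropped into `Regex.Replace` as is. The wrapper below is just a sketch (the names `TagClassRegexDemo` and `Rename` are invented here); the pattern is the one above, split across three verbatim strings for readability, with `abc` and `xyz` hard-coded.

```csharp
using System.Text.RegularExpressions;

public static class TagClassRegexDemo
{
    // The same pattern as above, concatenated from its three parts.
    private const string Pattern =
        @"(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)" + // lookbehind: inside a class attribute of a tag
        @"(?<![\w-])abc(?![\w-])" +                                // the class name at proper boundaries
        @"(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)";              // lookahead: the tag is closed properly

    public static string Rename(string html)
    {
        return Regex.Replace(html, Pattern, "xyz");
    }
}
```

With this, `TagClassRegexDemo.Rename("<p id=\"x\" class=\"abc-1 abc not-abc\">abc</p>")` gives `<p id="x" class="abc-1 xyz not-abc">abc</p>`. It relies on .NET's support for variable-length lookbehind, so the pattern may not port to other regex engines.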

Attention!

  • IT ONLY HANDLES attribute values in double quotes (")
  • IT ONLY ALLOWS underscores, letters, numbers and dash symbols in the tag and attribute names - you'll need to add colons and periods if you want them (and make it only match names STARTING with a letter if you want it strict)
  • EDIT As discussed in a comment somewhere around here, IT WILL ALSO MATCH abc-1 or not-abc in addition to abc, thus turning <p class="abc-1 abc not-abc">text</p> into <p class="xyz-1 xyz not-xyz">text</p> - because \b will match at the dash character... this gets EXTREMELY HARD TO ACCOUNT FOR!! FOLLOW-UP I added an additional lookahead and lookbehind to hopefully account for the dashes, but who knows... END EDITS

Also, there are bound to be other situations that can break this...

In short - it's probably best not to use this, but instead to use something like HTML Agility Pack - good luck!

Code Jockey
0

I'm not sure of the C# version of this regex, but here's how it would be done in Ruby:

regex = / class="[^"]*"/i

input.gsub( regex, ' class="abc"' )

This replaces every class specifier in the input with class="abc" (gsub replaces all matches; use sub to replace only the first). It assumes no spaces around the equals sign, but matches the attribute name in upper or lower case.

I assume C# is very similar in terms of describing the regex, and you might have to escape the double quotes.
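A rough C# counterpart of the Ruby snippet, as a sketch only: the class and method names are made up, and, like the Ruby version, it overwrites the whole attribute value rather than swapping a single class name.

```csharp
using System.Text.RegularExpressions;

public static class ClassAttributeOverwrite
{
    // Overwrites every double-quoted class attribute with class="abc".
    public static string Overwrite(string input)
    {
        // In a verbatim string, "" stands for one literal double quote.
        return Regex.Replace(
            input,
            @" class=""[^""]*""",
            @" class=""abc""",
            RegexOptions.IgnoreCase);
    }
}
```

`ClassAttributeOverwrite.Overwrite("<p CLASS=\"foo bar\">")` returns `<p class="abc">`.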

Are you looking for something more specific? E.g., a method that takes two inputs (s1 and s2) and replaces class "s1" with class "s2"?

lurker
0

Obviously Regex is unlikely to be your best choice when working with XML. You will probably get more consistent results if you try what the other people suggest. Meanwhile, if you really want a Regex, here it is:

const string input = @"abc class=""abcd abc zabc ab c"" abc"; 

Regex regex = new Regex(string.Format(@"(?<=class\=""[^""]*\b){0}\b", "abc")); // I changed this line ?!! 

string output = regex.Replace(input, "xyz");

Assert.AreEqual(@"abc class=""abcd xyz zabc ab c"" abc", output); 

To break it down:

(               #Start a group
    ?<=         #Positive lookbehind
    class\="    #Some characters to match against (without consuming)
    [^"]*       #Any other characters which are not "
                #This stops us from accidentally leaving the class attribute
)               #Close the lookbehind group
\b              #A word boundary (such as whitespace or just before a ")
abc             #Your target
\b              #Another word boundary

Note the positive lookbehind means that we check for "class=" without it being part of our match. That is what we mean by "without consuming".

Note the use of the word boundaries, \b, so that we don't accidentally match abcd.
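One caveat: `\b` also matches next to a hyphen, so a class such as `abc-1` would become `xyz-1`. A possible stricter variant, sketched below, swaps the word boundaries for lookarounds that require whitespace or a double quote on each side. The names `StrictClassRegex` and `Rename` are made up for this example, and it still only handles double-quoted attribute values.

```csharp
using System.Text.RegularExpressions;

public static class StrictClassRegex
{
    // Replaces oldName only when it is a whole class token inside class="...".
    public static string Rename(string input, string oldName, string newName)
    {
        string pattern = string.Format(
            @"(?<=class=""[^""]*)" + // still inside a double-quoted class attribute
            @"(?<=[\s""]){0}" +      // preceded by whitespace or the opening quote
            @"(?=[\s""])",           // followed by whitespace or the closing quote
            Regex.Escape(oldName));  // escape any regex metacharacters in the name

        // Note: a '$' in newName would be read as a substitution by Regex.Replace.
        return Regex.Replace(input, pattern, newName);
    }
}
```

`StrictClassRegex.Rename("<p class=\"abc-1 abc not-abc\">", "abc", "xyz")` returns `<p class="abc-1 xyz not-abc">`, and the input from the question still produces `abc class="abcd xyz zabc ab c" abc`. Because lookarounds are zero-width, the surrounding space or quote is not consumed by the replacement.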

Buh Buh
  • note \b won't deal with dashes and numbers in class name. e.g. \b will match dash in abc-1. `[ "']` would be safer. – THX-1138 Oct 10 '11 at 17:13
  • I believe the comment in the break-down above (and the explanation) should be "Positive lookbehind" not "Negative lookbehind" - i.e.: you want to ensure that it *can* be matched, not that it *cannot* be matched. – Code Jockey Oct 10 '11 at 17:14
  • @user93422 as long as the class name `abc-1` is literal, that should not make a difference - but just one more reason on the pile of reasons to not recreate the wheel out of sand and try to compress it into sandstone, when there's a perfectly good round block of granite to be carved out perfectly for your wheel. – Code Jockey Oct 10 '11 at 17:17
  • @user Yep, thats true... and so the XML/regex headache begins. Lucky for me this isn't my question. Maybe instead of `\b` we could use `[\s"]` ? – Buh Buh Oct 10 '11 at 17:21
  • @Code fixed the negative/positve comment. Thanks. – Buh Buh Oct 10 '11 at 17:23
  • @Code, I think that @user meant that if the input was: `const string input = @"class=""abc abc-1";` then we would match twice by mistake. – Buh Buh Oct 10 '11 at 17:25
  • By the way, `\b` cannot be directly replaced with `[\s"]` because `\b` is a zero-width assertion, and does not capture anything, while `[\s"]` must capture either whitespace or a double quote. running a replace will consume the quote or space (you would need to either insert it into a lookaround, or MAYBE modify the replace expression instead) – Code Jockey Oct 10 '11 at 21:19
  • @code, yep, another good point. It's getting a bit silly now. – Buh Buh Oct 10 '11 at 21:29
0

Disclaimer:

As others have pointed out, using regex to parse non-regular languages is fraught with peril! It is best to use a dedicated parser specifically designed for the job, especially when parsing the tag soup that is HTML.

That said...

If you insist on using a regular expression, here is a regex solution that will do a pretty good job:

text = Regex.Replace(text, @"
    # Change HTML element class attribute value: 'abc' to: 'xyz'.
    (                   # $1: Everything up to 'abc'.
      <\w+              # Begin (X)HTML element open tag.
      (?:               # Match any attribute(s) preceding 'class'.
        \s+             # Whitespace required before each attribute.
        (?!class\b)     # Assert this attribute name is not 'class'.
        [\w\-.:]+       # Required attribute name.
        (?:             # Begin optional attribute value.
          \s*=\s*       # Attribute value separated by =.
          (?:           # Group for attrib value alternatives.
            ""[^""]*""  # Either a double quoted value,
          | '[^']*'     # or a single quoted value,
          | [\w\-.:]+   # or an unquoted value.
          )             # End group for attrib value alternatives.
        )?              # End optional attribute value.
      )*                # Zero or more attributes may precede class.
      \s+               # Whitespace required before class attribute.
      class             # Literal class attribute name.
      \s*=\s*           # Attribute value separated by =.
      (?:               # Group for attrib value alternatives.
        ""              # Either a double quoted value.
        [^""]*?         # Zero or more classes may precede 'abc'.
      | '               # Or a single quoted value.
        [^']*?          # Zero or more classes may precede 'abc'.
      )?                # Or 'abc' class attrib value is unquoted.
    )                   # End $1: Everything up to 'abc'.
    (?<=['""\s=])       # Assert 'abc' not part of '123-abc'.
    abc                 # Match the 'abc' in class attribute value.
    (?=['""\s>])        # Assert 'abc' not part of 'abc-123'.",
    "$1xyz", RegexOptions.IgnorePatternWhitespace);

Example input:

class=abc ... class="abc" ... class='abc'
class = abc ... class = "abc" ... class = 'abc'
class="123 abc 456" ... class='123 abc 456'
class="123-abc abc 456-abc" ... class='123-abc abc 456-abc'
class="abc-123 abc abc-456" ... class='abc-123 abc abc-456'

Example output:

class=xyz ... class="xyz" ... class='xyz'
class = xyz ... class = "xyz" ... class = 'xyz'
class="123 xyz 456" ... class='123 xyz 456'
class="123-abc xyz 456-abc" ... class='123-abc xyz 456-abc'
class="abc-123 xyz abc-456" ... class='abc-123 xyz abc-456'

Note that there will always be edge cases where this solution will fail. e.g. Evil strings within CDATA sections, comments, scripts, styles and tag attribute values can trip this up. (See disclaimer above.) That said, this solution will do a pretty good job for many cases (but will never be 100% reliable!)

Edit: 2011-10-10 14:00 MDT Streamlined overall answer. Removed first regex solution. Modified to correctly ignore classes having similar names like: abc-123 and 123-abc.

ridgerunner
  • this will also change ` my class = abc ` into ` my class = xyz ` -- if that is desired, then yay! otherwise, it will still need work. I'll grant you the question should be asked more clearly, as it neither requires nor prohibits that normal text be included in the replacement (it's simply an assumption of mine) – Code Jockey Oct 10 '11 at 17:28
  • @Code Jockey - Yes, you are absolutely correct. Note however, that if required, a more complex regex can be crafted to correctly handle the example case you cite. – ridgerunner Oct 10 '11 at 17:45