4

I have a string like this:

This <span class="highlight">is</span> a very "nice" day!

What should my regex pattern in VB look like to find the quotes within the tag? I want to replace them with something else, for example:

This <span class=^highlight^>is</span> a very "nice" day!

Something like <(")[^>]+> doesn't work :(

Thanks

John Koerner
  • Which regular expression engine are you using? – Martin Brown May 13 '09 at 12:32
  • html grammar is not regular grammar use an html parser etc etc etc etc – annakata May 13 '09 at 12:51
  • Hi, I am using the engine provided by vb, so lookbehind is not supported. –  May 13 '09 at 13:29
  • @Moo The language you are using is one of the more important tags since it will prevent people from telling you how to do things you can't do and makes sure that people who know about your environment see your question. I have replaced the pattern tag with vb, please edit it so it reflects the version of vb you are using. – Chas. Owens May 13 '09 at 14:10
  • @Moo VB.Net does support lookbehind; you just use a group starting with ?<=. So (?<=X). matches any character with an X in front of it. – Martin Brown May 13 '09 at 14:21

5 Answers

12

It depends on your regex flavor, but this works for most of them:

"(?=[^<]*>)

EDIT: For anyone curious how this works, it translates into English as "Find a quote that is followed by a > before the next <".
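As a rough illustration of the lookahead, here is a minimal sketch in Python rather than VB (an assumption made for brevity; this particular pattern uses no syntax that differs between Python's `re` module and .NET). The sample string and the `^` replacement are taken from the question; the name `QUOTE_IN_TAG` is made up for the example.

```python
import re

# A double quote that is followed by '>' before any '<' appears,
# i.e. a quote that sits inside a tag.
QUOTE_IN_TAG = re.compile(r'"(?=[^<]*>)')

text = 'This <span class="highlight">is</span> a very "nice" day!'
result = QUOTE_IN_TAG.sub("^", text)
print(result)  # This <span class=^highlight^>is</span> a very "nice" day!
```

The quotes around "nice" are left alone because no `>` follows them before the end of the string.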

Nick Whaley
  • Note that the plain `>` character is allowed in attribute values. – Gumbo May 13 '09 at 14:10
  • @Gumbo Interesting note but the '>' character will not be a problem if it appears in an attribute. The '<' character however will be. – Nick Whaley May 13 '09 at 14:43
  • @Nick The pattern has problems, if the string looks like: This "is" > great! How can we improve it? –  May 14 '09 at 07:27
  • @Moo, the '>' character is not valid between tags. It needs to be escaped as '&gt;'. But if you need to be that picky, you need to get a real HTML parser. – Nick Whaley May 14 '09 at 14:53
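The stray `>` case raised in the comments can be reproduced, and one possible way around it is to match whole tags first and replace quotes only inside each match. The sketch below is in Python (an assumption; in .NET the same idea would presumably go through `Regex.Replace` with a match evaluator). The names `LOOKAHEAD`, `TAG` and `replace_quotes_in_tags` are hypothetical, and the `TAG` pattern still breaks on attribute values that contain `>`.

```python
import re

LOOKAHEAD = re.compile(r'"(?=[^<]*>)')   # pattern from the answer above
TAG = re.compile(r'<[^<>]*>')            # assumed: a tag is '<', no angle brackets, '>'

def replace_quotes_in_tags(text, replacement="^"):
    # Rewrite each matched tag on its own; text between tags is never touched.
    return TAG.sub(lambda m: m.group(0).replace('"', replacement), text)

stray = 'This "is" > great!'
lookahead_result = LOOKAHEAD.sub("^", stray)
callback_result = replace_quotes_in_tags(stray)

print(lookahead_result)  # This ^is^ > great!
print(callback_result)   # This "is" > great!
```

Because a quote is only replaced when it lies between a `<` and the next `>`, an unescaped `>` in the text no longer triggers a replacement.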
2

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

If you are using VB.NET, you should be able to use HTMLAgilityPack.
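To give an idea of what the parser route looks like, here is a minimal sketch using Python's standard-library `html.parser` as a stand-in (an assumption; it is not HTMLAgilityPack and the API differs). The class name `QuoteSwapper` and the helper `swap_quotes` are made up for the example, and comments and declarations in the input are silently dropped to keep it short.

```python
from html.parser import HTMLParser

class QuoteSwapper(HTMLParser):
    """Re-emits the input, swapping '"' for '^' inside tags only."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &gt; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the raw source of the tag just parsed.
        self.out.append(self.get_starttag_text().replace('"', "^"))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are re-emitted from their raw text too.
        self.out.append(self.get_starttag_text().replace('"', "^"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def swap_quotes(html):
    parser = QuoteSwapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(swap_quotes('This <span class="highlight">is</span> a very "nice" day!'))
```

Since the parser decides where a tag starts and ends, quotes in ordinary text are never candidates for replacement.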

Community
Chas. Owens
-1

Try this: <span class="([^"]+?)?">

Dario
-1

This should get you the first attribute value in a tag:

<[^">]+"(?<value>[^"]*)"[^>]*>
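A quick way to try the named group, sketched in Python (an assumption; Python spells a named group `(?P<value>...)` where .NET accepts `(?<value>...)`). Note that this captures the attribute value, not the quotes around it. The name `FIRST_ATTR_VALUE` is made up for the example.

```python
import re

# Same pattern as above, using Python's (?P<name>...) form of the named group.
FIRST_ATTR_VALUE = re.compile(r'<[^">]+"(?P<value>[^"]*)"[^>]*>')

text = 'This <span class="highlight">is</span> a very "nice" day!'
match = FIRST_ATTR_VALUE.search(text)
first_value = match.group("value") if match else None
print(first_value)  # highlight
```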
-1

If your intention is to replace ALL quotation marks within tags, you could use the following regular expression:

(<[^>"]*)(")([^>]*>)

That will isolate the substrings before and after your quotation mark. Note that this does not attempt to match opening and closing quotation marks. It simply matches a quotation mark within a tag.

Krsna
  • Yes, my intention is to replace all quotation marks within tags. Do I have to loop through all submatches then? –  May 13 '09 at 13:31
  • So, are you able to use Regex.Replace? http://msdn.microsoft.com/en-us/library/xwewhkd1.aspx – Krsna May 13 '09 at 13:51
  • Yes, I'd use the replace function. But I don't know how to use it with the pattern. It doesn't find the quotes within a tag. –  May 14 '09 at 09:03
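On the looping question: a single replace pass with this pattern changes only one quote per tag, because the first group stops at the first quote and the third group consumes the rest of the tag. One option is to repeat the replacement until nothing changes. The sketch below is in Python (an assumption; in .NET the replacement string would presumably be written `$1^$3`), and the function name is hypothetical.

```python
import re

# Pattern from the answer: text before the quote, the quote, rest of the tag.
ONE_QUOTE_IN_TAG = re.compile(r'(<[^>"]*)(")([^>]*>)')

def replace_all_quotes_in_tags(text, replacement="^"):
    # The replacement must not itself contain '"', or this loop never ends.
    while True:
        text, count = ONE_QUOTE_IN_TAG.subn(
            lambda m: m.group(1) + replacement + m.group(3), text
        )
        if count == 0:
            return text

sample = 'This <span class="highlight">is</span> a very "nice" day!'
looped = replace_all_quotes_in_tags(sample)
print(looped)  # This <span class=^highlight^>is</span> a very "nice" day!
```

For the sample string this takes two replacing passes, one per quote in the tag, plus a final pass that finds nothing left to change.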