2

I have a custom tag for a Flash object that I want to include in CMS content. When I read the content, I would like to grab each custom tag and the values inside it.

Custom TAG:

<myflash filename="test.swf" width="500" height="400">
  <param name="wmode" value="somevalue"></param>
  <param name="bgcolor" value="#ffffff"></param>
  <var name="id" value="testid"></var>
</myflash>

I need a regular expression that will read this entire block of code from the content. There can be more than one custom tag in a single piece of content.

Can anyone help, please?

Kind regards,

Vipul

annakata
user97586
  • As repeatedly stated on a large number of similar questions, regex is not an appropriate tool for parsing HTML. – annakata Apr 29 '09 at 09:29
  • That depends on the problem, XML structure, size, context etc... The cost of instantiating XML+XPath framework may not be worth it if XML in question is small and performance is the key. You are generally right, but there are always special cases. – majkinetor Apr 29 '09 at 11:03

3 Answers

5

Regex is, IMO, the wrong tool for processing XML. Why not use XmlDocument or XDocument etc? If that is HTML (note no "X"), then the HTML Agility Pack may be useful.

With both XmlDocument and the HTML Agility Pack you can use XPath, so you can simply use .SelectNodes("//myflash"). XDocument has a similar but different method: .Descendants("myflash"). Note that XML element names are case-sensitive, so the name must match the tag exactly.
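As a rough illustration of the XDocument route, here is a minimal sketch. It assumes the content is well-formed XML with a single root element; the `<content>` wrapper, the helper name `FlashFileNames`, and the sample data are made up for this example, and the top-level-statement style needs C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: returns the filename attribute of every <myflash> element.
// Assumes the input is well-formed XML; XDocument.Parse throws otherwise.
static List<string> FlashFileNames(string xml)
{
    XDocument doc = XDocument.Parse(xml);
    return doc.Descendants("myflash")
              .Select(e => (string)e.Attribute("filename"))
              .ToList();
}

// Made-up sample: two custom tags inside an assumed <content> root.
string sample =
    "<content>" +
    "<myflash filename=\"test.swf\" width=\"500\" height=\"400\">" +
    "<param name=\"wmode\" value=\"somevalue\"></param>" +
    "<var name=\"id\" value=\"testid\"></var>" +
    "</myflash>" +
    "<myflash filename=\"other.swf\" width=\"100\" height=\"50\" />" +
    "</content>";

foreach (XElement flash in XDocument.Parse(sample).Descendants("myflash"))
{
    Console.WriteLine((string)flash.Attribute("filename"));
    // Child <param> elements are reachable without any string matching.
    foreach (XElement p in flash.Elements("param"))
    {
        Console.WriteLine("  " + (string)p.Attribute("name") + " = " + (string)p.Attribute("value"));
    }
}
```

If the CMS content is HTML rather than well-formed XML, this parse step would likely throw, which is where an HTML parser becomes the better fit.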

Marc Gravell
  • -1 That isn't the answer ... You provide the answer then eventual notes. Notes without answers are no good. – majkinetor Apr 29 '09 at 11:00
  • @majkinetor - how does .SelectNodes("//myflash") not answer it? It is the work of 2 seconds to discover .InnerXml and .OuterXml, for example. The reason I didn't include this is because the route is different for each of the 3 options, and that choice depends on a: xml vs html (not specified in the question), and b: XmlDocument vs XDocument (which depends on the .NET version, not specified in the question). So go on then: how would you unambiguously answer it? – Marc Gravell Apr 29 '09 at 11:51
  • It's not, because the man asked for RE, not XPath. Instead of speculating about the methods he uses (your advice is sound, that's not the problem), it's better to answer the real question, then offer an alternative (or semantically better) method. – majkinetor Apr 29 '09 at 12:00
  • @majkinetor - right, and if somebody asks for a hammer to put some screws in, do you hand them a hammer? Or do you tell them about screwdrivers? – Marc Gravell Apr 29 '09 at 12:21
  • I give them a hammer and tell them about screwdriver :P – majkinetor Apr 29 '09 at 16:07
3

You can start with a very simple regex:

<myflash[^>]*>(.*?)</myflash>

Just make sure to use the "non-greedy" capture (.*?), so that the ".*" matches as little as possible.

Also, use RegexOptions.Singleline (note the lowercase "l"; C# is case-sensitive), so that the dot matches every character, including \n:

Regex re = new Regex("<myflash[^>]*>(.*?)</myflash>", RegexOptions.Singleline);
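To collect every block when the content holds several custom tags, the pattern can be run through Regex.Matches. The following is only a sketch: the helper name `ExtractFlashBlocks` and the sample string are invented for illustration, IgnoreCase is added on the assumption that tag casing may vary, and the top-level-statement style needs C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every complete <myflash>...</myflash> block.
static List<string> ExtractFlashBlocks(string content)
{
    // Singleline lets '.' match '\n'; IgnoreCase (an assumption) tolerates <MyFlash>.
    var re = new Regex(@"<myflash[^>]*>(.*?)</myflash>",
                       RegexOptions.Singleline | RegexOptions.IgnoreCase);
    var blocks = new List<string>();
    foreach (Match m in re.Matches(content))
    {
        // m.Value is the whole block; m.Groups[1].Value is only the inner part.
        blocks.Add(m.Value);
    }
    return blocks;
}

// Made-up CMS content containing two custom tags.
string sample =
    "<p>intro</p>\n" +
    "<myflash filename=\"a.swf\" width=\"500\" height=\"400\">\n" +
    "  <param name=\"wmode\" value=\"somevalue\"></param>\n" +
    "</myflash>\n" +
    "<p>middle</p>\n" +
    "<myflash filename=\"b.swf\" width=\"100\" height=\"50\"></myflash>\n";

foreach (string block in ExtractFlashBlocks(sample))
{
    Console.WriteLine(block);
    Console.WriteLine("----");
}
```

Because the capture is non-greedy, each match stops at the first closing tag, so this approach would not cope with one myflash tag nested inside another.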
Ferdinand Beyer
  • This expression is not working, maybe because it has tags inside it. – user97586 Apr 29 '09 at 10:10
  • The PARAM tags shouldn't matter. Did you use the SingleLine flag? You might want to use IgnoreCase too, if your tags don't always use lowercase names. If that doesn't work, we would need to see your code, because the regex does exactly what you asked for. – Alan Moore Apr 29 '09 at 11:00
  • @majkinetor, the Multiline flag won't change anything. It allows ^ and $ to match the beginning and end, respectively, of logical lines as well as the beginning and end of the whole string. – Alan Moore Apr 29 '09 at 11:04
  • Ye... the point was actually to see if the dot operator consumes new lines. I don't know why I connected that with Multiline :) – majkinetor Apr 29 '09 at 11:58
  • Note that the `>` is allowed in attribute values. – Gumbo Apr 29 '09 at 13:21
  • The single-/multiline options are not just badly named, they shouldn't exist at all. They're a Perl-historical artifact, and in Perl 6 they've finally been done away with. Who knows how long the rest of us will be stuck with them. :-/ – Alan Moore Apr 30 '09 at 02:41
  • @Gumbo: No it isn't -- it must be encoded as an entity (&gt;, although browsers will tolerate it). – Ferdinand Beyer Apr 30 '09 at 11:06
0

As Marc Gravell says, regexes are not suited to parsing HTML (or XML). See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why. You are much better off using an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples of how to use parsers in many languages (there are at least two examples using C#).

Community
Chas. Owens