1

I am working on something at the moment and need to extract an attribute from a big list of tags. They are formatted like this:

<appid="928" appname="extractapp" supportemail="me@mydomain.com" /><appid="928" appname="extractapp" supportemail="me@mydomain.com" />

The tags are repeated one after another, and each has a different appid, appname, and supportemail.

I need to extract all of the support emails: just the email address, without the supportemail= prefix.

Will I need to use two regex statements, one to separate each individual tag, then loop through the result and pull out the emails?

I would then add the emails to a list, loop through the list, and write each one to a txt file with a comma after it.

I've never really used regex much, so I don't know whether it's suitable for the above.

I would spend more time trying it myself, but it's quite urgent, so hopefully somebody can help.

TimS
Dan Harris
  • Considering this is XML, why not just use the XmlTextReader? - http://msdn.microsoft.com/en-us/library/system.xml.xmltextreader.aspx – Lloyd Mar 10 '11 at 13:27
  • I agree, the XML reader should be your first choice, unless you're REALLY sure that the input is always formatted the way you posted. If it needs to be a regex, one regex that uses groups will be enough (though I can't recite the correct C# syntax for that off the top of my head) – grimmig Mar 10 '11 at 13:35
  • I didn't think of XML, as the tags didn't have a name at the beginning, just a list of attributes. Even if this can't be done with it in its raw format, I already see an easy way around it. Thanks – Dan Harris Mar 10 '11 at 13:44

5 Answers

3

Have you considered LINQ to XML?

http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx

TimS
0

What about modifying the string so that it is well-formed XML, then loading it as XML to extract all the values of the supportemail attribute?
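A minimal sketch of that idea, assuming every tag begins with the literal text `<appid=` exactly as in the question's sample. The element name `app`, the root name `apps`, and the class and method names are made up for illustration; they are not part of the original data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SupportEmailXml
{
    // Assumption: each tag starts with "<appid=", as in the question's sample.
    public static List<string> Extract(string input)
    {
        // Give every tag a (hypothetical) element name and wrap the whole
        // list in a single root so that it parses as one XML document.
        string xml = "<apps>" + input.Replace("<appid=", "<app appid=") + "</apps>";

        return XDocument.Parse(xml)
            .Descendants("app")
            .Select(e => (string)e.Attribute("supportemail")) // null if the attribute is missing
            .Where(email => email != null)
            .ToList();
    }
}
```

Calling `SupportEmailXml.Extract(inputString)` returns the addresses in document order. If the real input deviates from the sample (for example, a tag that starts with a different attribute), the `Replace` step would need adjusting.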

hungryMind
0

Use

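// Requires: using System; using System.Text.RegularExpressions;
// Capture everything between the quotes that follow supportemail=
string pattern = "supportemail=\"([^\"]+)";
MatchCollection matches = Regex.Matches(inputString, pattern);
foreach (Match m in matches)
    Console.WriteLine(m.Groups[1].Value); // group 1 is the address itself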

See it here.
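To connect this to the comma-separated text file the question asks for, here is a rough sketch that collects the captured addresses and joins them. The class and method names are hypothetical, and it assumes double-quoted attribute values as in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SupportEmailRegex
{
    // Returns all supportemail values joined by commas.
    public static string ToCommaSeparated(string input)
    {
        var emails = new List<string>();
        foreach (Match m in Regex.Matches(input, "supportemail=\"([^\"]+)"))
        {
            emails.Add(m.Groups[1].Value); // group 1 holds the address without the quotes
        }
        return string.Join(",", emails);
    }
}
```

The result could then be written out with something like `System.IO.File.WriteAllText("emails.txt", SupportEmailRegex.ToCommaSeparated(inputString));`, where `emails.txt` is a placeholder path. Note that `string.Join` puts commas between the items only; if a trailing comma after the last address is required, it has to be appended separately.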

Aliostad
0

Using XML is better, perhaps, but here's the regular expression you'd use (in case there's a particular reason you need/want to use regular expressions to read XML):

(appid="(?<AppID>[^"]+)" appname="(?<AppName>[^"]+)" supportemail="(?<SupportEmail>[^"]+)")

You can take just the last part if you only want the support email, but the full pattern extracts all of the attributes you mentioned, and they will be "grouped" within each tag.
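As a sketch of how those named groups might be read from C#, assuming the attributes always appear in this exact order with double quotes and single spaces between them (the class and method names are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AppTagRegex
{
    // Same pattern as above, with the quotes escaped for a C# string literal.
    private static readonly Regex TagPattern = new Regex(
        "appid=\"(?<AppID>[^\"]+)\" appname=\"(?<AppName>[^\"]+)\" supportemail=\"(?<SupportEmail>[^\"]+)\"");

    // Returns one "id|name|email" line per matched tag.
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        foreach (Match m in TagPattern.Matches(input))
        {
            lines.Add(m.Groups["AppID"].Value + "|" +
                      m.Groups["AppName"].Value + "|" +
                      m.Groups["SupportEmail"].Value);
        }
        return lines;
    }
}
```

If only the address is needed, `m.Groups["SupportEmail"].Value` on its own is enough.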

Josh M.
0

Problems you'll encounter by using regular expressions instead of an XML DOM:

  1. All of the example regexes posted thus far will fail in the extremely common case that the attribute values are delimited by single quotes.

  2. Any regex that depends on the attributes appearing in a specific order (e.g. appId before appName) will fail in the event that attributes - whose ordering is insignificant to XML - appear in an order different from what the regex expects.

  3. A DOM will resolve entity references for you and a regex will not; if you use regex, you must check the returned values for (at least) the XML character entities &amp;, &apos;, &gt;, &lt;, and &quot;.

  4. There's a well-known edge case where using regular expressions to parse XML and XHTML unleashes the Great Old Ones. This will complicate your task considerably, as you will be reduced to gibbering madness and then the Earth will be eaten.
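Points 1 to 3 can be illustrated with a small sketch that uses LINQ to XML as the parser. It assumes each tag has been given an element name first (the name `app` here is hypothetical, since the tags in the question have none), and the class and method names are likewise made up.

```csharp
using System;
using System.Xml.Linq;

public static class DomDemo
{
    // Parses a single well-formed element and returns its supportemail value,
    // or null if the attribute is absent.
    public static string SupportEmailOf(string tag)
    {
        XElement element = XElement.Parse(tag);
        return (string)element.Attribute("supportemail");
    }
}
```

For an input such as `<app supportemail='a&amp;b@example.com' appid="1" appname="x" />`, the parser copes with the single quotes and the reordered attributes, and returns the value with the entity already resolved (`a&b@example.com`).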

Community
Robert Rossney