Is it possible to use Regex to extract text from attributes repeated in a text file - c# .NET

Question

I am working on something at the moment and need to extract an attribute from a big list of tags. They are formatted like this:

<appid="928" appname="extractapp" supportemail="me@mydomain.com" /><appid="928" appname="extractapp" supportemail="me@mydomain.com" />

The tags are repeated one after another and all have different appid, appname, supportemail.

I need to extract all of the support emails: just the email address, without the supportemail= prefix.

Will I need to use two regex statements, one to separate each individual tag, and then loop through the result and pull out the emails?

I would then add the emails to a list, loop through the list, and write each one to a txt file with a comma after it.

I've never really used regex much, so I don't know whether it's suitable for the above.

I would spend more time trying it myself but it's quite urgent. So hopefully somebody can help.

Considering this is XML, why not just use the XmlTextReader? - http://msdn.microsoft.com/en-us/library/system.xml.xmltextreader.aspx — Lloyd, Mar 10 '11 at 13:27
I agree, the XML reader should be your first choice, unless you're REALLY sure that the input is always formatted the way you posted. If it needs to be regexp, one regexp that uses groups will be enough (though I can't recite the correct c# syntax for that off the top of my head) — grimmig, Mar 10 '11 at 13:35
I didn't think of XML as the tags didn't have a name at the beginning, just a list of attributes. Even if this can't be done with it in its raw format, I already see an easy way around it. Thanks — Dan Harris, Mar 10 '11 at 13:44

score 3 · Answer 1 · answered Mar 10 '11 at 13:30

Have you considered Linq to XML?

http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx

answered Mar 10 '11 at 13:30

TimS

score 0 · Answer 2 · answered Mar 10 '11 at 13:29

What about modifying the string so that it is well-formed XML, then loading it as XML to extract all the values of the supportemail attribute?
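A minimal sketch of that approach, assuming every tag starts with `<appid=` exactly as in the question. The element name `app`, the wrapper element `apps`, and the class and method names are made up for illustration; the raw input has no element name, so one has to be inserted before an XML parser will accept it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SupportEmailXml
{
    // Hypothetical helper: repairs the nameless tags, then reads the attribute via LINQ to XML.
    public static List<string> Extract(string raw)
    {
        // "<appid=..." is not well-formed XML, so give each tag a (made-up) element
        // name and wrap everything in a single (made-up) root element.
        string xml = "<apps>" + raw.Replace("<appid=", "<app appid=") + "</apps>";

        return XDocument.Parse(xml)
            .Descendants("app")
            .Select(e => (string)e.Attribute("supportemail")) // null if the attribute is missing
            .Where(email => email != null)
            .ToList();
    }

    public static void Main()
    {
        string raw =
            "<appid=\"928\" appname=\"extractapp\" supportemail=\"me@mydomain.com\" />" +
            "<appid=\"929\" appname=\"otherapp\" supportemail=\"you@mydomain.com\" />";

        // Comma-separated, as the question asks for.
        Console.WriteLine(string.Join(",", Extract(raw)));
    }
}
```

The string replacement is the fragile part: it only holds while `appid` is always the first attribute, which is an assumption taken from the sample in the question.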

answered Mar 10 '11 at 13:29

hungryMind

score 0 · Answer 3 · answered Mar 10 '11 at 13:32

Use

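using System;
using System.Text.RegularExpressions;

// Capture everything after supportemail=" up to (not including) the next quote.
string pattern = "supportemail=\"([^\"]+)";
foreach (Match m in Regex.Matches(inputString, pattern))
    Console.WriteLine(m.Groups[1].Value); // group 1 is the address itself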

See it here.

answered Mar 10 '11 at 13:32

Aliostad

score 0 · Accepted Answer · answered Mar 10 '11 at 13:32

Using XML is better, perhaps, but here's the regular expression you'd use (in case there's a particular reason you need/want to use regular expressions to read XML):

(appid="(?<AppID>[^"]+)" appname="(?<AppName>[^"]+)" supportemail="(?<SupportEmail>[^"]+)")

You can take just the last part for the support email, but the full pattern extracts all of the attributes you mentioned, and they will be "grouped" within each tag.
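As a rough illustration of how the named groups might be read in C#, here is a sketch that applies the pattern above with `Regex.Matches`. The class and method names are hypothetical, and the sample input is made up in the shape the question shows.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AppTagRegex
{
    // The pattern from the answer, as a verbatim string ("" is an escaped double quote).
    private static readonly Regex TagPattern = new Regex(
        @"(appid=""(?<AppID>[^""]+)"" appname=""(?<AppName>[^""]+)"" supportemail=""(?<SupportEmail>[^""]+)"")");

    public static List<string> SupportEmails(string input)
    {
        var emails = new List<string>();
        foreach (Match m in TagPattern.Matches(input))
        {
            // Named groups are read by name; AppID and AppName are available the same way.
            emails.Add(m.Groups["SupportEmail"].Value);
        }
        return emails;
    }

    public static void Main()
    {
        string input =
            "<appid=\"928\" appname=\"extractapp\" supportemail=\"me@mydomain.com\" />" +
            "<appid=\"929\" appname=\"otherapp\" supportemail=\"you@mydomain.com\" />";

        Console.WriteLine(string.Join(",", SupportEmails(input)));
    }
}
```

A single pass is enough here: each match corresponds to one tag, so there is no need for a separate regex to split the tags first.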

score 0 · Answer 5 · edited May 23 '17 at 12:04

Problems you'll encounter by using regular expressions instead of an XML DOM:

All of the example regexes posted thus far will fail in the extremely common case that the attribute values are delimited by single quotes.
Any regex that depends on the attributes appearing in a specific order (e.g. appId before appName) will fail in the event that attributes - whose ordering is insignificant to XML - appear in an order different from what the regex expects.
A DOM will resolve entity references for you and a regex will not; if you use regex, you must check the returned values for (at least) the XML character entities &amp;, &apos;, &gt;, &lt;, and &quot;.
There's a well-known edge case where using regular expressions to parse XML and XHTML unleashes the Great Old Ones. This will complicate your task considerably, as you will be reduced to gibbering madness and then the Earth will be eaten.
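The first three points can be seen in a small sketch. It assumes the tag has already been given an element name (`app` here is made up, since the question's tags have none), and the sample address containing `&amp;` is invented purely to show entity resolution.

```csharp
using System;
using System.Xml.Linq;

public static class ParserVsRegex
{
    // Hypothetical helper: reads supportemail from one well-formed tag.
    public static string SupportEmail(string tag)
    {
        // The XML parser accepts either quote style and any attribute order,
        // and resolves the predefined entities inside attribute values.
        return (string)XElement.Parse(tag).Attribute("supportemail");
    }

    public static void Main()
    {
        // Made-up sample: single quotes, reordered attributes, and an &amp; entity.
        string tag = "<app supportemail='r&amp;d@mydomain.com' appname='extractapp' appid='928' />";

        Console.WriteLine(SupportEmail(tag)); // prints r&d@mydomain.com
    }
}
```

The regexes posted in the other answers expect double quotes, so none of them would match this input, and the accepted answer's pattern would additionally fail on the attribute order.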
