Parsing XML Out of the Middle of a String

Question

I am working with .NET and I want to do some string manipulations like this:

Input:

hi hello <bbb name='ahhahdch'>MR.JKROY</bbb>.how are you.Let's meet
<bbb name='bbcbc'>SUSANNE</bbb>. Our team lead     is <bbb name='cdcdcd'>JACK</bbb>, from .net.

Output:

hi hello MR.JKROY.how are you.Let's meet. Our team lead is JACK , from .net.

In a nutshell, I want to remove the XML tags (including attributes) and to retrieve the value of the tag.

casperOne · Answer 1 · 2012-03-15T15:31:17.953

Your input is not a well-formed XML document, because it has no single root element. If most (or all) of your input looks like this, you can wrap the content in a dummy root element so that the parser will not fail (assuming the content is valid as the content of another XML element), like so:

<root>
hi hello <bbb name='ahhahdch'>MR.JKROY</bbb>.how are you.Let's meet
<bbb name='bbcbc'>SUSANNE</bbb>. Our team lead     is <bbb name='cdcdcd'>JACK</bbb>, from .net.
</root>

Once you have a valid XML document, you can then use the XmlDocument class to parse the content and then get the text with the elements removed using the InnerText property:

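// Requires: using System; using System.Xml;
// `input` is assumed to hold the original string from the question.
string xml = "<root>" + input + "</root>";

var doc = new XmlDocument();
doc.LoadXml(xml);

// Gives you only the text.
Console.WriteLine(doc.InnerText);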

Or use the XDocument class and then get the text from the Value property on the XElement exposed by the Root property on XDocument:

// Requires: using System.Xml.Linq;
XDocument doc = XDocument.Parse(xml);

// Gives you only the text.
Console.WriteLine(doc.Root.Value);
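
Putting the steps above together, here is a minimal sketch of the wrap-and-parse approach. The `TagStripper` class and `StripTags` method are hypothetical names used only for illustration, and the whitespace collapsing at the end is an assumption based on the expected output in the question (the parser itself keeps line breaks and runs of spaces as they are).

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, not part of the framework.
public static class TagStripper
{
    // Wraps the fragment in a dummy <root> element, parses it,
    // and returns the concatenated text of all nodes.
    public static string StripTags(string fragment)
    {
        // PreserveWhitespace keeps whitespace-only text between elements.
        XElement root = XElement.Parse(
            "<root>" + fragment + "</root>",
            LoadOptions.PreserveWhitespace);

        // Assumption: runs of whitespace (including newlines) should
        // collapse to a single space, as in the question's sample output.
        return Regex.Replace(root.Value, @"\s+", " ").Trim();
    }
}
```

Calling `TagStripper.StripTags(input)` on the string from the question should keep the text of every element, including SUSANNE, which the sample output in the question leaves out. If the fragment is not well-formed once wrapped (for example, an unescaped `&`), `XElement.Parse` throws an `XmlException`.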

edited Mar 15 '12 at 15:31

answered Mar 15 '12 at 03:13

casperOne


This will not work as this is not a valid XML document in his example. He needs a removal algorithm in this case – Justin Pihony Mar 15 '12 at 03:18
@Justin I've updated my answer to reflect that; it's a simple fix, just wrap the content in dummy tags to ensure parsing succeeds. – casperOne Mar 15 '12 at 03:26
It still would not work as there is plain text in random places throughout the string. The regex that Tats suggested is the cleanest solution in this case. – Justin Pihony Mar 15 '12 at 03:33
@Justin the random text is fine; everything in the example string, if wrapped in a set of tags, is a valid XML document. – casperOne Mar 15 '12 at 03:41
@Justin which parts of the string would present a problem for an XML parser if the contents were wrapped in a set of tags? – casperOne Mar 15 '12 at 03:58
Hrmm, you learn something new every day. I can't say that would be pretty XML, but it does appear that it passes a syntax check. I guess I just had never seen it and thus assumed plain text could not be riddled throughout other tags... The overhead of loading this into a doc and parsing it still seems unnecessary for this specific question – Justin Pihony Mar 15 '12 at 04:30
His example is not well-formed, therefore he cannot parse it with an XML parser. My answer (which, unbelievably, got down-voted) achieves what he asked for, which is to strip the markup tags from the string. I know it works because I actually took the time to test it. – Peter Mar 16 '12 at 07:35
@Peter The well-formed aspect is addressed in my answer. Additionally, HTML markup and XML are *fundamentally different*, which is probably the reason for the downvotes. – casperOne Mar 17 '12 at 04:19
Hey, that's fine, no one's down-voting your answer. The question doesn't ask how to parse XML, it asks how to get rid of it. I'm just a bit annoyed that my answer answered the question exactly and I get admonished for it. Your answer changes the question and no one bats an eyelid. I'll get over it though :) – Peter Mar 17 '12 at 04:23

score 0 · Accepted Answer · edited May 23 '17 at 11:56

Hiya, if it's HTML tag removal only, then use this:

string result = Regex.Replace(htmlText, @"<(.|\n)*?>", string.Empty);

If you are getting an XML feed and you can create the string using LINQ, there is a good answer here: remove tags from a xml file written to a string?

How can I strip HTML tags from a string in ASP.NET?

Cheers

edited May 23 '17 at 11:56

Community


answered Mar 15 '12 at 03:17

Tats_innit


Parsing XML or HTML with regular expressions is one of the worst approaches you can take given the prevalence of dependable parsers. It's just not worth the trouble when a solution is in the framework (XDocument and XmlDocument for XML) or easily obtained (HtmlAgilityPack for HTML through NuGet). – casperOne Mar 15 '12 at 03:31
Hiya Casper - yes - but in this case it seems like he only has 1 (one) line in his file. Yep, I do agree with the performance hit with regex; will remove my answer, mate. Again, this solution is a cowboy solution to deal with one line, not for parsing a whole HTML file. Cheers. – Tats_innit Mar 15 '12 at 03:34
You don't have to remove your answer; it's just that the problem with regex for valid XML is really not the performance but getting all valid XML tag representations correct. [It's trickier than you think](http://stackoverflow.com/a/1732454/50776) – casperOne Mar 15 '12 at 03:47
@casperOne +1 for the link mate! Legend! – Tats_innit Mar 15 '12 at 03:50
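
To illustrate the point in the comments above about valid tag forms that a regex can miss, here is a small sketch comparing the regex from this answer with a parser. The `TagRemovalComparison` class and its method names are hypothetical; the test input relies on the fact that XML allows a literal `>` inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical comparison helper, for illustration only.
public static class TagRemovalComparison
{
    // The regex from the answer above: removes anything between < and >.
    public static string StripWithRegex(string fragment)
    {
        return Regex.Replace(fragment, @"<(.|\n)*?>", string.Empty);
    }

    // Wraps the fragment in a dummy root element and lets the parser
    // decide where each tag starts and ends.
    public static string StripWithParser(string fragment)
    {
        return XElement.Parse("<root>" + fragment + "</root>").Value;
    }
}
```

For the fragment `<bbb name='a>b'>X</bbb>`, the regex stops at the first `>` (the one inside the attribute value) and leaves `b'>X` behind, while the parser returns `X`. For input as simple as the question's, both approaches remove the tags.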

score -3 · Answer 3 · answered Mar 15 '12 at 03:52

Using the HTML Agility Pack (http://htmlagilitypack.codeplex.com/) can make this kind of thing much easier. You can query elements using XPath syntax.

You can get it through NuGet, but the project download from the CodePlex site has an example of a utility class that converts HTML to text.

answered Mar 15 '12 at 03:52

Peter


He's asking for XML, not HTML. – casperOne Mar 15 '12 at 03:59
Well, he doesn't have a valid XML document, so given that he needs to parse garbage, my advice is solid. Also, the example I mentioned in the Agility Pack is still a useful bit of code for addressing his problem. – Peter Mar 16 '12 at 07:26
