Regex, trying to get text within XML tag

Question

I have a tag and I am trying to get the actual text from it.

Here is an example of the tag (all of them are formatted the same way):

<description>
&lt;div class=&quot;field field-name-field-body-small field-type-text-long field-label-hidden&quot;&gt;The evolution of your League of Legends match history is now live!
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-article-media field-type-file field-label-hidden&quot;&gt;
&lt;div id=&quot;file-13180&quot; class=&quot;file file-image file-image-jpeg&quot;&gt;
&lt;img typeof=&quot;foaf:Image&quot; src=&quot;/sites/default/files/styles/large/public/upload/mh_640x360.jpg?itok=z_Nn84Op&quot; width=&quot;480&quot; height=&quot;270&quot; alt=&quot;&quot; title=&quot;&quot; /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-custom-author field-type-entityreference field-label-hidden&quot;&gt;
&lt;span class=&quot;article_detail&quot;&gt;&lt;span class=&quot;posted_by&quot;&gt;By Riot MattEnth&lt;/span&gt;
&lt;/span&gt;&lt;/div&gt;
</description>

I want the text at the end of the first line (far right of the code snippet), which in this example is:

The evolution of your League of Legends match history is now live!

Is there a simple way to do this with my following code? Right now it returns that entire string of junk.

XDocument xmlFile = XDocument.Load(@"http://na.leagueoflegends.com/en/rss.xml");
var LoLdescriptions = from service in xmlFile.Descendants("item")
                     select (string)service.Element("description");
ViewBag.descriptions = LoLdescriptions.ToArray();

...moving into View...

@ViewBag.descriptions[0]

If this is not hard, is there also a way to get the last line as well? In this case: By Riot MattEnth

Thank you!

XML code for reference: http://na.leagueoflegends.com/en/rss.xml

Might I suggest using an actual XML parser for this? Depending on the language, there are probably already quite a few very good ones — Chrispresso, Jul 02 '14 at 21:01
I have looked into this, but I have no idea where I would insert it into my current code. I need to use XDocument to load it, and I only care about the first three items. I have tried tacking random things on, but I usually get compile errors as I have never used a parser before :/ — Austin, Jul 02 '14 at 21:03
http://stackoverflow.com/questions/122641/how-can-i-decode-html-characters-in-c — mplungjan, Jul 02 '14 at 21:15
You really don't want to do this with regular expressions. I suggest you remove regex from the title and from the tags. — Jim Mischel, Jul 02 '14 at 21:32
The content of that `<description>` node is HTML. As a result, you need to use an HTML parser (the common suggestion is HtmlAgilityPack) to read that section and select the nodes you like. If you want to stick with regex, this question is indeed a duplicate of the all-time favorite regex question linked earlier. — Alexei Levenkov, Jul 02 '14 at 21:42
Yea I was able to fix it by taking out my query, then decoding the first three items, and then regex'ing all HTML tags. Only 2 lines of code :) Thanks for the help everyone! — Austin, Jul 03 '14 at 12:11

score 1 · Accepted Answer · answered Jul 02 '14 at 21:10

I don't know which language this is, but it seems to me that you need to read the file first and convert all HTML entities. Then you can pass real XML/HTML to your parser as a string.

Don't use regex. Try to get some XPath-able tree from which you can select the element content (i.e. the text).
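
Since the question's code is C#, the approach above could look roughly like the sketch below. It is only an illustration under some assumptions: the description string still holds entity-escaped markup as shown in the question, the decoded fragment happens to be well-formed XML (true for the example, not guaranteed for every feed item), and the names DescriptionParser, Summary and Author are made up for this sketch. For markup that is not well-formed, an HTML parser such as HtmlAgilityPack (suggested in the comments) would be the safer choice.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

// Hypothetical helper, not part of the original question or answer.
public static class DescriptionParser
{
    // Decodes the escaped markup and wraps it in a single root element,
    // because the fragment contains several top-level <div> elements.
    // XElement.Parse throws XmlException if the fragment is not well-formed.
    private static XElement ParseFragment(string description)
    {
        string decoded = WebUtility.HtmlDecode(description);
        return XElement.Parse("<root>" + decoded + "</root>");
    }

    // Text of the first <div>: the summary line in the question's example.
    public static string Summary(string description)
    {
        XElement firstDiv = ParseFragment(description).Elements("div").FirstOrDefault();
        return firstDiv == null ? string.Empty : firstDiv.Value.Trim();
    }

    // Text of the <span class="posted_by"> element: the author line.
    public static string Author(string description)
    {
        XElement span = ParseFragment(description)
            .Descendants("span")
            .FirstOrDefault(s => (string)s.Attribute("class") == "posted_by");
        return span == null ? string.Empty : span.Value.Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        // Shortened version of the <description> content from the question.
        string sample =
            "&lt;div class=&quot;field&quot;&gt;The evolution of your League of Legends match history is now live!\n" +
            "&lt;/div&gt;\n" +
            "&lt;div class=&quot;field&quot;&gt;\n" +
            "&lt;span class=&quot;article_detail&quot;&gt;&lt;span class=&quot;posted_by&quot;&gt;By Riot MattEnth&lt;/span&gt;\n" +
            "&lt;/span&gt;&lt;/div&gt;";

        Console.WriteLine(DescriptionParser.Summary(sample));
        Console.WriteLine(DescriptionParser.Author(sample));
    }
}
```

Selecting by element and class name, rather than by position in the string, keeps the lookup stable if the feed adds or reorders blocks.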

feeela


Your concept helped me find a similar solution! I just stuffed the `<description>` tags into an array and then used the following parser. `servicing[i] = Regex.Replace(Server.HtmlDecode(servicing[i]), @"<[^>]*>", String.Empty);` Thanks for the hint! – Austin Jul 08 '14 at 13:25

zx81 · Answer 2 · 2014-07-02T22:16:53.243

0

Interesting format! You can use this:

(?<=<description>.*[\r\n]*.*?&quot;&gt;).*

On the demo, you'll have to scroll to the right to see the match.

Explanation

The lookbehind (?<=<description>.*[\r\n]*.*?&quot;&gt;) asserts that what precedes the current position is <description>, then any characters to the end of that line (.*), then newline characters ([\r\n]*), then as few characters as possible up to the first &quot;&gt; (.*?&quot;&gt;).
The final .* matches everything from there to the end of the line.

In C#, you can retrieve the match like this:

var myRegex = new Regex(@"(?<=<description>.*[\r\n]*.*?&quot;&gt;).*");
string resultString = myRegex.Match(yourString).Value;

edited Jul 02 '14 at 22:16

answered Jul 02 '14 at 22:11

zx81


Not sure, probably because I shouldn't be using regex as everyone keeps mentioning haha. Thank you for the reply, I did happen to find a very simple fix as well. I just grabbed the first 3 values that I wanted, decoded them, then parsed out all HTML tags. `var servicing = LoLdescriptions.ToArray(); for (int i = 0; i < 3; i++) { servicing[i] = Regex.Replace(Server.HtmlDecode(servicing[i]), @"<[^>]*>", String.Empty); }` – Austin Jul 03 '14 at 12:10
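
The decode-then-strip approach from this comment leaves a string with many blank lines, and the question also asked for the last line. A minimal sketch of one way to get both is shown here. It assumes the input looks like the example in the question, and it uses System.Net.WebUtility.HtmlDecode in place of Server.HtmlDecode so that it runs outside ASP.NET. The names DescriptionText and TextLines are invented for this sketch, and the tag-stripping regex is only as reliable as the markup is simple.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original thread.
public static class DescriptionText
{
    // Decodes HTML entities, removes anything that looks like a tag,
    // and returns the remaining non-empty lines, trimmed.
    public static string[] TextLines(string description)
    {
        string decoded = WebUtility.HtmlDecode(description);
        string stripped = Regex.Replace(decoded, @"<[^>]*>", string.Empty);
        return stripped
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)
            .ToArray();
    }
}

public static class StripDemo
{
    public static void Main()
    {
        // Shortened version of the <description> content from the question.
        string sample =
            "&lt;div class=&quot;field&quot;&gt;The evolution of your League of Legends match history is now live!\n" +
            "&lt;/div&gt;\n" +
            "&lt;div class=&quot;field&quot;&gt;\n" +
            "&lt;span class=&quot;posted_by&quot;&gt;By Riot MattEnth&lt;/span&gt;\n" +
            "&lt;/div&gt;";

        string[] lines = DescriptionText.TextLines(sample);
        if (lines.Length > 0)
        {
            Console.WriteLine(lines.First()); // summary line
            Console.WriteLine(lines.Last());  // author line
        }
    }
}
```

Taking the first and last non-empty lines depends on the feed keeping the summary first and the author last, which holds for the example but is an assumption about the other items.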
