0

I need to extract some data from an HTML document because the web service that will return JSON or XML isn't ready yet. I'm working in C# and using regular expressions to pull the data I need out of the HTML string. I've managed to isolate the div I want from the full HTML, but now I'm having trouble getting the content of the first span tag. I've tried capturing everything between ; and the first closing span tag, but what I really want is the content of the first span tag.

Here's the regular expression I've written so far, but it's not working:

".*;(?<Content>(\r|\n|.)*)</span>"

I also tried this, but it didn't work either:

"<span class=""type"">(?<Content>(\r|\n|.)*)</span>"

Here is the div I want to retrieve the data from:

<div class="main">ABASASDFÓ 18/06/2014 17:38h&nbsp; Blabla Balbal&nbsp; <span class="type">15.80&#8364;&nbsp; </span>+1.94 % +0.30&#8364; &nbsp;|&nbsp;HOME <SPAN class="type2">11,398.70</span>&nbsp; +0.65 % +74.10</div>

EDIT: I can't use HtmlAgilityPack because my client does not want us to use any external library. I've also heard about using XmlReader, but I'm not sure the structure of the HTML will be valid as XML.

John Saunders
ERed
  • Regexes and HTML? [Hm](http://stackoverflow.com/a/1732454/1016716). Anyway, your parentheses don't match. – Mr Lister May 14 '15 at 17:03
  • Lol, sorry mate. I'm a little bit sleepy. I updated my regex but the question stays the same -.- Still not working :( – ERed May 14 '15 at 17:08
  • Unlike forum sites, we don't use "Thanks", or "Any help appreciated", or signatures on Stack Overflow. See "[Should 'Hi', 'thanks,' taglines, and salutations be removed from posts?](http://meta.stackexchange.com/questions/2950/should-hi-thanks-taglines-and-salutations-be-removed-from-posts)". BTW, it's "Thanks in advance", not "Thanks in advanced". – John Saunders May 15 '15 at 01:25

3 Answers

1

You want to use XPath for that. Something like this:

div/span/text()

I understand not wanting an external third-party library in your solution. One way around that is to fetch the source code of the entire library:
https://htmlagilitypack.codeplex.com/
Now you don't have an external library, you have an internal library, and you can use the right tool for the job!

XmlReader is a fairly low-level tool. It could technically do the job, but what you're really after is "use XmlReader to do XPath", which is discussed here: https://msdn.microsoft.com/en-us/library/ms950778.aspx

The XPathReader class is the result of all that, which has been superseded by LINQ to XML: https://msdn.microsoft.com/en-ca/library/bb387098.aspx

So another option here is to try to use some LINQ to process your HTML file, but that might be tricky since HTML isn't good XML. Still, it's another option if you're looking for those.
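For what it's worth, here is a rough sketch of what the LINQ to XML route could look like for the div in the question. It only works if the fragment is first patched into well-formed XML, so treat it as an illustration of the "HTML isn't good XML" problem rather than a general solution. The names DivParser, ToXmlFriendly and FirstTypeSpan are made up for this example, and the two fix-ups are assumptions based on that one sample div (mixed-case tag names and the &nbsp; entity); other pages will probably need more.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class DivParser
{
    // Hypothetical helper: patches the two things that keep the sample
    // div from being well-formed XML. Real-world HTML may need more.
    static string ToXmlFriendly(string html)
    {
        // XML is case-sensitive, so <SPAN ...> has to match </span>.
        string lowered = Regex.Replace(
            html,
            @"<(/?)([A-Za-z][A-Za-z0-9]*)",
            m => "<" + m.Groups[1].Value + m.Groups[2].Value.ToLowerInvariant());

        // &nbsp; is not a predefined XML entity; numeric references
        // such as &#8364; are valid XML and can stay.
        return lowered.Replace("&nbsp;", "&#160;");
    }

    // Returns the trimmed text of the first <span class="type"> directly
    // under the root element, or null if there is none.
    public static string FirstTypeSpan(string html)
    {
        XElement div = XElement.Parse(ToXmlFriendly(html));
        XElement span = div.Elements("span")
            .FirstOrDefault(e => (string)e.Attribute("class") == "type");
        return span == null ? null : span.Value.Trim();
    }

    static void Main()
    {
        string div = "<div class=\"main\">ABASASDFÓ 18/06/2014 17:38h&nbsp; Blabla Balbal&nbsp; "
            + "<span class=\"type\">15.80&#8364;&nbsp; </span>+1.94 % +0.30&#8364; &nbsp;|&nbsp;HOME "
            + "<SPAN class=\"type2\">11,398.70</span>&nbsp; +0.65 % +74.10</div>";
        Console.WriteLine(FirstTypeSpan(div));
    }
}
```

XElement.Parse throws an XmlException on anything that is still not well-formed, so in practice you would want a try/catch around it.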

Task
  • Thanks for the answer! It might be a possible solution. Anyway, do you think my piece of html can be parsed with XmlReader? – ERed May 14 '15 at 18:36
  • I'm sure it can, but I've added my full thoughts on the matter to my answer. – Task May 14 '15 at 18:46
  • Thank you so much! I'm gonna check it rn! Had not read your edit. – ERed May 14 '15 at 18:49
  • You're quite welcome. I expect that LINQ to XML is going to be a powerful tool in your arsenal if you can use it. I know I'll be looking for opportunities to use it myself. – Task May 14 '15 at 18:57
1

Here's how it is done with a regex in JavaScript. You should be able to adapt this for C# pretty easily.

var inner = html.match( /<span class="type"(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/span>/i)[1];

Fiddle: http://jsfiddle.net/GarryPas/uk32r8vz/
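A possible C# adaptation is sketched below. It is not a one-to-one port: the capture group is made lazy (*? instead of *), because with the greedy version the match runs on to the last closing span in the div instead of stopping at the first one. The class and method names (SpanRegexPort, FirstTypeSpan) are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class SpanRegexPort
{
    // Same pattern as the JavaScript version, written as a verbatim string
    // ("" is an escaped quote). The only change is the lazy *? in the
    // capture group, so the match ends at the first </span>.
    const string Pattern =
        @"<span class=""type""(?:\s+[a-z]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[^\s>]+)))*\s*>([\S\s]*?)</span>";

    // Returns the raw inner HTML of the first <span class="type">, or null.
    public static string FirstTypeSpan(string html)
    {
        Match m = Regex.Match(html, Pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        string div = "<div class=\"main\">ABASASDFÓ 18/06/2014 17:38h&nbsp; Blabla Balbal&nbsp; "
            + "<span class=\"type\">15.80&#8364;&nbsp; </span>+1.94 % +0.30&#8364; &nbsp;|&nbsp;HOME "
            + "<SPAN class=\"type2\">11,398.70</span>&nbsp; +0.65 % +74.10</div>";
        // The entities are returned undecoded: 15.80&#8364;&nbsp;
        Console.WriteLine(FirstTypeSpan(div));
    }
}
```

RegexOptions.IgnoreCase plays the role of the /i flag, which matters here because the second span in the sample is written as <SPAN ...>.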

garryp
  • Thanks for the answer! But what you are achieving here is to remove every HTML tag, which I find pretty useful, but I actually wanted to target the first span tag specifically. – ERed May 14 '15 at 18:44
  • Any thoughts on the XmlReader @garryp? – ERed May 14 '15 at 18:45
  • It's cleaner, and probably more robust so long as you are sure you will be working with valid HTML. To be honest, I'd be trying to persuade your client that it would be better for them to allow you to use a mature library like HtmlAgilityPack than asking you to reinvent wheels. Rolling your own is going to be painful whichever solution you go for. – garryp May 14 '15 at 18:50
1

This regex will capture the string:

"<span class=\"type\">(?<Content>([^<]*))</span>"

That said, I agree with the other answers: you should use something like XPath instead of regexes for parsing HTML.
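In case it helps, this is roughly how that pattern could be used from C#. It is only a sketch: the class and method names (SpanExtractor, FirstTypeSpan) are made up, and decoding the entities with WebUtility.HtmlDecode and trimming the result are extra steps I am assuming you want, not something the regex does by itself.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class SpanExtractor
{
    // [^<]* stops at the first '<', so the capture cannot run past
    // the first closing span tag.
    const string Pattern = "<span class=\"type\">(?<Content>[^<]*)</span>";

    // Returns the decoded, trimmed text of the first <span class="type">,
    // or null when nothing matches.
    public static string FirstTypeSpan(string html)
    {
        Match m = Regex.Match(html, Pattern, RegexOptions.IgnoreCase);
        if (!m.Success)
        {
            return null;
        }

        // Raw capture for the sample div: "15.80&#8364;&nbsp; "
        string raw = m.Groups["Content"].Value;

        // Turn &#8364; and &nbsp; into real characters, then strip the
        // surrounding whitespace (Trim also removes non-breaking spaces).
        return WebUtility.HtmlDecode(raw).Trim();
    }

    static void Main()
    {
        string div = "<div class=\"main\">ABASASDFÓ 18/06/2014 17:38h&nbsp; Blabla Balbal&nbsp; "
            + "<span class=\"type\">15.80&#8364;&nbsp; </span>+1.94 % +0.30&#8364; &nbsp;|&nbsp;HOME "
            + "<SPAN class=\"type2\">11,398.70</span>&nbsp; +0.65 % +74.10</div>";
        Console.WriteLine(FirstTypeSpan(div));
    }
}
```

Note that the pattern expects the attribute to be written exactly as class="type" with nothing else inside the tag, so it will break if the markup changes even slightly.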

Håkan Fahlstedt
  • Can you give me some clue about XPath? Maybe a link so I can check it and not bother you anymore :D – ERed May 14 '15 at 18:47
  • Here's a link to start with. In your case the expression should be something like this: "//div/span[@class='type']". Here's an example of how to use it in .NET: http://www.codeproject.com/Articles/9494/Manipulate-XML-data-with-XPath-and-XmlDocument-C – Håkan Fahlstedt May 14 '15 at 18:57