6

I have the HTML code of a webpage in a text file. I'd like my program to return the value that is in a tag. E.g. I want to get "Julius" out of

<span class="hidden first">Julius</span>

Do I need a regular expression for this? If not, which string function can do it?

disasterkid
  • You do not want regex. HTML is too complex for regex parsing. Here is the infamous answer on that point: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – gbtimmon Nov 05 '12 at 14:45
  • Also, what do you actually want? Assuming you don't just want "Julius" returned every time, do you want all text between tags? All text between tags that have a class of "first"? – Fishcake Nov 05 '12 at 14:46

4 Answers

13

You should use an HTML parser such as the Html Agility Pack. Regex is not a good choice for parsing HTML files, because HTML is neither strict nor regular in its format.

You can use the code below to retrieve the value with HtmlAgilityPack:

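// Requires the HtmlAgilityPack package, plus these usings:
// using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// This XPath selects every span whose class attribute is exactly "hidden first".
// SelectNodes returns null when nothing matches, so check before using LINQ.
var nodes = doc.DocumentNode.SelectNodes("//span[@class='hidden first']");
var itemList = nodes == null
    ? new List<string>()
    : nodes.Select(p => p.InnerText).ToList();

// itemList now contains the text of every matching span.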
Anirudha
  • No, the C#/.NET regex engine is certainly capable of matching non-[REGULAR](http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html#comment_40) expressions. But you are correct that other tools are a better choice for parsing HTML. – ridgerunner Nov 05 '12 at 16:37
  • @ridgerunner u r right..i guess `.net` has the best `regex` engine..:D – Anirudha Nov 05 '12 at 16:38
7

I would use the Html Agility Pack to parse the HTML in C#.

carla
Pablo Santa Cruz
2

I'd strongly recommend you look into something like the HTML Agility Pack.

wp78de
KingCronus
1

I asked the same question a few days ago and ended up using the HTML Agility Pack, but here are the regular expressions that you want.

This one ignores the attributes:

<span[^>]*>(.*?)</span>

This one matches only spans whose tag starts with the given class attribute:

<span class="hidden first"[^>]*>(.*?)</span>
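
For illustration only, here is a rough sketch of how the second pattern might be applied in C# with System.Text.RegularExpressions. The class name SpanTextExtractor and the method Extract are hypothetical names, not part of the original answer. The Singleline and IgnoreCase options and the HtmlDecode call are also assumptions added for the sketch. It inherits the pattern's limits: it will misbehave on nested spans and misses tags whose attributes appear in a different order.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names are not from the original answer.
public static class SpanTextExtractor
{
    // Same pattern as above. Singleline lets "." match newlines inside the span
    // (an assumption; the original answer does not specify any options).
    private static readonly Regex HiddenFirstSpan = new Regex(
        "<span class=\"hidden first\"[^>]*>(.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the decoded inner text of every matching span, in document order.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match match in HiddenFirstSpan.Matches(html))
        {
            // Group 1 is the lazy (.*?) capture between the opening and closing tags.
            results.Add(WebUtility.HtmlDecode(match.Groups[1].Value));
        }
        return results;
    }
}
```

Calling SpanTextExtractor.Extract on the contents of the text file would give a list holding "Julius" for the span in the question, and an empty list when nothing matches.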
user1570048