I am doing a very simple task: parsing a website, looking for

<tbody>this is what important for me</tbody>

and returning its contents, but I just cannot make it work. When I do:

Regex.Matches(webData, @"<tbody>(.*?)</tbody>")

it gives me no results. This, however, gives me 2 results:

Regex.Matches(webData, @"tbody")

but again, this

Regex.Matches(webData, @"tbody(.*?)tbody")

gives me nothing either (so I assume escaping is not the problem). I found out about (.*?) on this page and assumed it would be easy to use, but I just cannot work it out.

Andrius Naruševičius

3 Answers

Using regex to parse HTML is not recommended.

Regex is meant for regularly occurring patterns. HTML is not regular in its format (except XHTML). For example, an HTML file can be valid even when a closing tag is missing, and that could break your code.
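
As an aside, a likely reason the pattern from the question returns nothing is that, by default, `.` in a .NET regex does not match a newline, so `(.*?)` cannot span a tbody that is spread over several lines. The sketch below illustrates this under the assumption that the page's tbody is multi-line; the sample markup and the names `TbodyRegexDemo` and `ExtractTbodies` are hypothetical, since the real `webData` is not shown. It is only meant to explain the symptom, not to replace a proper parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TbodyRegexDemo
{
    // Hypothetical helper: returns the raw text between each <tbody>...</tbody> pair.
    // RegexOptions.Singleline makes '.' match every character, including '\n'.
    public static List<string> ExtractTbodies(string html)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<tbody>(.*?)</tbody>", RegexOptions.Singleline))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Assumed stand-in for webData: a tbody whose content spans several lines.
        string webData = "<table>\n<tbody>\n<tr><td>cell</td></tr>\n</tbody>\n</table>";

        // Default options: '.' stops at '\n', so the lazy group never reaches </tbody>.
        int withoutOption = Regex.Matches(webData, @"<tbody>(.*?)</tbody>").Count;

        List<string> withOption = ExtractTbodies(webData);

        Console.WriteLine(withoutOption);        // 0
        Console.WriteLine(withOption.Count);     // 1
        Console.WriteLine(withOption[0].Trim()); // <tr><td>cell</td></tr>
    }
}
```

Even with this option the pattern still breaks on attributes such as `<tbody class="x">`, on differently cased tags, or on nested tables, which is why a parser remains the more robust choice.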

Use an HTML parser such as HtmlAgilityPack.

You can use this code to retrieve the content of every tbody with HtmlAgilityPack:

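// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// HtmlAgilityPack lowercases element names, so the XPath must use "tbody".
// Note: SelectNodes returns null when nothing matches.
var tbodyList = doc.DocumentNode.SelectNodes("//tbody")
                   .Select(p => p.InnerText)
                   .ToList();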

tbodyList then contains the inner text of every tbody element in the document.

Anirudha

To parse a web page, use a real HTML parser such as HtmlAgilityPack:

string html = "<tbody>this is what important for me</tbody>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

var text = doc.DocumentNode.Descendants("tbody").First().InnerText;
I4V

I recommend HtmlAgilityPack too.

You can also use XPath (http://www.w3schools.com/xpath/).

Building on I4V's example:

var text = doc.DocumentNode.SelectSingleNode("//tbody").InnerText;
briba