I have a webpage. If I look at the "view-source" of the page, I find multiple instances of the following markup:

<td class="my_class" itemprop="main_item">statement 1</td>
<td class="my_class" itemprop="main_item">statement 2</td>
<td class="my_class" itemprop="main_item">statement 3</td>

I want to extract data like this:

statement 1
statement 2
statement 3

To accomplish this, I have written a method "GetContent" which takes a URL as its parameter and copies the entire source of the webpage into a C# string.

private string GetContent(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Timeout = 100000; // milliseconds

    // Dispose the response and the reader so the connection is released.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader respStream = new StreamReader(response.GetResponseStream()))
    {
        return respStream.ReadToEnd();
    }
}

Now I want to create a method "GetMyList" which will extract the list I want. I am looking for a regex that can serve this purpose. Any help is highly appreciated.

Meraqp
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – l'L'l Aug 29 '18 at 03:57
  • I did not properly understand the accepted answer mentioned in your link. – Meraqp Aug 29 '18 at 04:00
  • The short version of the duplicate answer is that you should use an XML parser for this task because Regex is not sophisticated enough to understand all the constructs of HTML. – Rufus L Aug 29 '18 at 04:04

2 Answers

3

Using the Html Agility Pack, this is really easy:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// To select the text of every <td> instead, use "//td//text()".
var nodes = doc.DocumentNode.SelectNodes("//td[@itemprop=\"main_item\"]//text()");
var list = new List<string>();
// SelectNodes returns null when nothing matches.
if (nodes != null)
{
    foreach (var m in nodes)
    {
        list.Add(m.InnerText);
    }
}

But if you want a regex, try this:

string regularExpressionPattern1 = @"<td.*?>(.*?)<\/td>";
Regex regex = new Regex(regularExpressionPattern1, RegexOptions.Singleline);
MatchCollection collection = regex.Matches(html);
var list = new List<string>();
foreach (Match m in collection)
{
    list.Add(m.Groups[1].Value);
}
Hossein
  • Thanks for the answer. I think this will pick all the `<td>` elements from the string. I want only those having the attribute itemprop="main_item". – Meraqp Aug 29 '18 at 04:07
  • @Meraqp So use the Html Agility Pack, like `var nodes = doc.DocumentNode.SelectNodes("//td[@itemprop=\"main_item\"]//text()");` – Hossein Aug 29 '18 at 04:14
1

Hossein's answer is pretty much the solution (and I would recommend using a parser if you have the option), but a regular expression with non-capturing groups `(?:...)` would give you the extracted data (statement 1, statement 2, and so on) as you need it:

IEnumerable<string> GetMyList(string str)
{
    foreach(Match m in Regex.Matches(str, @"(?:<td.*?>)(.*?)(?:<\/td>)"))
        yield return m.Groups[1].Value;
}

See the explanation at regex101 for a more detailed description.
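Since the question asks only for cells carrying itemprop="main_item", here is a minimal sketch of a regex restricted to that attribute. It is not from either answer above: the class name `MainItemExtractor` and the sample HTML are made up for illustration, and the pattern assumes the attribute value is double-quoted and that the cells contain no nested `<td>`. A parser remains the more robust choice for real-world markup.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for illustration.
public static class MainItemExtractor
{
    // Assumption: attribute values are double-quoted and cells are not nested.
    // [^>]* keeps the attribute search inside a single opening <td ...> tag.
    private static readonly Regex MainItemCell = new Regex(
        @"<td\b[^>]*\bitemprop\s*=\s*""main_item""[^>]*>(.*?)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> GetMyList(string html)
    {
        var list = new List<string>();
        foreach (Match m in MainItemCell.Matches(html))
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            list.Add(WebUtility.HtmlDecode(m.Groups[1].Value).Trim());
        }
        return list;
    }

    public static void Main()
    {
        // Sample input: the middle cell lacks the attribute and is skipped.
        string html =
            "<td class=\"my_class\" itemprop=\"main_item\">statement 1</td>" +
            "<td class=\"other\">skip me</td>" +
            "<td class=\"my_class\" itemprop=\"main_item\">statement 2</td>";

        foreach (string s in GetMyList(html))
        {
            Console.WriteLine(s);
        }
    }
}
```

With the page source from GetContent, the call would be `MainItemExtractor.GetMyList(GetContent(url))`. Cells that contain further markup would come back with their inner tags intact, which is one more reason to prefer the XPath approach when it is available.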

dontbyteme