c# substring - parse all text in between

Question

I am trying to parse text (mainly the URL) from the HTML below, but I would only like to grab the URL between the div tags (result-firstline-title) and (result-url js-result-url), for every occurrence.

To be clear, I am able to grab all the URLs from the HTML source below, but the problem is that each URL is grabbed almost three times. I have a fix for that, which removes duplicate URLs. However, if you look carefully at the HTML source, you will see that it also grabs the third URL.

<div class="result js-result card-mobile ">
<div class="result-firstline-container">
    <div class="result-firstline-title">
        <a
            class="result-title js-result-title"

            href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"

        >
            The Top Social Networking Sites People Are Using
        </a>
    </div>

</div>

<a
    class="result-url js-result-url"

    href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554">https://www.lifewire.com/top-<b>social-networking-sites</b>-people-are...
</a>
<p class="result-snippet">
    The Top
</p>
</div>

<div class="result js-result card-mobile ">
    <div class="result-firstline-container">
        <div class="result-firstline-title">
            <a
                class="result-title js-result-title"

                href="http://www.ebizmba.com/articles/social-networking-websites"

            >
                Top 15 Most Popular Social Networking Sites | January 2019
            </a>
        </div>

    </div>

    <a
        class="result-url js-result-url"

        href="http://www.ebizmba.com/articles/social-networking-websites">www.ebizmba.com/articles/<b>social-networking</b>-<b>websites</b>
    </a>
    <p class="result-snippet">
        Top 15 Most 
    </p>

</div>

I have tried the following C# code to grab the text between the div tags, but it grabs everything, which I don't want.

        int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;
        int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");
        urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);

To grab the URLs I am using the following:

var regexURLParser = new Regex(@"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", RegexOptions.Singleline | RegexOptions.CultureInvariant);

What I want is to grab the URL from these:

        <a
            class="result-title js-result-title"

            href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"

        >

        <a
            class="result-title js-result-title"

            href="http://www.ebizmba.com/articles/social-networking-websites"

        >

So that the output shows only:

https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554
http://www.ebizmba.com/articles/social-networking-websites

For the sake of humanity, [you might want to consider using a proper XML parser instead](https://stackoverflow.com/a/1732454/5623232). — NPras, Apr 24 '19 at 06:00
@sLw HTML is a variety of XML, therefore an XML parser is likely to simplify the solution to this problem. — dumetrulo, Apr 24 '19 at 08:26
If we're being pedantic, technically HTML isn't strict XML ([unlike XHTML, which is valid XML](https://stackoverflow.com/questions/5558502/is-html5-valid-xml)). But that example seems like a properly formed XML fragment, so a simple XML parser is likely to work. An HTML parser is likely heavier, and you'd need to rely on third-party code. — NPras, Apr 25 '19 at 23:28
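
Following up on the XML-parser suggestion in the comments, here is a minimal sketch, assuming the fragment really is well-formed XML as the sample appears to be. It uses `System.Xml.Linq` from the base class library, so no third-party package is needed. The names `TitleLinkExtractor`, `ExtractTitleUrls` and `HasClasses` are made up for illustration, and the `<root>` wrapper is only there because the fragment has more than one top-level element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TitleLinkExtractor
{
    // Hypothetical helper: returns the href of every <a> whose class list
    // contains both "result-title" and "js-result-title".
    public static List<string> ExtractTitleUrls(string fragment)
    {
        // Wrap the fragment in a single root element so that several
        // top-level <div> elements still form one well-formed document.
        var root = XElement.Parse("<root>" + fragment + "</root>");

        return root.Descendants("a")
            .Where(a => HasClasses(a, "result-title", "js-result-title"))
            .Select(a => (string)a.Attribute("href"))
            .Where(href => !string.IsNullOrEmpty(href))
            .ToList();
    }

    // True when the element's class attribute lists every required class.
    private static bool HasClasses(XElement element, params string[] required)
    {
        var value = (string)element.Attribute("class") ?? "";
        var classes = value.Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);
        return required.All(r => classes.Contains(r));
    }
}
```

Calling `TitleLinkExtractor.ExtractTitleUrls(rawHTMLFromSource)` would then return one href per title link and skip the `result-url js-result-url` anchors. On real-world pages `XElement.Parse` throws an `XmlException` as soon as the markup stops being well-formed (an unclosed `<br>`, a bare `&` in a URL, an entity such as `&nbsp;`), which is where an HTML parser becomes the safer choice.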

Umair Anwaar · Accepted Answer · 2019-04-25T08:11:00.267

You can make this easier by using HtmlAgilityPack; just include it in your project using NuGet.

To add HtmlAgilityPack using NuGet, go to the Package Manager Console and type:

Install-Package HtmlAgilityPack -Version 1.11.3

After the installation you can extract the URLs as shown below.

using System;
using System.Collections.Generic;

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");

var listOfUrls = new List<string>();

// SelectNodes returns null (not an empty collection) when nothing matches,
// so guard against that before iterating.
var anchors = doc.DocumentNode.SelectNodes("//a");
if (anchors != null)
{
    foreach (var x in anchors)
    {
        // Use the HasClass method to keep only the title links.
        var href = x.GetAttributeValue("href", "");
        if (!string.IsNullOrEmpty(href)
            && x.HasClass("result-title") && x.HasClass("js-result-title"))
        {
            listOfUrls.Add(href);
        }
    }
}

listOfUrls.ForEach(x => Console.WriteLine(x));

EDIT

Added `&& x.HasClass("result-title") && x.HasClass("js-result-title")` to show only those elements that have both the result-title and js-result-title classes.

Another way

A shorter way to get the filtered values:

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");

var listOfUrls = doc.DocumentNode.Descendants("a")
    .Where(x => x.Attributes["class"] != null 
                && x.Attributes["class"].Value == "result-title js-result-title")
    .Select(x => x.GetAttributeValue("href", "")).ToList();

Well, this method also grabs the 2nd URL, since it is in an href attribute. I only need the URL after this (class="result-title js-result-title") tag. — Buddhihin, Apr 24 '19 at 23:03
@Buddhihin to filter element with class you can use `HasClass` method. Please see Edit. — Umair Anwaar, Apr 25 '19 at 07:35
Thanks, I tried your 2nd method and it works. Is doing it this way faster than regex? Also, are there other ways to achieve it? — Buddhihin, Apr 27 '19 at 20:12
Yes, there are other ways to do the same. Regex can be faster, but it is not suitable for this. If you found this helpful, then accept the answer so others can benefit. — Umair Anwaar, Apr 29 '19 at 06:50
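
As one of those other ways, the substring attempt from the question can be made to work per result. The original pairs the first `IndexOf` hit with the `LastIndexOf` hit, so the single `Substring` spans every result at once. A minimal sketch of the alternative, assuming the two marker strings appear once per result as they do in the sample; the names `SubstringScanner`, `Between` and `TitleUrls` are hypothetical:

```csharp
using System;
using System.Collections.Generic;

public static class SubstringScanner
{
    // Hypothetical helper: for every occurrence of startMarker, returns the
    // text up to the next endMarker that follows it.
    public static List<string> Between(string source, string startMarker, string endMarker)
    {
        if (string.IsNullOrEmpty(startMarker) || string.IsNullOrEmpty(endMarker))
            throw new ArgumentException("Markers must be non-empty.");

        var results = new List<string>();
        int searchFrom = 0;

        while (true)
        {
            int start = source.IndexOf(startMarker, searchFrom, StringComparison.Ordinal);
            if (start < 0) break;
            start += startMarker.Length;

            // Search for the end marker after this start, instead of taking
            // the last one in the whole string as LastIndexOf does.
            int end = source.IndexOf(endMarker, start, StringComparison.Ordinal);
            if (end < 0) break;

            results.Add(source.Substring(start, end - start));
            searchFrom = end + endMarker.Length;
        }

        return results;
    }

    // Cuts out one chunk per result, then takes the first href in each chunk.
    public static List<string> TitleUrls(string html)
    {
        var urls = new List<string>();
        foreach (var chunk in Between(html, "result-firstline-title", "result-url js-result-url"))
        {
            var hrefs = Between(chunk, "href=\"", "\"");
            if (hrefs.Count > 0) urls.Add(hrefs[0]);
        }
        return urls;
    }
}
```

`SubstringScanner.TitleUrls(rawHTMLFromSource)` would then give one URL per result without needing a de-duplication pass. This stays brittle: it depends on the exact marker text, the attribute order and double-quoted attributes, so a parser-based approach remains the more robust option.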
