I am trying to parse text (mainly the URLs) out of the HTML below, but I only want the URL that sits between the div with class result-firstline-title and the anchor with class result-url js-result-url, for every occurrence.

To be clear, I can already grab all the URLs from the HTML source below; the problem is that each URL is captured almost three times. I have a fix that removes the duplicate URLs; however, if you look carefully at the HTML source, you will see that it also grabs the third URL.

<div class="result js-result card-mobile ">
<div class="result-firstline-container">
    <div class="result-firstline-title">
        <a
            class="result-title js-result-title"

            href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"

        >
            The Top Social Networking Sites People Are Using
        </a>
    </div>

</div>

<a
    class="result-url js-result-url"

    href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554">https://www.lifewire.com/top-<b>social-networking-sites</b>-people-are...
</a>
<p class="result-snippet">
    The Top
</p>
</div>

<div class="result js-result card-mobile ">
    <div class="result-firstline-container">
        <div class="result-firstline-title">
            <a
                class="result-title js-result-title"

                href="http://www.ebizmba.com/articles/social-networking-websites"

            >
                Top 15 Most Popular Social Networking Sites | January 2019
            </a>
        </div>

    </div>

    <a
        class="result-url js-result-url"

        href="http://www.ebizmba.com/articles/social-networking-websites">www.ebizmba.com/articles/<b>social-networking</b>-<b>websites</b>
    </a>
    <p class="result-snippet">
        Top 15 Most 
    </p>

</div>     

I have tried the following C# code to grab the text between the div tags, but it grabs everything, which I don't want.

        int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;
        int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");
        urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);

To grab the URLs I am using the following:

var regexURLParser = new Regex(@"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?", RegexOptions.Singleline | RegexOptions.CultureInvariant);

What I want to grab is the URL from these:

        <a
            class="result-title js-result-title"

            href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"

        >

        <a
            class="result-title js-result-title"

            href="http://www.ebizmba.com/articles/social-networking-websites"

        >

So that the output shows only:

https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554
http://www.ebizmba.com/articles/social-networking-websites 
Buddhihin
  • For the sake of humanity, [you might want to consider using a proper XML parser instead](https://stackoverflow.com/a/1732454/5623232). – NPras Apr 24 '19 at 06:00
  • Make your life a lot simpler, and use AngleSharp. – Ian Kemp Apr 24 '19 at 06:32
  • @NPras why XML? Am I missing something? – slow Apr 24 '19 at 06:38
  • @sLw HTML is a variety of XML, therefore an XML parser is likely to simplify the solution to this problem. – dumetrulo Apr 24 '19 at 08:26
  • If we're being pedantic, technically HTML isn't strict XML ([unlike XHTML, which is valid XML](https://stackoverflow.com/questions/5558502/is-html5-valid-xml)). But that example seems like a properly formed XML fragment, so a simple XML parser is likely to work. A HTML parser is likely heavier, and you'd need to rely on third-party code. – NPras Apr 25 '19 at 23:28

1 Answer

You can make this much easier by using HtmlAgilityPack; just add it to your project using NuGet.

To add HtmlAgilityPack using NuGet:

Go to the Package Manager Console and type Install-Package HtmlAgilityPack -Version 1.11.3

After the installation you can extract the URLs like below.

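using System;
using System.Collections.Generic;
using System.Linq;

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");

var listOfUrls = new List<string>();

// SelectNodes returns null when nothing matches, so fall back to an empty sequence.
var anchors = doc.DocumentNode.SelectNodes("//a")
              ?? Enumerable.Empty<HtmlAgilityPack.HtmlNode>();

foreach (var x in anchors)
{
    var href = x.GetAttributeValue("href", "");

    // Use the HasClass method to keep only the title anchors.
    if (!string.IsNullOrEmpty(href)
        && x.HasClass("result-title") && x.HasClass("js-result-title"))
    {
        listOfUrls.Add(href);
    }
}

listOfUrls.ForEach(x => Console.WriteLine(x));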
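
The same filtering can probably be pushed into the XPath expression itself, which matches the original wish to take only the link inside the result-firstline-title div. The following is a minimal sketch, not a definitive implementation: it assumes the HtmlAgilityPack package is referenced, that the project allows C# top-level statements, and that the div's class attribute is exactly result-firstline-title. The sample HTML and the example.com URLs are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical, shortened sample modelled on the markup in the question.
var html = @"<div class=""result-firstline-title"">
  <a class=""result-title js-result-title"" href=""https://example.com/a"">A</a>
</div>
<a class=""result-url js-result-url"" href=""https://example.com/a"">example.com/a</a>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Select only anchors with an href that are direct children of the title div.
var nodes = doc.DocumentNode.SelectNodes("//div[@class='result-firstline-title']/a[@href]");

// SelectNodes returns null when nothing matches, so guard before projecting.
var urls = (nodes ?? Enumerable.Empty<HtmlNode>())
    .Select(a => a.GetAttributeValue("href", ""))
    .ToList();

urls.ForEach(Console.WriteLine);
```

Because the second anchor (result-url js-result-url) is not inside the title div, it is never selected, so no class check is needed afterwards.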

EDIT

Added && x.HasClass("result-title") && x.HasClass("js-result-title") so that only elements that have both the result-title and js-result-title classes are returned.

Another way

A shorter alternative that returns the filtered values directly.

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");

var listOfUrls = doc.DocumentNode.Descendants("a")
    .Where(x => x.Attributes["class"] != null 
                && x.Attributes["class"].Value == "result-title js-result-title")
    .Select(x => x.GetAttributeValue("href", "")).ToList();
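
Note that comparing the whole class attribute to "result-title js-result-title" only matches when the classes appear in exactly that order. A hedged variant, sketched below, combines Descendants with HasClass so the order does not matter, and adds Distinct to drop repeated URLs, which the question mentions as a problem. It assumes the HtmlAgilityPack package is referenced and that C# top-level statements are available; the sample HTML and example.com URLs are invented for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical sample: two title anchors (classes in different order) and one result-url anchor.
var html = @"<a class=""js-result-title result-title"" href=""https://example.com/a"">A</a>
<a class=""result-title js-result-title"" href=""https://example.com/a"">A again</a>
<a class=""result-url js-result-url"" href=""https://example.com/b"">example.com/b</a>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var urls = doc.DocumentNode.Descendants("a")
    // HasClass checks individual class names, independent of their order.
    .Where(a => a.HasClass("result-title") && a.HasClass("js-result-title"))
    .Select(a => a.GetAttributeValue("href", ""))
    .Where(href => !string.IsNullOrEmpty(href))
    // Keep each URL once, in first-seen order.
    .Distinct()
    .ToList();

urls.ForEach(Console.WriteLine);
```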
Umair Anwaar
  • well, this method also grabs the 2nd url since it is in a href attribute. i only need the url after this (class="result-title js-result-title") tag. – Buddhihin Apr 24 '19 at 23:03
  • @Buddhihin to filter element with class you can use `HasClass` method. Please see Edit. – Umair Anwaar Apr 25 '19 at 07:35
  • thanks, i tried your 2nd method, it works. is doing this way faster than regex? also, are there other ways to achieve it? – Buddhihin Apr 27 '19 at 20:12
  • Yes, there are other way to do the same, Regex can be a faster but not suitable for this. if you found this helpful then accept the answer so other can take benefits. – Umair Anwaar Apr 29 '19 at 06:50