trying to parse all text (mainly the url) from the html code below. but i would only like to grab the url between these div tags (result-firstline-title) and (result-url js-result-url) for each(all) occurrences.
to be clear, i am able to grab all the url from the html source below, but the problem is it is also grabbing the url almost 3 times. and for that i have a fix which to remove duplicate urls, however, if you look carefully to the html source, you will see that it also grabs the 3rd url.
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
The Top Social Networking Sites People Are Using
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554">https://www.lifewire.com/top-<b>social-networking-sites</b>-people-are...
</a>
<p class="result-snippet">
The Top
</p>
</div>
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking- websites"
>
Top 15 Most Popular Social Networking Sites | January 2019
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="http://www.ebizmba.com/articles/social-networking- websites">www.ebizmba.com/articles/<b>social-networking</b>-<b>websites</b>
</a>
<p class="result-snippet">
Top 15 Most
</p>
</div>
i have tried the following c# code to grab the text between the div tags but it grabs everything, which i dont want.
int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;
int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");
urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);
to grab url i am using the following:
var regexURLParser = new Regex(@"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", RegexOptions.Singleline | RegexOptions.CultureInvariant);
what i want is to grab is the url from these:
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking-websites"
>
so that the outcome shows only:
https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554
http://www.ebizmba.com/articles/social-networking-websites