
I want to find all hyperlinks in a large string that contain specific keywords. Each hyperlink must contain both of the following keywords: the manufacturer's name (e.g. Samsung) and "download".

The structure of the string looks like this:

<h2><a href="https://www.samsung.com/semiconductor/minisite/ssd/product/consumer/magician/" h="ID=SERP,5141.1">
Samsung Magician Software | Samsung V-NAND …</a></h2><div class="b_suffix b_secondaryText nowrap">
<a href="http://www.microsofttranslator.com/bv.aspx?ref=SERP&amp;br=ro&amp;mkt=nl-NL&amp;dl=nl&amp;lp=EN_NL&amp
;a=https%3a%2f%2fwww.samsung.com%2fsemiconductor%2fminisite%2fssd%2fproduct%2fconsumer%2fmagician%2f" h="ID=SERP
,5148.1">Deze pagina vertalen</a></div></div><div class="b_caption"><div class="b_attribution b_nav" u="0N|5131|
4627393552323160|A_CGcDrlDvI4S9DANesILYWPEhC2j7ly"><cite>https://www.<strong>samsung</strong>.com/.../minisite/
ssd/product/consumer/<strong>magician</strong></cite><span class="c_tlbxTrg"><span class="c_tlbxH" H="BASE:CACHE
DPAGEDEFAULT" K="SERP,5142.1"></span></span></div><p><strong>Samsung Magician</strong> software is designed to 
help you manage your <strong>Samsung</strong> SSD with a simple, intuitive user interface. <strong>Download
/strong> files &amp; find supported models.</p></div><div Class="dlCollapsedCnt"><div class="b_vlist2col b_deep"
><ul><li><h3 class="deeplink_title"><a href="https://www.samsung.com/semiconductor/minisite/ssd/product/consumer
/860evo" h="ID=SERP,5345.1">860 EVO</a></h3><p>The SSD to trust. The newest edition to the world’s best-selling*
SATA SSD series, the …</p></li></ul><ul></ul></div></div><form class="sc_rf dlsbox b_externalSearch b_divsec 
dlCollapsed" name="sc_rf dlsbox b_externalSearch b_divsec dlCollapsed5349" onsubmit="sa_ResubmitForm.Resubmit
(this, 'http:\/\/www.samsung.com\/nl\/function\/search\/espsearchResult?keywords=&amp;input_keyword=%query');
return false;"><input id="sc_rf dlsbox b_externalSearch b_divsec dlCollapsed5349_si" type="hidden" value="SERP,
5349.1"/> <input class="b_hide" value="ctxtb" id="h_c0" name="h_c0" /><input type="text" id="c0" name="query" 
maxlength="100" style="width:466px" class="ctxt" qi="1" data-gt="" placeholder="Zoeken in samsung.com" onfocus="
sj_evt.fire('LogTextBoxFocus', 'SERP','5515.1')" />  <span id="3C65B3_3_btn" class="cbtn" data-wire="I;button_
init;; |" data-appns="SERP" data-k="5517.1"><input type="submit" name="submit" id="sb_submit" value="Zoeken" 
style="width:100px" /></span></form></li><li class="b_algo"><div class="b_title"><h2>
<a href="https://www.samsung.com/semiconductor/minisite/ssd/download/tools/" h="ID=SERP,5156.1">

Currently only the first hyperlink is found, and it does not contain the keyword "download". So my output is the hyperlink https://www.samsung.com/semiconductor/minisite/ssd/product/consumer/magician/ instead of the expected https://www.samsung.com/semiconductor/minisite/ssd/download/tools/

Here is my current code:

String strStart = "<a href=\"";
String strEnd = "\" h=\"ID=";         
            
if (stringsource.Contains(strStart) && 
    stringsource.Contains(strEnd) && 
    stringsource.Contains(manufacture))
{
    int Start, End;
    Start = stringsource.IndexOf(strStart, 0)+strStart.Length;
    End = stringsource.IndexOf(strEnd, Start);

    Console.WriteLine("Text: " + stringsource.Substring(Start, End - Start));
    String substring = stringsource.Substring(Start, End -  Start);               
}
Rufus L
Vik0809

2 Answers


One problem is that you don't keep searching for a match after you find the first one. The second problem is that you don't check if the string contains the term "download".

You can solve these by using a loop, where we keep searching from where the last search left off. We then also add a check for "download" in our results. If we find it, then we can either return that right away, or add it to a list and return the list if we may expect more than one result (which is what I've done below):

// Requires: using System; using System.Collections.Generic;
public static List<string> GetManufacturerDownloadLinks(string input, string manufacturer)
{
    var result = new List<string>();
    if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(manufacturer)) return result;

    var strStart = "<a href=\"";
    var strEnd = "\" h=\"ID=";

    // Find the first index of the start string
    var start = input.IndexOf(strStart);

    while (start > -1 && start < input.Length - strEnd.Length)
    {
        // The URL itself begins right after the start string
        var urlStart = start + strStart.Length;

        // Find the first index of the end string after the start string
        var end = input.IndexOf(strEnd, urlStart);
        if (end < 0) break;

        // Get the URL between the start and end strings
        var url = input.Substring(urlStart, end - urlStart);

        // If it contains the required terms, add it to our list. The comparison
        // ignores case because the URL has "samsung" where the name is "Samsung".
        if (url.IndexOf(manufacturer, StringComparison.OrdinalIgnoreCase) >= 0 &&
            url.IndexOf("download", StringComparison.OrdinalIgnoreCase) >= 0)
            result.Add(url);

        // Find the next index of the start string after end
        start = input.IndexOf(strStart, end + strEnd.Length);
    }

    return result;
}
Rufus L

There seems to be only one hyperlink in the string that contains both 'Samsung' and 'download', i.e. https://www.samsung.com/semiconductor/minisite/ssd/download/tools/

The best way to approach this is to get all hyperlinks in the string and check each one for the manufacturer's name (in this case 'Samsung') and 'download'.

Check this link for scraping hyperlinks from HTML code. This can be done easily using regex.
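
As a rough illustration of the regex approach, here is a minimal sketch. The class name `LinkFilter`, the method `FindLinks`, and the pattern itself are assumptions made for this example, not taken from the linked page. It assumes every link is written as `<a ... href="...">` with double quotes, as in the sample string. A regex like this is fine for a quick filter, but it is not a general HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFilter
{
    // Assumed pattern: an <a ...> tag with a double-quoted href attribute.
    // The URL is captured in the named group "url".
    private static readonly Regex HrefPattern = new Regex(
        "<a\\s+[^>]*?href\\s*=\\s*\"(?<url>[^\"]*)\"",
        RegexOptions.IgnoreCase);

    // Returns every href value that contains all of the given keywords,
    // compared without regard to case.
    public static List<string> FindLinks(string html, params string[] keywords)
    {
        var result = new List<string>();
        if (string.IsNullOrEmpty(html)) return result;

        foreach (Match match in HrefPattern.Matches(html))
        {
            var url = match.Groups["url"].Value;
            var hasAll = true;

            foreach (var keyword in keywords)
            {
                if (url.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) < 0)
                {
                    hasAll = false;
                    break;
                }
            }

            if (hasAll) result.Add(url);
        }

        return result;
    }
}
```

Calling `LinkFilter.FindLinks(stringsource, "Samsung", "download")` on the string from the question should then leave only the `.../ssd/download/tools/` link, since the other `href` values do not contain "download".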