I am trying to extract the download URLs from a webpage. The code I tried is below:

function getbinaryurl ($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    $value1 = curl_exec($curl);
    curl_close($curl);
    $start = preg_quote('<script type="text/x-component">', '/');
    $end = preg_quote('</script>', '/');
    $rx = preg_match("/$start(.*?)$end/", $value1, $matches);
    var_dump($matches);
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);

This way I get the tag information, not the content inside the script tag. How can I get the content inside it?

The expected result is:

https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.6.exe
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.6.msi

I am very new to writing regular expressions. Can anyone help me, please?

mvsr
  • Why look for script tags? On your URL, the links you are looking for are enclosed in A tags. Not sure what you are up to here. – Pierre Feb 09 '20 at 17:43
  • Wouldn’t domDocument be a better tool for this? – Tim Morton Feb 09 '20 at 17:49
  • Since I don't know much about extraction, I relied on the script tags. If there is a better way to extract the info, please help me with that. – mvsr Feb 09 '20 at 17:50

1 Answer

Instead of regex, you can use DOMDocument and XPath, which give you more control over the elements you select.

XPath can be as difficult as regex, but some find it more intuitive. The code uses //script[@type="text/x-component"][contains(text(), "macURL")] which, broken down, is

  • //script = any script node
  • [@type="text/x-component"] = which has an attribute called type with the specific value
  • [contains(text(), "macURL")] = whose text contains the string macURL
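
To see how each predicate narrows the selection, here is a minimal sketch run against a hard-coded string rather than the live page. The sample markup and the example.com URL are made up for illustration and are only an assumption about what the real page looks like.

```php
<?php
// Hypothetical sample markup; the real page content may differ.
$html = <<<'HTML'
<html><body>
<script type="text/javascript">var macURL = "ignored";</script>
<script type="text/x-component">{"params":{"macURL":"https://example.com/app.zip"}}</script>
<script type="text/x-component">{"params":{"other":"value"}}</script>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);

// Each query adds one more condition to the previous one.
$all     = $xp->query('//script');
$typed   = $xp->query('//script[@type="text/x-component"]');
$matched = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');

// Print how many nodes survive each step.
echo $all->length, ' ', $typed->length, ' ', $matched->length, PHP_EOL;
```

The first script mentions macURL but has a different type, so only the final query isolates the node that holds the JSON.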

The query() method returns a list of matching nodes, so the code loops over them. The content of each node is JSON, so it is decoded and the values are output:

function getbinaryurl ($url)
{
    // Fetch the page as a string instead of printing it
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    $value1 = curl_exec($curl);
    curl_close($curl);

    // curl_exec() returns false on failure, so there is nothing to parse
    if ($value1 === false) {
        return;
    }

    // Suppress the warnings libxml raises for imperfect real-world HTML
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($value1);
    libxml_use_internal_errors(false);

    $xp = new DOMXPath($doc);

    // Select the script nodes that carry the download URLs as JSON
    $srcs = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');
    foreach ( $srcs as $src )   {
        $content = json_decode( $src->textContent, true);
        echo $content['params']['macURL'] . PHP_EOL;
        echo $content['params']['windowsURL'] . PHP_EOL;
        echo $content['params']['enterpriseURL'] . PHP_EOL;
    }
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);

This outputs:

https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.8.exe
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.8.msi
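
The version numbers here already differ from those in the question, so the page content evidently changes over time. If a script node ever held invalid JSON or lacked one of the keys, the direct array access above would raise warnings. A more defensive variant might look like the sketch below. extractUrls() is a hypothetical helper, not part of the code above, and the sample payload is made up.

```php
<?php
// Hypothetical helper: collect whichever of the expected keys exist in one JSON payload.
function extractUrls(string $json, array $keys = ['macURL', 'windowsURL', 'enterpriseURL']): array
{
    $data = json_decode($json, true);

    // json_decode() returns null for invalid JSON; also guard against a missing "params" object.
    if (!is_array($data) || !isset($data['params']) || !is_array($data['params'])) {
        return [];
    }

    $urls = [];
    foreach ($keys as $key) {
        if (isset($data['params'][$key]) && is_string($data['params'][$key])) {
            $urls[$key] = $data['params'][$key];
        }
    }
    return $urls;
}

// Made-up payload with one key missing, plus an invalid string.
$sample = '{"params":{"macURL":"https://example.com/app.zip","windowsURL":"https://example.com/app.exe"}}';
print_r(extractUrls($sample));
var_dump(extractUrls('not json'));
```

Inside the foreach loop, extractUrls($src->textContent) could then stand in for the json_decode() call and the three echo lines.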
Nigel Ren
  • For whatever reason, I wasn't able to retrieve the URL with curl unless I set the curl options `CURLOPT_SSL_VERIFYHOST` and `CURLOPT_SSL_VERIFYPEER` to 0. – Booboo Feb 09 '20 at 20:42
  • https://stackoverflow.com/questions/4372710/php-curl-https has some more info about curl & https. – Nigel Ren Feb 09 '20 at 20:51