Regular expression to extract the content inside the script tag in php

Question

I am trying to extract the download URLs from a webpage. The code I tried is below:

function getbinaryurl($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    $value1 = curl_exec($curl);
    curl_close($curl);
    $start = preg_quote('<script type="text/x-component">', '/');
    $end = preg_quote('</script>', '/');
    $rx = preg_match("/$start(.*?)$end/", $value1, $matches);
    var_dump($matches);
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);

This way I get the tag information, not the content inside the script tag. How do I get the content inside?

The expected result is: https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip, https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.6.exe, https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.6.msi

I am very new to writing regular expressions. Can anyone help me, please?

Why look for script tags? On your URL, the links you are looking for are enclosed by A tags... not sure what you are up to here — Pierre, Feb 09 '20 at 17:43
As I am not very familiar with extraction, I relied on the script tags. If there is a better way to extract the info, please help me with that. — mvsr, Feb 09 '20 at 17:50

score 2 · Accepted Answer · answered Feb 09 '20 at 18:02

Instead of using regex, using DOMDocument and XPath allows you to have more control of the elements you select.

Although XPath can be as difficult as regex, some find it more intuitive. The code uses //script[@type="text/x-component"][contains(text(), "macURL")], which breaks down as follows:

//script = any script node
[@type="text/x-component"] = which has an attribute called type with the specific value
[contains(text(), "macURL")] = whose text contains the string macURL
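
To see how each predicate narrows the selection, here is a minimal sketch run against a small inline document. The markup and the example.com URL are made up for illustration; only the XPath expressions come from the breakdown above.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = <<<'HTML'
<html><body>
<script type="text/javascript">var x = "macURL";</script>
<script type="text/x-component">{"params": {"macURL": "https://example.com/app.zip"}}</script>
<script type="text/x-component">{"params": {"other": 1}}</script>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);

// Step 1: every script node in the document.
$all = $xp->query('//script');
// Step 2: only those whose type attribute has the specific value.
$typed = $xp->query('//script[@type="text/x-component"]');
// Step 3: of those, only the ones whose text contains "macURL".
$matched = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');

echo $all->length, PHP_EOL;
echo $typed->length, PHP_EOL;
echo $matched->length, PHP_EOL;
echo $matched->item(0)->textContent, PHP_EOL;
```

Each added predicate filters the previous result, so the first script (wrong type) and the third (no macURL in its text) drop out in turn.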

The query() method returns a list of matches, so loop over them. The content is JSON, so decode it and output the values...
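
The decoding step can also be written defensively, in case a matched script does not hold valid JSON or lacks one of the keys. This is only a sketch: the helper name collectUrls and the sample payload are hypothetical, and the real page's JSON may be shaped differently.

```php
<?php
// Hypothetical helper: pull the known URL keys out of one script's JSON text.
function collectUrls(string $json): array
{
    $content = json_decode($json, true);

    // json_decode() returns null on malformed input; treat that as "no URLs".
    if (!is_array($content) || !isset($content['params']) || !is_array($content['params'])) {
        return [];
    }

    $urls = [];
    foreach (['macURL', 'windowsURL', 'enterpriseURL'] as $key) {
        // Skip keys that are absent instead of triggering an undefined-index notice.
        if (isset($content['params'][$key])) {
            $urls[$key] = $content['params'][$key];
        }
    }
    return $urls;
}

// Made-up payload for illustration only.
$sample = '{"params": {"macURL": "https://example.com/mac.zip", "windowsURL": "https://example.com/win.exe"}}';
print_r(collectUrls($sample));
```

Inside the loop over the query() results, the helper would be called with $src->textContent.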

function getbinaryurl ($url)
{

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    $value1 = curl_exec($curl);
    curl_close($curl);

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($value1);
    libxml_use_internal_errors(false);

    $xp = new DOMXPath($doc);

    $srcs = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');
    foreach ( $srcs as $src )   {
        $content = json_decode( $src->textContent, true);
        echo $content['params']['macURL'] . PHP_EOL;
        echo $content['params']['windowsURL'] . PHP_EOL;
        echo $content['params']['enterpriseURL'] . PHP_EOL;
    }
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);

which outputs

https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.8.exe
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.8.msi

For whatever reason I wasn't able to retrieve the url with curl unless I set curl options `CURLOPT_SSL_VERIFYHOST` and `CURLOPT_SSL_VERIFYPEER` to 0. — Booboo, Feb 09 '20 at 20:42
https://stackoverflow.com/questions/4372710/php-curl-https has some more info about curl & https. — Nigel Ren, Feb 09 '20 at 20:51
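
Turning off `CURLOPT_SSL_VERIFYHOST` and `CURLOPT_SSL_VERIFYPEER` makes the request go through but removes certificate checking. Before doing that, it may help to see what cURL actually reports. The sketch below assumes a hypothetical helper named fetchPage and arbitrary timeout values; it only surfaces the error and does not fix a certificate problem by itself.

```php
<?php
// Hypothetical helper: fetch a page and report cURL failures explicitly.
function fetchPage(string $url): string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10); // assumed value, in seconds
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);        // assumed value, in seconds

    $body = curl_exec($curl);
    if ($body === false) {
        // curl_errno()/curl_error() describe why the transfer failed,
        // for example a certificate verification problem.
        throw new RuntimeException(
            'cURL error ' . curl_errno($curl) . ': ' . curl_error($curl)
        );
    }
    return $body;
}

try {
    $html = fetchPage('https://www.sourcetreeapp.com/download-archives');
    echo strlen($html), ' bytes received', PHP_EOL;
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

If the message points at certificate verification, the linked question discusses configuring a CA bundle rather than disabling the checks.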
