
I have the following problem:

I use a PowerShell script that scrapes a web page. Based on the response, it concludes whether the site is up or down.

My code works:

$web = New-Object System.Net.WebClient

foreach ($site in $websitesArray) {
    $Counter = 5
    $ErrorCounter = 0

    # Strip the scheme so the bare hostname can be resolved via DNS
    if ($site -like 'http://*') {
        $siteSplit = $site.Replace("http://", "")
    }
    else {
        $siteSplit = $site.Replace("https://", "")
    }
    $ip = [System.Net.Dns]::GetHostAddresses($siteSplit)

    # Request the page $Counter times, counting the failed attempts
    for ($i = 0; $i -lt $Counter; $i++) {
        Try {
            $HTMLstring = $web.DownloadString($site)
        }
        Catch {
            $ErrorCounter++
        }
        Start-Sleep -Seconds 5
    }

    # Flag the site as down when the majority of attempts failed
    if ($ErrorCounter -gt 3) {
        $body += "The website " + $site + " (" + $ip + ") has returned an HTTP error and is down <br />"
    }
}

The problem is that the following is returned for pages that have a cookie-policy pop-up:

The remote server returned an error: (500) Internal Server Error.

What workaround could I use to prevent this false positive? Keep in mind that I have a list of websites, so a hardcoded solution won't help me.

  • I'd use the Internet Explorer COM object instead (if on Windows): `$ie = new-object -com "InternetExplorer.Application"`. You can set visibility to false and then you are set. WebClient is an HTTP client and does not render JavaScript; over time I found it caused me much more headache than using the IE COM application. Headless browser projects exist in .NET that would render the page properly in that situation, but they are not native to PowerShell, so more work is needed to integrate them into your project. (A minimal sketch of the IE approach follows these comments.) – Sage Pourpre Feb 15 '19 at 17:25
  • @SagePourpre Can't we do a simple HTTP request and get the answer (and assume the site is down if the response is not in the 200 range)? – user3127554 Feb 28 '19 at 08:25
  • Yes, for sure. Since it is just for up/down status, you could do that. If it works, you are better off with it, since it is simpler. If you need to do more scraping / gather content, and/or you still get false positives, using a (hidden) IE COM object is probably the easiest way to prevent any problems, since the site will see it as just a normal browser. – Sage Pourpre Mar 01 '19 at 00:17
  • Alternatively, with your WebClient (or possibly WebRequest), you could also try to simulate a specific browser by setting the headers used by IE, Chrome, or Firefox (sketched below). Reference: https://stackoverflow.com/questions/11841540/setting-the-user-agent-header-for-a-webclient-request – Sage Pourpre Mar 01 '19 at 00:19
  • You could also extend the WebClient class to make it cookie-aware in addition to the headers (see the cookie-aware sketch below): https://stackoverflow.com/questions/1777221/using-cookiecontainer-with-webclient-class – Sage Pourpre Mar 01 '19 at 00:20
  • You used the word "scrape" in your question, which led me to think you were doing more than checking status. To check status, you can simply use the `Invoke-WebRequest` cmdlet and check whether the response is in the 200 range (see the sketch below). – Sage Pourpre Mar 01 '19 at 00:21
  • Thanks, I managed to get it working by checking the response code. – user3127554 Mar 15 '19 at 12:02
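
As a rough sketch of the hidden-IE approach suggested above (Windows only; `example.com` is a placeholder URL, substitute your own):

$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.Visible = $false                                # keep the browser window hidden
$ie.Navigate("http://example.com")                  # placeholder URL
while ($ie.Busy) { Start-Sleep -Milliseconds 250 }  # wait until the page has loaded
$html = $ie.Document.body.innerHTML                 # rendered HTML, JavaScript included
$ie.Quit()                                          # release the COM object when done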
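
Setting a browser-like User-Agent on the existing WebClient, per the first link, is a small change (the UA string below is just an example; any mainstream browser string will do):

$web = New-Object System.Net.WebClient
# Example Chrome-style User-Agent string
$web.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0 Safari/537.36")
$HTMLstring = $web.DownloadString($site)

Note that WebClient may reset custom headers between requests, so it is safest to re-add the header before each call.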
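
The cookie-aware WebClient from the second link can be compiled inline with Add-Type; this is a sketch of that pattern, and the class name is arbitrary:

Add-Type -TypeDefinition @"
using System;
using System.Net;

public class CookieAwareWebClient : WebClient
{
    public CookieContainer CookieContainer = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            // Attach the shared container so Set-Cookie values persist across calls
            httpRequest.CookieContainer = CookieContainer;
        }
        return request;
    }
}
"@

$web = New-Object CookieAwareWebClient
$HTMLstring = $web.DownloadString($site)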
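
Finally, a sketch of the status-code check the asker settled on. In Windows PowerShell, Invoke-WebRequest throws on 4xx/5xx responses as well as on network failures, so a down site lands in the catch block:

foreach ($site in $websitesArray) {
    try {
        $response = Invoke-WebRequest -Uri $site -UseBasicParsing -TimeoutSec 30
        # If no exception was thrown, the status code is in the 200 range
        Write-Host "$site is up ($($response.StatusCode))"
    }
    catch {
        # Non-2xx responses and connection errors both end up here
        Write-Host "$site is down: $($_.Exception.Message)"
    }
}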
