I am using VB.NET with Abot and have a handful of URLs that refuse to be crawled. I would really like to detect when a crawl returns a null response, but I am having trouble figuring out HOW.

Code:

Public Sub crawler_ProcessPageCrawlCompleted(sender As Object, e As PageCrawlCompletedArgs)

    pageNumber += 1
    Try

        Dim crawledPage As CrawledPage = e.CrawledPage


        If (Not (crawledPage.HttpWebResponse Is Nothing) And Not (crawledPage.WebException Is Nothing)) Or crawledPage.HttpWebResponse.StatusCode <> HttpStatusCode.OK Then
            CrawlFailed(e.CrawledPage.ToString, Failed)
        Else

            If String.IsNullOrEmpty(crawledPage.Content.Text) Then
                CrawlFailed(e.CrawledPage.ToString, NoContent)
            Else
                StoreContent(e)
            End If

        End If


    Catch ex As Exception
        RichTextBox1.AppendText(e.CrawledPage.ToString & " - " & ex.Message & vbCrLf)
    End Try

End Sub

I put in the Try-Catch to capture that exception, but I would much rather handle it in my CrawlFailed subroutine so that I can do something with that URL.

I have tried to figure out how to use GetResponseStream and Stream.Null, but I can't work out how to detect an empty stream :( I'm sure I'm just missing something, but I've googled all over the place and the best I can find is this thread: "crawledPage.HttpWebResponse is null in Abot".

However, that doesn't really explain HOW to detect and code against the result.

Andrew
  • `GetResponseStream and Stream.Null`? Do you mean this issue (http://stackoverflow.com/questions/22921555/check-null-for-httpwebresponse)? You don't need to check for Stream.Null and doing so accomplishes nothing. Also, HTTP does not recognize a "null" response but you can detect an empty stream by reading from it or possibly by using the Length property. – usr Nov 08 '16 at 13:38
  • I did read through that, although I just now saw your final analysis on it. Since I'm using Abot, I'm not quite sure how I detect whatever it is giving me as an output, then? I'm sure I'm missing something out of your response @usr, maybe you can help me understand? – Andrew Nov 08 '16 at 14:34
  • I don't know anything about Abot but if `crawledPage.HttpWebResponse` is of type `HttpWebResponse` then my answer applies. Just read from that stream to obtain the content and possibly find it empty. If you can't make that work post the reading code. – usr Nov 08 '16 at 14:35
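
For reference, the exception in the question's handler comes from the condition itself: `Or` does not short-circuit, so `crawledPage.HttpWebResponse.StatusCode` is still evaluated when `HttpWebResponse` is Nothing, which raises a NullReferenceException. A minimal sketch of a guard that routes that case into CrawlFailed instead is shown here. It is a fragment meant to replace the handler in the question, and it assumes the same members the question already uses (`pageNumber`, `CrawlFailed`, `Failed`, `NoContent`, `StoreContent`) plus the Abot and `System.Net` imports the original code needs; it has not been checked against any particular Abot version.

```vbnet
Public Sub crawler_ProcessPageCrawlCompleted(sender As Object, e As PageCrawlCompletedArgs)
    pageNumber += 1
    Dim crawledPage As CrawledPage = e.CrawledPage

    ' OrElse short-circuits, so StatusCode is only read once
    ' HttpWebResponse is known not to be Nothing.
    If crawledPage.WebException IsNot Nothing OrElse
       crawledPage.HttpWebResponse Is Nothing OrElse
       crawledPage.HttpWebResponse.StatusCode <> HttpStatusCode.OK Then
        ' Missing response, recorded exception, or non-200 status.
        CrawlFailed(crawledPage.ToString, Failed)
    ElseIf String.IsNullOrEmpty(crawledPage.Content.Text) Then
        ' A response arrived, but with no body text.
        CrawlFailed(crawledPage.ToString, NoContent)
    Else
        StoreContent(e)
    End If
End Sub
```

With the check ordered this way, a page whose response never arrived is treated as a failed crawl rather than surfacing as an exception in the Catch block.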

1 Answer

I had the same issue (on .NET Core). In a Fiddler session I could see that the response actually did arrive, but I also saw that the site took a long time to return a result.

Try setting `config.HttpRequestTimeoutInSeconds` to a higher value. That resolved the issue for me.
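
As a rough sketch of that setting in VB.NET: the `CrawlerSetup` module, the `BuildConfig` function, and the value 60 are illustrative names and numbers, not anything from the original setup. The `Abot.Poco` namespace is an assumption that matches Abot 1.x; Abot 2.x uses `Abot2.Poco` instead.

```vbnet
Imports Abot.Poco   ' assumption: Abot 1.x; use Abot2.Poco for Abot 2.x

Module CrawlerSetup

    ' Builds a crawl configuration with a longer per-request timeout,
    ' so that slow sites have time to answer before the request is abandoned.
    Function BuildConfig() As CrawlConfiguration
        Dim config As New CrawlConfiguration()
        config.HttpRequestTimeoutInSeconds = 60   ' illustrative value, tune per site
        Return config
    End Function

End Module
```

The returned configuration would then be passed to the crawler when it is constructed, in whatever way the existing setup code already does.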

André