5

I wrote a crawler in Python to download some web pages from a website based on some given URLs. I noticed that occasionally my program hangs at this line, "conn.getresponse()". No exception is thrown and the program simply waits there forever.

conn = httplib.HTTPConnection(component.netloc)
conn.request("GET", component.path + "?" + component.query)
resp = conn.getresponse() #hang here

I read the API doc, and it says that a timeout can be added like this:

conn = httplib.HTTPConnection(component.netloc, timeout=10)

However, it does not allow me to "retry" the connection. What is the best practice for retrying the crawl after a timeout?

For example, I'm thinking of the following solution:

trials = 3
while trials > 0:
    try:
        ... code here ...
        break  # success, stop retrying
    except:
        trials -= 1

Am I heading in the right direction?

Chu-Cheng Hsieh
  • 191
  • 1
  • 7
  • Sometimes the Python libraries interpret some headers differently than web browsers do (as happened in [this question](http://stackoverflow.com/q/8527862/183066)). So, just to make sure, I think you could try opening the same URL in a web browser. – jcollado Dec 20 '11 at 06:55

2 Answers

0

You can add a timeout for the case where you get no data. The interesting part is that you need to add it to the HTTPConnection, not to the request, like so:

conn = httplib.HTTPConnection(component.netloc, timeout=10)
conn.request("GET", component.path + "?" + component.query)
resp = conn.getresponse() #now this will timeout if the other side hangs!

I have not tried it, but it seems the timeout can also be set/changed after the fact, as in this question.

Alternatively, if you want to time out when the response is taking too long even though you are still receiving some data from the connection, you can use eventlet, as in this example.

ntg
  • 12,950
  • 7
  • 74
  • 95
0

However, it does not allow me to "retry" the connection.

Yes, the timeout is designed to push this policy back where it belongs, in your code (and out of httplib).

What is the best practice for retrying the crawl after a timeout?

It's very application-dependent. How long can your crawler stand to postpone its other work? How badly do you want it to crawl deeply into each site? Do you need to be able to endure slow, oversubscribed servers? What about servers that throttle or take other countermeasures when they detect crawlers? And while I'm asking, are you respecting robots.txt?

Since the answers to these questions likely vary widely, it makes sense for you to tune this to your crawler's needs, the sites you tend to crawl (assuming there are trends), and your WAN performance.

Brian Cain
  • 14,403
  • 3
  • 50
  • 88
  • I have the same problem. Both the server and the client are part of my team's microservices running on a Kubernetes cluster, and the process just halts forever... – ntg Nov 02 '20 at 10:13
  • @ntg sounds like you want to add a timeout, too. – Brian Cain Nov 03 '20 at 14:57
  • Yes, though I want the timeout not so I can do other things, but so I get a signal and can recover the data... It turns out that the timeout is part of the connection properties... In hindsight that should have been expected... – ntg Nov 04 '20 at 12:34