
I am trying to follow the multithreading example given in: Python urllib2.urlopen() is slow, need a better way to read several urls, but I seem to get a "thread error" and I am not sure what this really means.

urlList=[list of urls to be fetched]*100
def read_url(url, queue):
    my_data=[]
    try:
        data = urllib2.urlopen(url,None,15).read()
        print('Fetched %s from %s' % (len(data), url))
        my_data.append(data)
        queue.put(data)
    except HTTPError, e:
        data = urllib2.urlopen(url).read()
        print('Fetched %s from %s' % (len(data), url))
        my_data.append(data)
        queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args = (url,result)) for url in urlList]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

res=[]  
res=fetch_parallel()
reslist = []
while not res.empty: reslist.append(res.get())
print (reslist)

I get the following first error:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "demo.py", line 76, in read_url
    print('Fetched %s from %s' % (len(data), url))
TypeError: object of type 'instancemethod' has no len()

On the other hand, I see that sometimes it does seem to fetch data, but then I get the following second error:

Traceback (most recent call last):
  File "demo.py", line 89, in <module>
    print str(res[0])
AttributeError: Queue instance has no attribute '__getitem__'

When it fetches data, why is the result not showing up in res[]? Thanks for your time.

Update: After changing read to read() in the read_url() function, the situation has improved (I now get many page fetches), but I still get the error:

Exception in thread Thread-86:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "demo.py", line 75, in read_url
    data = urllib2.urlopen(url).read()
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 605, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 502: Bad Gateway
    Well, have you checked your gateway? – Arafangion Jan 26 '12 at 02:39
  • @Arafangion: I wasn't sure what I could do with a 502 error. Isn't it beyond my control (sort of?) The reason I posted it was because I was not sure if it had anything to do with multithreading. – JohnJ Jan 26 '12 at 02:59

1 Answer


Note that urllib2 is not thread-safe. Therefore, you should really use urllib3.
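
For comparison, here is a minimal sketch (my addition, not part of the original answer) of what read_url could look like with urllib3, whose connection pools are designed to be shared safely across threads. It assumes a single PoolManager created up front and plugs into the existing fetch_parallel unchanged:

import urllib3

http = urllib3.PoolManager()  # one pool, safe to share between all threads

def read_url(url, queue):
    try:
        # request() returns an HTTPResponse; r.data is the body as a string
        r = http.request('GET', url, timeout=15.0)
        if r.status != 200:
            # urllib3 does not raise on HTTP error statuses such as 502
            print('Skipping %s: HTTP %d' % (url, r.status))
            return
        print('Fetched %s from %s' % (len(r.data), url))
        queue.put(r.data)
    except urllib3.exceptions.HTTPError as e:
        # connection-level failures (timeouts, DNS errors, ...) land here
        print('Failed to fetch %s: %s' % (url, e))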

Some of your problems are entirely unrelated to threading. Threads just make the error reporting more complex. Instead of

data = urllib2.urlopen(url).read

you want

data = urllib2.urlopen(url).read()
#                               ^^

A 502 Bad Gateway error indicates a server misconfiguration (most likely, an internal server of the web service you're connecting to is rebooting or unavailable). There's nothing you can do about it - the URL is just not reachable right now. Use try..except to handle these errors, for example by printing a diagnostic message, scheduling the URL to be retrieved again after an appropriate waiting period, or leaving the failed data set out.
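
As an illustration, a sketch (an assumption on my part, not the asker's exact code) of a read_url that reports the failure and leaves the data set out could look like this:

import urllib2
from urllib2 import HTTPError, URLError

def read_url(url, queue):
    try:
        data = urllib2.urlopen(url, None, 15).read()
        print('Fetched %s from %s' % (len(data), url))
        queue.put(data)
    except HTTPError as e:
        # e.g. 502 Bad Gateway: print a diagnostic and skip this URL
        # (HTTPError must be caught before URLError, its parent class)
        print('Skipping %s: HTTP %s %s' % (url, e.code, e.msg))
    except URLError as e:
        # connection-level failure (DNS, refused connection, timeout, ...)
        print('Skipping %s: %s' % (url, e.reason))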

To get the values from the queue, you can do the following:

res = fetch_parallel()
reslist = []
while not res.empty():
  reslist.append(res.get_nowait()) # or get, doesn't matter here
print (reslist)

There is also no way around real error handling in case a URL is really unreachable. Simply re-requesting it might work in some cases, but you must be able to handle the case that the remote host is truly unreachable at this time. How you do that depends on your application's logic.
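
If you do want to re-request, one possible shape (the attempt count and waiting period below are arbitrary assumptions of mine) is a bounded retry loop:

import time
import urllib2
from urllib2 import HTTPError, URLError

def fetch_with_retries(url, attempts=3, wait=5):
    for attempt in range(attempts):
        try:
            return urllib2.urlopen(url, None, 15).read()
        except (HTTPError, URLError) as e:
            print('Attempt %d for %s failed: %s' % (attempt + 1, url, e))
            if attempt + 1 < attempts:
                time.sleep(wait)  # wait before retrying
    return None  # caller decides how to handle a truly unreachable URL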

    Many thanks for that. The situation has improved, but I still get thread errors. I have updated the post accordingly. – JohnJ Jan 26 '12 at 02:37
  • 1
    Updated the answer with information about a 502 error. If you get more (unrelated) errors, you should open a new question. That enables this question to serve as a reference for everybody else with a similar problem, and simplifies the answers. – phihag Jan 26 '12 at 02:43
  • Where should the try: except statements go? (Sorry, I am new to threading.) Also, how can I look at the page as a string? Does "print res[0]" not suffice? When I "print res" I get "" How can I look at a more useful representation, something like res[0]? Thanks again for your suggestions and workarounds. – JohnJ Jan 26 '12 at 02:56
  • @JohnJ try..except goes around the urllib2 calls in `read_url`. That's also where you can debug the pages, by outputting `data`. Since a Queue is thread-safe and optimized for performance, you can't really go in and read the whole Queue. Instead, call [`get`](http://docs.python.org/library/queue.html#Queue.Queue.get) until the queue is empty. – phihag Jan 26 '12 at 03:04
  • Thanks for that. I understand try: catch in read_url function. However, I am still not able to see any "result" from page. I looked at the "get" manual, but I guess it is not very clear. Should I use "queue.get_nowait()"? What I want is the result to be in an array form, so that I could use it for further processing. Where exactly should I implement queue.get() (or) how do I get the url contents into an array? Thanks again – JohnJ Jan 26 '12 at 03:29
  • print(data) shows me only the last url visited in the loop. On the other hand, what I'd want is all the url results to be stored in an array for later access. When I use: "reslist = [] while not res.empty: reslist.append(res.get()); print (reslist)" (please see my updated code above), I get only "[]" as result. What am I doing wrong? I do understand the .get() method, but can I iterate (and store) through this somehow? like .get(0), .get(1) etc.. Sorry about my naivety on multithreading! – JohnJ Jan 26 '12 at 13:17
  • @JohnJ Oops, it should be `empty()`. Please do **not** significantly change your question. That will make it confusing to future visitors. Instead, just ask a new one (if you want me to answer, you can ping me by commenting here, or [message me via other channels](http://phihag.de)). Updated the answer in response to some of the edits. – phihag Jan 26 '12 at 13:33
  • thanks a tonne for that. It seems to have been solved. I do see that this is significantly faster than sequential processing. Wonderful. However, at times, I get weird errors like "URLError: ".. which I don't get when I do it sequentially. Strange. Do you think it is something to do with the timeout I have specified in urlopen? Many thanks again. – JohnJ Jan 26 '12 at 13:46
  • @JohnJ Oops, overlooked thread-safety, since the errors you got were unrelated to it. I amended the answer to contain a note (at the top) to the thread-safety problems. – phihag Jan 26 '12 at 13:55
  • thanks phihag for everything. It turns out that I can't really use this in a production environment. This is because at times it manages to fetch all the urls without a problem, at times it fails to fetch any, and at times fetches a handful of them. This is very random behavior for the same code: Actually, quite scary. When I do it sequentially, I have no problems though. What are your thoughts on it? Many thanks again – JohnJ Jan 26 '12 at 15:56
  • @JohnJ In theory, it could be a typical problem with the thread-safety of urllib2. However, I'd strongly assume that you have a buggy DNS resolver (or local server), or a buggy firewall. Opening up to a thousand concurrent connections should work fine. After that, you can get to weird limits, like the number of local TCP ports. urllib2 is unfortunately not designed to be a high-performance library. If you really want to know, generate a packet dump (for example with [wireshark](http://wireshark.org)), and analyze it. – phihag Jan 26 '12 at 16:00
  • Well I actually tried urllib3, but failed miserably: Just posted it: http://stackoverflow.com/questions/9021140/urllib3-maxretryerror – JohnJ Jan 26 '12 at 16:09