
I have made a web crawler that takes thousands of URLs from a text file and then crawls the data on each of those web pages.
Since there are so many URLs, some of them are broken.
So it gives me this error:

Traceback (most recent call last):
  File "C:/Users/khize_000/PycharmProjects/untitled3/new.py", line 57, in <module>
    crawl_data("http://www.foasdasdasdasdodily.com/r/126e7649cc-sweetssssie-pies-mac-and-cheese-recipe-by-the-dr-oz-show")
  File "C:/Users/khize_000/PycharmProjects/untitled3/new.py", line 18, in crawl_data
    data = requests.get(url)
  File "C:\Python27\lib\site-packages\requests\api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python27\lib\site-packages\requests\adapters.py", line 437, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.foasdasdasdasdodily.com', port=80): Max retries exceeded with url: /r/126e7649cc-sweetssssie-pies-mac-and-cheese-recipe-by-the-dr-oz-show (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0310FCB0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))

Here's my code:

import requests
from bs4 import BeautifulSoup

def crawl_data(url):
    global connectString
    data = requests.get(url)
    response = str(data)
    if response != "<Response [200]>":
        return
    soup = BeautifulSoup(data.text, "lxml")
    titledb = soup.h1.string

But it still gives me the same exception.

I simply want it to ignore the URLs from which there is no response and move on to the next URL.

Umer Javed
  • If you want to ignore a particular exception, check http://stackoverflow.com/q/574730/754550 , http://stackoverflow.com/q/730764/754550 , http://stackoverflow.com/q/21553327/754550 – miracle173 Jan 17 '16 at 11:08
  • Should I remove my question, since we already have an answer on these pages? My scenario is different, by the way. – Umer Javed Jan 17 '16 at 19:16

2 Answers


You need to learn about exception handling. The easiest way to ignore these errors is to surround the code that processes a single URL with a try-except construct, making your code read something like:

try:
    <process a single URL>
except requests.exceptions.ConnectionError:
    pass

This means that if the specified exception occurs, your program will just execute the pass (do nothing) statement and move on to the next URL.
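For example, the loop that drives the crawler might look something like the sketch below. The file name urls.txt and the driver name crawl_all are just placeholders, and crawl_data stands in for the function from the question:

import requests

def crawl_all(path):
    # Read URLs one per line and process each, skipping any that fail to connect.
    with open(path) as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            try:
                crawl_data(url)  # process a single URL
            except requests.exceptions.ConnectionError:
                # Broken or unreachable URL: skip it and move on to the next one.
                continue

crawl_all("urls.txt")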

holdenweb

Use try-except:

import requests
from bs4 import BeautifulSoup

def crawl_data(url):
    global connectString
    try:
        data = requests.get(url)
    except requests.exceptions.ConnectionError:
        # Broken URL: give up on this one and let the caller move on.
        return

    response = str(data)
    soup = BeautifulSoup(data.text, "lxml")
    titledb = soup.h1.string
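With crawl_data written this way, the loop that feeds it needs no error handling of its own; a broken URL simply causes an early return. A rough sketch of the calling code, assuming the URLs are stored one per line in a file named urls.txt (your file name will differ):

with open("urls.txt") as f:
    for line in f:
        url = line.strip()
        if url:
            crawl_data(url)  # returns early on ConnectionError, so broken URLs are skipped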
Kenly
  • It's generally considered a bad practice to silently ignore every possible exception in a block of code. – Bryan Oakley Jan 17 '16 at 12:41
  • I think it's more correctly stated he wants to ignore `ConnectionError`. If he introduced a typo into his code, or the data that came back wasn't what he expected, those errors would get ignored, too. It also seems very odd to check for the string representation of the response code, rather than checking the actual response code. – Bryan Oakley Jan 17 '16 at 12:53
  • He is right. I just want to ignore the ConnectionError, so I tried converting the response object to a string and then comparing that. – Umer Javed Jan 17 '16 at 17:00
  • As @Bryan Oakley says, do not ignore every exception in a block of code; here we only try to get the URL, and return (do nothing) if the connection fails. – Kenly Jan 17 '16 at 18:18
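Picking up the point in the comments about checking the actual response code rather than its string representation: requests exposes the numeric code directly as data.status_code (and a boolean data.ok), so the str(data) comparison from the question is not needed. A minimal sketch that combines that check with the ConnectionError handling, reusing the names from the question:

import requests
from bs4 import BeautifulSoup

def crawl_data(url):
    try:
        data = requests.get(url)
    except requests.exceptions.ConnectionError:
        return  # broken URL: skip it

    if data.status_code != 200:  # check the numeric code, not str(data)
        return

    soup = BeautifulSoup(data.text, "lxml")
    titledb = soup.h1.string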