
I am trying to find out which of a set of URLs exist and which return an error, without having to go through all of them manually. I am using Python 3.5.0. A typical URL is http://www5.registraduria.gov.co/CuentasClarasPublicoCon2014/Consultas/Candidato/Reporte/2 and only the number at the end changes (from 0 to at most 10000). I tried the following:

import requests, os, bs4


url = 'http://www5.registraduria.gov.co/CuentasClarasPublicoCon2014/Consultas/Candidato/Reporte/'              
start = 0 #Start the count, 
urllook = url+str(start) #Add the count to the url
res = requests.get(urllook)#Talk to the page
goodid = []#create empty array of candidate id

while start<10000: #It seems like there are no candidates more than 5000, just to be sure made it big
    res = requests.get(urllook) #Talk to page again
    if res.ok: #If no error
        goodid.append(start) #Add the current id to the goodid array (before it is increased)
    start=start+1 #Increase count by one
    print(start) #What page I'm at
    urllook=url+str(start) #Build the next URL
print("LOL")

This works for a small number of URLs, such as 0 to 100, but I want to make sure I have them all. I was hoping to save the goodid list in a .txt file I could access later, but after some time the script fails at a random URL with an error I can't figure out. This is the error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 376, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 559, in urlopen
    body=body, headers=headers)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 378, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 1174, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 282, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 243, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/socket.py", line 571, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 54] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/adapters.py", line 370, in send
    timeout=timeout
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 609, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/util/retry.py", line 245, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/packages/six.py", line 309, in reraise
    raise value.with_traceback(tb)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 559, in urlopen
    body=body, headers=headers)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 378, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 1174, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 282, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 243, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/socket.py", line 571, in readinto
    return self._sock.recv_into(b)
requests.packages.urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/Gaborio/Dropbox/Fall 2015/Special Interest/Final Paper/Pythondata/thebigone.py", line 15, in <module>
    res = requests.get(urllook) #Talk to page again
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/adapters.py", line 412, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

It seems like the main problem is a buffering error, but it occurs at different URLs: at /197 in one run, at /273 in another, and at /692 in the last one. How can I solve this error? What does it mean?

Unrelated, but if anyone has suggestions for doing this faster, I welcome them. I'm fairly new to Python and not a programming expert in general.

EDIT: I now understand that "connection reset by peer" means the server closed the connection, but I still don't understand why, and especially why it happens at random URLs.

Gaborio
  • Possible duplicate of [What does "connection reset by peer" mean?](http://stackoverflow.com/questions/1434451/what-does-connection-reset-by-peer-mean) – ivan_pozdeev Nov 05 '15 at 23:59
  • That explains what the connection reset by peer is but I don't understand why it seems to happen at random. – Gaborio Nov 06 '15 at 00:57
  • It looks like you are sending requests as fast as you can, which can be considered bad form. Perhaps the website notices you hitting it repeatedly at speed and resets your connection. Two suggestions: 1) add a small delay between requests; 2) put your request in a try/except block (and add another delay when you get an exception). – davejagoda Nov 06 '15 at 02:44
  • Great! That seems to have solved it. I didn't add the try/except block because I don't know how to use one yet, but I will in the future. I did add a delay and it is working (still running as I write this); it is the longest run so far. Thank you! – Gaborio Nov 06 '15 at 03:02
  • Actually the try/except is what I should have done from the beginning, but I was not aware of it, thank you very much. – Gaborio Nov 06 '15 at 03:14

1 Answer


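Two recommendations:

Use time.sleep to pause between requests, so the server is less likely to reset the connection: https://docs.python.org/3.0/library/time.html#time.sleep

Use try/except around the request, so one failed connection does not stop the whole run: https://docs.python.org/3/tutorial/errors.html#handling-exceptions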
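Putting both recommendations together, here is a minimal sketch rather than a definitive implementation. The names find_good_ids, fetch_status and save_ids are hypothetical helpers made up for this example, and the 0.5 second delay, the 3 retries and the 10 second timeout are assumptions you should tune for the site. It pauses between requests, retries with a longer pause when the connection fails, and writes the good ids to a text file.

```python
import time


def fetch_status(url):
    """Return True if the URL answers with a non-error status code."""
    # Imported here so the helpers below can be tried without the network.
    import requests
    return requests.get(url, timeout=10).ok  # timeout value is an assumption


def find_good_ids(base_url, ids, fetch=fetch_status, delay=0.5, retries=3):
    """Check base_url + id for every id; return (good_ids, failed_ids)."""
    good, failed = [], []
    for candidate_id in ids:
        url = base_url + str(candidate_id)
        for attempt in range(1, retries + 1):
            try:
                ok = fetch(url)
            except OSError:
                # requests' exceptions and ConnectionResetError both derive
                # from OSError. Wait longer after each failed attempt.
                time.sleep(delay * attempt)
                continue
            if ok:
                good.append(candidate_id)
            break  # the server answered (good or bad), stop retrying
        else:
            # Every attempt raised an error: remember the id for a later run.
            failed.append(candidate_id)
        time.sleep(delay)  # polite pause before the next request
    return good, failed


def save_ids(ids, path):
    """Write one id per line to a text file."""
    with open(path, "w") as handle:
        for candidate_id in ids:
            handle.write("{}\n".format(candidate_id))
```

With the url variable from the question you would call good, failed = find_good_ids(url, range(10000)) and then save_ids(good, "goodid.txt"). The ids in failed never got an answer from the server, so they are worth re-checking in a later run instead of being treated as missing.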

davejagoda