I am trying to find out whether each of a set of URLs exists or returns an error, without having to go through all of them. I am using Python 3.5.0. The basic URL is http://www5.registraduria.gov.co/CuentasClarasPublicoCon2014/Consultas/Candidato/Reporte/2 and it changes only in the number at the end (from 0 to at most 10000). I tried the following:
import requests, os, bs4
url = 'http://www5.registraduria.gov.co/CuentasClarasPublicoCon2014/Consultas/Candidato/Reporte/'
start = 0  # Start the count
urllook = url + str(start)  # Add the count to the URL
res = requests.get(urllook)  # Talk to the page
goodid = []  # Create an empty list of candidate ids
while start < 10000:  # There seem to be no candidates above 5000; made it big just to be sure
    res = requests.get(urllook)  # Talk to the page again
    if res.ok:  # If no error
        goodid.append(start)  # Add the current id to the goodid list (before incrementing)
        start = start + 1  # Increase the count by one
        print(start)  # Which page I'm at
        urllook = url + str(start)  # Move the URL on by one
    else:  # If error
        start = start + 1  # Increase the count by one
        print(start)  # Which page I'm at
        urllook = url + str(start)  # Move the URL on by one
print("LOL")
This works for a small number of URLs, such as 0 to 100, but I want to make sure I have them all. I was hoping to save the goodid list in a .txt file I could access later, but after some time the script fails at seemingly random URLs with an error I can't figure out. This is the error:
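For the saving step mentioned above, a minimal sketch of what I have in mind is below. The helper names save_ids and load_ids, the file name goodid.txt, and the sample ids are placeholders for illustration, not part of my actual script.

```python
import os
import tempfile

def save_ids(ids, path):
    """Write one id per line so the file is easy to read back later."""
    with open(path, "w") as handle:
        for candidate_id in ids:
            handle.write("{}\n".format(candidate_id))

def load_ids(path):
    """Read the ids back as a list of integers, skipping blank lines."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]

# Round trip in a temporary directory; the ids are made-up sample values.
demo_path = os.path.join(tempfile.mkdtemp(), "goodid.txt")
save_ids([0, 2, 5], demo_path)
print(load_ids(demo_path))
```

Writing the file after every few hundred ids, rather than only at the end, would presumably keep partial results if the script crashes midway.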
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 376, in _make_request
httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 559, in urlopen
body=body, headers=headers)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 378, in _make_request
httplib_response = conn.getresponse()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 1174, in getresponse
response.begin()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 282, in begin
version, status, reason = self._read_status()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 243, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/socket.py", line 571, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [Errno 54] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/adapters.py", line 370, in send
timeout=timeout
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 609, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/util/retry.py", line 245, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/packages/six.py", line 309, in reraise
raise value.with_traceback(tb)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 559, in urlopen
body=body, headers=headers)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 378, in _make_request
httplib_response = conn.getresponse()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 1174, in getresponse
response.begin()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 282, in begin
version, status, reason = self._read_status()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 243, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/socket.py", line 571, in readinto
return self._sock.recv_into(b)
requests.packages.urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/Gaborio/Dropbox/Fall 2015/Special Interest/Final Paper/Pythondata/thebigone.py", line 15, in <module>
res = requests.get(urllook) #Talk to page again
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/sessions.py", line 576, in send
r = adapter.send(request, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/adapters.py", line 412, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
It seems like the main problem is that it returns a buffering error, but it does so at different URLs: it happened at /197 in one run, at /273 in another, and at /692 in the last one. How can I solve this error? What does it mean?
Unrelated, but if anyone has suggestions for doing this faster, I welcome them. I'm fairly new to Python and not a programming expert in general.
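On the speed question, one direction I could imagine is checking several ids at once with a small thread pool. The sketch below is only that: find_good_ids and the stand-in check function are hypothetical names, and the check is a fake one so the example runs without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor

def find_good_ids(check, ids, workers=4):
    """Run check(id) on a few threads and return the ids for which it is truthy."""
    ids = list(ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with ids.
        results = list(pool.map(check, ids))
    return [i for i, ok in zip(ids, results) if ok]

# Stand-in check so the sketch runs offline: even ids count as "good".
print(find_good_ids(lambda i: i % 2 == 0, range(6)))
```

With requests, the check could be something like lambda i: requests.get(url + str(i), timeout=10).ok. I assume that sending more requests in parallel could also make the server drop connections more often, so a small number of workers seems safer.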
EDIT: I understand now that "connection reset by peer" means the server closed the connection, but I still don't understand why, and especially why it happens at random URLs.
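Given that, a workaround I am considering is to catch the exception, wait a moment, and retry the same URL instead of letting the script die. This is a rough sketch under assumptions: fetch_with_retry is a hypothetical helper, the retry count and pause are arbitrary, and the flaky function only simulates a server that resets the first two connections.

```python
import time

def fetch_with_retry(fetch, url, retry_on=(ConnectionError,), retries=3, pause=1.0):
    """Call fetch(url); on one of the retry_on exceptions, wait and try again.

    Returns whatever fetch returns, or None if every attempt failed.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except retry_on:
            # Wait a little longer after each failed attempt.
            time.sleep(pause * (attempt + 1))
    return None

# Simulated server: the first two calls are reset, the third succeeds.
calls = []
def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionResetError(54, "Connection reset by peer")
    return "ok"

result = fetch_with_retry(flaky, "http://example.invalid/1", pause=0)
print(result, len(calls))
```

In my script the call would presumably pass fetch=lambda u: requests.get(u, timeout=10) and retry_on=(requests.exceptions.ConnectionError,), since the exception raised by requests is its own class rather than the built-in ConnectionError. A short time.sleep between successive ids might also help if the server is dropping connections because the requests arrive too quickly, though I have not confirmed that this is the cause.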