I am new to Python and am learning to use it to scrape some data, but I cannot download an Excel file, for a reason I don't understand. When I open this link in any browser, it offers to save an Excel file:

http://www5.registraduria.gov.co/CuentasClarasPublicoCon2014/Consultas/Candidato/Formulario5xls/2

Based on a previous question (see "downloading an excel file from the web in python"), I'm using requests in Python 3 like this:

import requests, os


url="http://www5.registraduria.gov.co/CuentasClarasPublicoCon2014/Consultas/Candidato/Formulario5xls/2"

print("Downloading...")
resp = requests.get(url)
output = open('test.xls', 'wb')
output.write(resp.content)
output.close()
print("Done!")

I don't think the problem is in the part of the code that writes the data, since test.xls is being created, but as an empty file. The requests.get call gives me the following error (followed by several more):

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests/packages/urllib3/response.py", line 417, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

I also tried using urllib, but that failed as well.

Gaborio
  • There appear to be problems with the URL if you try it with wget or curl. From wget: "2015-11-03 22:01:59 (52.7 KB/s) - Read error at byte 21504 (Success).Retrying.". From curl: "curl: (18) transfer closed with outstanding read data remaining" – Will Hogan Nov 04 '15 at 03:03

1 Answer

Seems like this is a known issue.

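One way to work around it is to force HTTP 1.0, which has no chunked transfer encoding, so the chunk-length parsing that fails in your traceback should never run. To do this, set the httplib variables _http_vsn and _http_vsn_str like so.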

For Python 2

import requests, os
import httplib

httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

url="http://www5.registraduria.gov.co/CuentasClarasPublicoCon2014/Consultas/Candidato/Formulario5xls/2"

print("Downloading...")
resp = requests.get(url)
with open('test.xls', 'wb') as output:
    output.write(resp.content)
print("Done!")

For Python 3, httplib was renamed to http.client, so the code becomes:

import requests, os
import http.client

http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'

url="http://www5.registraduria.gov.co/CuentasClarasPublicoCon2014/Consultas/Candidato/Formulario5xls/2"

print("Downloading...")
resp = requests.get(url)
with open('test.xls', 'wb') as output:
    output.write(resp.content)
print("Done!")
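
Note that the two assignments above change class attributes, so they stay in effect for every connection made later in the same process. If you would rather limit the change to a single block, a context manager can put the original values back afterwards. This is only a sketch: force_http10 is a name made up here, and _http_vsn / _http_vsn_str are private attributes, so their behaviour may differ between Python or urllib3 versions.

```python
import contextlib
import http.client


@contextlib.contextmanager
def force_http10():
    """Temporarily make http.client speak HTTP/1.0, then restore the old values.

    force_http10 is a hypothetical helper, not part of requests or http.client.
    """
    conn = http.client.HTTPConnection
    # Remember whatever was set before so it can be restored on exit.
    saved = (conn._http_vsn, conn._http_vsn_str)
    conn._http_vsn, conn._http_vsn_str = 10, 'HTTP/1.0'
    try:
        yield
    finally:
        # Runs even if the download inside the block raises.
        conn._http_vsn, conn._http_vsn_str = saved
```

It would be used by wrapping the download, i.e. placing resp = requests.get(url) inside a with force_http10(): block.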
Paul Rooney
  • Thanks! This worked perfectly. One question: if I'm looping over several links, I only have to change the http variables once, right? – Gaborio Nov 05 '15 at 06:41
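
On the loop question: _http_vsn and _http_vsn_str are set on the HTTPConnection class itself, so setting them once before the loop should be enough for every later request in the same process. A minimal sketch of such a loop, assuming a hypothetical download_all helper, a made-up file_N.xls naming scheme, and a fetch callable that you supply (none of these come from the answer above):

```python
import http.client
import os

# Set once per process: these are class attributes, so every connection
# created afterwards picks them up.
http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'


def download_all(urls, fetch, out_dir='.'):
    """Save each URL's body in out_dir as file_1.xls, file_2.xls, ...

    fetch is any callable that takes a URL and returns bytes
    (a hypothetical parameter, not part of requests).
    """
    paths = []
    for index, url in enumerate(urls, start=1):
        path = os.path.join(out_dir, 'file_{}.xls'.format(index))
        with open(path, 'wb') as output:
            output.write(fetch(url))
        paths.append(path)
    return paths
```

With requests, the call could look like download_all(urls, fetch=lambda u: requests.get(u).content), where urls is your own list of links.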