
I'm trying to download several large files from a server through a URL.

When downloading the file through Chrome or Internet Explorer, it takes around 4-5 minutes to download the file, which is around 100 MB in size.

But when I try to do the same download using either PycURL

from io import BytesIO
import pycurl

buffer = BytesIO()
ch = pycurl.Curl()
ch.setopt(pycurl.URL, url)
ch.setopt(pycurl.TRANSFERTEXT, True)      # treat the transfer as text
ch.setopt(pycurl.AUTOREFERER, True)       # set Referer automatically on redirects
ch.setopt(pycurl.FOLLOWLOCATION, True)    # follow HTTP redirects
ch.setopt(pycurl.POST, False)             # plain GET request
ch.setopt(pycurl.SSL_VERIFYPEER, 0)       # skip certificate verification
ch.setopt(pycurl.WRITEFUNCTION, buffer.write)  # collect the body in memory
ch.perform()

Or using requests

r = requests.get(url).text

I get one of the following errors:

pycurl.error: (56, 'OpenSSL SSL_read: Connection was reset, errno 10054')

or

[Errno 10054] An existing connection was forcibly closed by the remote host.
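
For reference, a streaming variant of the requests call, writing the body to disk in chunks so the ~100 MB file never has to sit in memory, would look roughly like this (the output filename and timeout value are placeholders):

import requests

# Stream the body in chunks instead of buffering it all in memory;
# "output.xml" and the 600-second timeout are placeholder values.
with requests.get(url, stream=True, timeout=600) as r:
    r.raise_for_status()
    with open("output.xml", "wb") as f:
        for chunk in r.iter_content(chunk_size=64 * 1024):
            f.write(chunk)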

When I inspect the download of the large file in Chrome, this is what I see:

 General:
 Referrer Policy: no-referrer-when-downgrade 

Request Headers:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cache-Control: no-cache
Connection: keep-alive 
Cookie: JSESSIONID=****
Pragma: no-cache
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/... (KHTML, like Gecko) Chrome/... Safari/...

Is there anything I can do in my configuration to keep the connection from closing, as happens when I access it through my browser? Or is the problem on the server side?

EDIT

To add more information: most of the time after the request is made is spent waiting for the server to assemble the data before the actual download starts (it generates an XML file by aggregating data from different sources).
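
Would enabling TCP keepalive on the pycurl handle help the connection survive that long idle wait? A sketch of what I have in mind, applied to the handle from the snippet above (the timing values are guesses):

# Send TCP keepalive probes while the server spends minutes building the
# XML before the first byte arrives; the timing values below are guesses.
ch.setopt(pycurl.TCP_KEEPALIVE, 1)    # enable keepalive probes
ch.setopt(pycurl.TCP_KEEPIDLE, 120)   # idle seconds before the first probe
ch.setopt(pycurl.TCP_KEEPINTVL, 60)   # seconds between subsequent probes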

kspr
  • Follow instructions in http://pycurl.io/docs/latest/troubleshooting.html#transfer-related-issues to obtain diagnostic output for your transfer. – D. SM Feb 13 '20 at 03:53

1 Answer


Try adding headers and cookies to your request.

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"}
cookies = {"Cookie1": "Value1"}


r = requests.get(url, headers=headers, cookies=cookies)
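
Putting it together with the session cookie copied from your browser and a generous timeout might look like this (the cookie value, filename, and timeout are placeholders):

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"}
cookies = {"JSESSIONID": "paste-the-value-from-your-browser"}  # placeholder value

r = requests.get(url, headers=headers, cookies=cookies, timeout=600)
r.raise_for_status()
with open("output.xml", "wb") as f:  # placeholder filename
    f.write(r.content)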
isopach
  • what is driver in this instance? – kspr Feb 12 '20 at 09:12
  • @kspr Sorry I have simplified my answer. – isopach Feb 12 '20 at 09:15
  • So how do I get the cookie value before making the request? Not sure if I understand. I guess I should not put Value1? – kspr Feb 12 '20 at 09:16
  • @kspr you take it from your browser and put it in there. PHPSESSID can be reused. Otherwise, you can always send a request first and use the cookie value in the next request (see the sketch below). – isopach Feb 12 '20 at 09:17
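
A minimal sketch of that two-request pattern with requests.Session, which stores the cookies set by the first response and sends them automatically on the next request (login_url here is hypothetical):

import requests

s = requests.Session()
s.headers.update(headers)   # reuse the browser-like User-Agent from above
s.get(login_url)            # hypothetical first request; the server sets JSESSIONID
r = s.get(url)              # the stored session cookie is sent automatically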