
I've been developing a Python script to download a CSV file from a web server. My usual methodology is to right-click on the web page, go to "Inspect Element" (in Chrome), switch to the Network view, and then click the link to see what the traffic looks like. I was expecting to see something like "https://domain.com/file_i_need.csv", but instead what I got was the location of a Perl script. Since I'm not familiar with exactly how this works, I just copied the cURL command (right-click on the relevant network traffic and "Copy as cURL"). So I initially just issued a curl command via os.system(), and once I got that working I tried to modify the script to use pycurl. Now I'd like to change this to use the requests library (mostly for elegance/neatness). I've seen this question answered, but I'm wondering if there's a different way of doing it since the backend is slightly different than expected. I see that urllib.urlretrieve() is recommended as an alternative, but I'm guessing that won't work here.

question: How can I download a file from a web server when the HTTP endpoint that generates the file is a Perl script?

i.e. https://domain.com/file_maker.pl?param1=12345

curl command: `curl "https://release.domain.com/release_cr_new.pl?releaseid=26851&v=2&m=a&dump_csv=1" -H "Accept-Encoding: gzip,deflate,sdch" -H "Host: release.domain.com" -H "Accept-Language: en-US,en;q=0.8" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8" -H "Referer: https://release.domain.com/release_cr_new.html?releaseid=26851&v=2&m=a" -H "Cookie: releasegroup=Development; XR77=3q3pzeMQc1gf-jDlpNtkgr4WvZYqxVZSYzeQHfGAwMTAeZQ6D3g2e6w; __utma=147924903.423899313.1373397746.1378841205.1380290587.15; __utmc=147924903; __utmz=147924903.1380290587.15.14.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); pubcookie_s_release.domain.com=Hm17WT1VJbPpBLOQ+NhtyBbZlfO9qntsoGP0P8BEVeh4d0ay+THE3EkNLc6PV5rJ40Ui7uj/+c6f2tzZYWOJ/j+dyoP5l+J//rL875K9ERxio1FZeiUVRQgeabetZ+V1AWlrkjURmAw2SU1hEz/f2pCt0sHe06C14vWA95PFu1Smp6viWOL8QnaPHFWhGU3uQQH5Wxex0CziHbrYXHuKwnxwWejvVtTM8e8aIHkM2WuB3IIDhGMVtd0r292owvcv6Rvcl7tYSoQaQYfSpPZreXo4tNO9gh9ZIGqao8LaCfG5Fw8+Ow5wQKf2ryVuPc8Ah4MTIzC1UeZxBtxSTyZk5E1in7LCV9E+d/5G84U+ECcdn166gJg1iMG68II81YJO9fYs91gGtA5iUa6h3RpFo+ysBkqbHjCpetOUxfHh47sdr4nUoIWEb0LfKVTYfvmW6BNGx4m90PqE8aQlknv7zxqAQrujqe7h5zSpmaD5UjrfRwp7lYD+6e88vgQzLgWlcAA=; _session_id=eb0095f849a509c3cf65b43680b3002a; default_column_2=bugid%2Cloginname%2Ccomponent%2Cversionvalue%2Cbugdate%2Cshortdescription%2Cpriority%2Cstatus%2Cqacontact%2Csqa_status%2Cis_dep" -H "Connection: keep-alive"`

sorry for the big block of text.

Ramy

1 Answer

If you want to stream the data from the server:

# UNTESTED
import csv
import requests

# Connect to the web server; stream=True defers reading the body.
response = requests.get("https://domain.com/file_maker.pl?param1=12345", stream=True)

# Read the data as CSV, decoding each line to text as it arrives.
data = csv.reader(response.iter_lines(decode_unicode=True))

# Use the data
for line in data:
    print(line)
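If the first row of the CSV is a header, the standard library's `csv.DictReader` gives named access to each column instead of positional indexing. A minimal sketch on sample data (the column names here are made up; substitute the real response lines):

```python
import csv
import io

# Sample CSV text standing in for the server response (hypothetical columns).
sample = "bugid,status\n101,open\n102,closed\n"

# DictReader consumes the header row and yields one dict per data row.
reader = csv.DictReader(io.StringIO(sample))
for row in reader:
    print(row["bugid"], row["status"])
```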

Or, if you want to download the file from the server and store it locally:

# UNTESTED
import requests

# Connect to the web server.
response = requests.get("https://domain.com/file_maker.pl?param1=12345")

# Store the data; response.content is bytes, so open the file in binary mode.
with open('outfile', 'wb') as outfile:
    outfile.write(response.content)
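For large downloads, a common variant (a sketch, not part of the original answer) is to combine `stream=True` with `iter_content` so the whole body never sits in memory at once; the URL below is the hypothetical one from the question:

```python
import requests

# UNTESTED - stream=True defers the body; iter_content reads it in chunks.
response = requests.get("https://domain.com/file_maker.pl?param1=12345", stream=True)
response.raise_for_status()

with open("outfile", "wb") as outfile:
    # Copy the body in 8 KiB chunks instead of buffering it all at once.
    for chunk in response.iter_content(chunk_size=8192):
        outfile.write(chunk)
```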

It appears that in your particular case the CGI script requires some specific header or cookie in order to return the correct data. I don't know which header or cookie it requires, so just send them all:

url = "https://release.domain.com/release_cr_new.pl?releaseid=26851&v=2&m=a&dump_csv=1"
headers = {
  "Accept-Encoding" : "gzip,deflate,sdch",
  "Accept-Language" : "en-US,en;q=0.8",
  "User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36",
  "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Referer" : "https://release.domain.com/release_cr_new.html?releaseid=26851&v=2&m=a",
  "Cookie" : "releasegroup=Development; XR77=3q3pzeMQc1gf-jDlpNtkgr4WvZYqxVZSYzeQHfGAwMTAeZQ6D3g2e6w; __utma=147924903.423899313.1373397746.1378841205.1380290587.15; __utmc=147924903; __utmz=147924903.1380290587.15.14.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); pubcookie_s_release.domain.com=Hm17WT1VJbPpBLOQ+NhtyBbZlfO9qntsoGP0P8BEVeh4d0ay+THE3EkNLc6PV5rJ40Ui7uj/+c6f2tzZYWOJ/j+dyoP5l+J//rL875K9ERxio1FZeiUVRQgeabetZ+V1AWlrkjURmAw2SU1hEz/f2pCt0sHe06C14vWA95PFu1Smp6viWOL8QnaPHFWhGU3uQQH5Wxex0CziHbrYXHuKwnxwWejvVtTM8e8aIHkM2WuB3IIDhGMVtd0r292owvcv6Rvcl7tYSoQaQYfSpPZreXo4tNO9gh9ZIGqao8LaCfG5Fw8+Ow5wQKf2ryVuPc8Ah4MTIzC1UeZxBtxSTyZk5E1in7LCV9E+d/5G84U+ECcdn166gJg1iMG68II81YJO9fYs91gGtA5iUa6h3RpFo+ysBkqbHjCpetOUxfHh47sdr4nUoIWEb0LfKVTYfvmW6BNGx4m90PqE8aQlknv7zxqAQrujqe7h5zSpmaD5UjrfRwp7lYD+6e88vgQzLgWlcAA=; _session_id=eb0095f849a509c3cf65b43680b3002a; default_column_2=bugid%2Cloginname%2Ccomponent%2Cversionvalue%2Cbugdate%2Cshortdescription%2Cpriority%2Cstatus%2Cqacontact%2Csqa_status%2Cis_dep"
}

response = requests.get(url, headers=headers)
Robᵩ
  • Sorry, I was unclear: the URL is NOT "domain.com/file_i_need.csv". If that were the case, I could do it without asking a question. But it's not the case; instead I have to call a Perl script on the web server to generate and download. – Ramy Sep 30 '13 at 16:10
  • Doing this just saves a bunch of HTML to my outfile. This is why I went the curl route originally, because the curl command (generated from the browser) looks quite different from the HTML request. – Ramy Sep 30 '13 at 16:23
  • Updated my question. Hopefully I masked out enough so it's safe to post. – Ramy Sep 30 '13 at 17:00
  • You probably need to include those headers and cookies in your request. See my edit. – Robᵩ Sep 30 '13 at 17:12
  • Brilliant. Sorry I was being dense there. – Ramy Sep 30 '13 at 17:17