2

I want to download an xls file by clicking the button "Export to excel" from the following url: https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD.

More specifically the button: name = "ctl00$MainContent$btndata". I've already been able to do this using selenium, but, I plan on building a docker image with this script and running as a docker container because this xls is regularly updated and I need the most current data on my local machine and it doesn't make sense to have a browser open that often to fetch this data. I understand there are headless versions of chrome and firefox although I don't believe they support downloads. Also, I understand that web get will not work in this situation because the button is not a static link to the resource. Maybe there's a completely different approach for downloading and updating this data to my computer?

import urllib
import requests
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=.08',
    'Origin': 'https://www.tampagov.net',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
    'Accept-Encoding': 'gzip,deflate,br',
    'Accept-Language': 'en-US,en;q=0.5',
}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f, "html.parser")
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']

formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate, br'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('Host', 'apps,tampagov.net'),
    ('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'))



payload = urllib.urlencode(formData)
# second HTTP request with form data
r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", params=payload)
print(r.status_code, r.reason)
  • 1
    You can look into the source code for that web page, then find out what to do for 'Export Excel' button; normally it will trigger an ajax request to one url. then in your script, simulate one ajax request to same url to get the excel data. You don't need to care the html content. – Sphinx Mar 27 '18 at 18:51
  • @Sphinx If I inspect the button then under Elements > Event Listeners > Click, It looks like there's some links to ajax requests? I really have no idea what im looking at though. I'm really bad w html & js. I'm trying to learn though lol –  Mar 27 '18 at 18:58
  • 1
    Actually, I found out how that button works. But I don't think it is a good idea to post it out. The owner of that website may kick my ass... But a hint, you can open web console, then switch to 'network' tab, then click 'Export to Excel' button, you should see one http 'POST' request in network tab. Finally in your script, simulate one same 'http post', you will get the data you need. – Sphinx Mar 27 '18 at 19:15
  • 1
    Another thing: pay attention on **__VIEWSTATE**, **__EVENTVALIDATION**, Recommand you google these two keywords to find out what they are. – Sphinx Mar 27 '18 at 19:20
  • I was able to find _VIEWSTATE and _EVENTVALIDATION in the network tab within the POST request and then read up on them for a few minutes, so when simulating the http post, this data obviously needs to be sent to the server so it thinks the button was clicked, but how do Include that in my script? –  Mar 28 '18 at 15:16
  • check this url: [How to simulate HTTP post request using Python Requests module?](https://stackoverflow.com/questions/13147914/how-to-simulate-http-post-request-using-python-requests-module); or you can google it, you will find many tutorials on http simulate. – Sphinx Mar 28 '18 at 15:52
  • So in theory, would something like this work? `import requests URL = 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD' payload = { '_EVENTVALIDATION': 'value', '_VIEWSTATE': 'value, } session = requests.session() r = requests.post(URL, data=payload) print r.cookies ` –  Mar 28 '18 at 16:17
  • You can try it first, then if respond http 200, you should almost reach the finish line. – Sphinx Mar 28 '18 at 16:28
  • What do I set as the value for _EVENTVALIDATION and _VIEWSTATE? I have so many more questions too, but i'm learning and I really appreciate you helping me with this. @Sph –  Mar 28 '18 at 16:51
  • one solution, simulate ‘http get’ for that webpage, then you will find these two fields in somewhere of the response (the response content is one html) – Sphinx Mar 28 '18 at 16:55
  • I did a curl command to simulate http get and found both fields, they look like the same values within the webconsole under the network tab. They're just super long strings of random chars. Am I supposed to just copy and paste those strings as 'value' paired with the keys '_EVENTVALIDATON' and '_VIEWSTATE'?? I'm starting to get more confused :( –  Mar 28 '18 at 17:18
  • check this url: [post request using python to asp.net page](https://stackoverflow.com/questions/14746750/post-request-using-python-to-asp-net-page) – Sphinx Mar 28 '18 at 17:28
  • Okay so i'm now scarping eventvalidation and viewstate, although how can this script download the file I need? I understand im trying to simulate the original post request from when I physically clicked the download button, what else am I missing? Also, do you know where by default this file will download to? I'm assuming the same directory as my script... @Sph –  Mar 28 '18 at 17:57
  • check this [how to download one file](https://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python), if the server accepts your http post, it will return the data you need, then save the file into your local computer. – Sphinx Mar 28 '18 at 18:28
  • Below is one header of one successful http post. `Access-Control-Allow-Origin https://www.tampagov.net Cache-Control private Content-Disposition attachment;filename="Police_calls_for_service.xls" Content-Length 4094745 Content-Type application/vnd.ms-excel Date Wed, 28 Mar 2018 18:26:16 GMT Server Microsoft-IIS/10.0 X-AspNet-Version 4.0.30319 X-FRAME-OPTIONS SAMEORIGIN X-Powered-By ASP.NET` – Sphinx Mar 28 '18 at 18:32
  • **Content-Disposition attachment;filename="Police_calls_for_service.xls" Content-Length 4094745** that is the file you need. – Sphinx Mar 28 '18 at 18:33
  • You can add your codes into your question, then we can help you from there. – Sphinx Mar 29 '18 at 16:01
  • @sphinx I updated my question to include my code, it's basically an ugly mess, i've been trying a million different things to get it working but still nothing :( sidenote: I have no idea how beautiful soup works and havent looked into it, I just saw it in another post used to get __eventvalidation and __viewstate form data. –  Mar 29 '18 at 16:35
  • I dont think i even need to encode formData since im not using urllib.request? it is what it is :'( lmao –  Mar 29 '18 at 16:45
  • See my answer below. Anyway, you are very close.... – Sphinx Mar 29 '18 at 17:45

2 Answers2

1

First: I removed import urllib because 'requests' is enough.

Some issues you have:

  1. You don't need to create one nested tuple then apply urllib.urlencode, uses one dictionary instead that is one reason why requests is so popular.

  2. You'd better populate all parameters for the http post request. like below what I did, otherwise, the request may be rejected by the backend.

  3. I added one simple codes to save the content to the local.

PS: for those form parameters, you can get their values by analysis the html responsed from http get. Also you can customize the parameters as you need, like page size etc.

Below is a working sample:

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

def downloadExcel():
    headers = {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=.08',
        'Origin': 'https://www.tampagov.net',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
        'Accept-Encoding': 'gzip,deflate,br',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    r = requests.get("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", headers=headers)
    # parse and retrieve two vital form values
    if not r.status_code == 200:
        print('Error')
        return
    soup = BeautifulSoup(r.content, "html.parser")
    viewstate = soup.select("#__VIEWSTATE")[0]['value']
    eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
    print ('__VIEWSTATE:', viewstate)
    print ('__EVENTVALIDATION:', eventvalidation)
    formData = {
        '__EVENTVALIDATION': eventvalidation,
        '__VIEWSTATE': viewstate,
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__VIEWSTATEGENERATOR': '49DF2C80',
        'MainContent_RadScriptManager1_TSM':""";;System.Web.Extensions, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35:en-US:59e0a739-153b-40bd-883f-4e212fc43305:ea597d4b:b25378d2;Telerik.Web.UI, Version=2015.2.826.40, Culture=neutral, PublicKeyToken=121fae78165ba3d4:en-US:c2ba43dc-851e-4009-beab-3032480b6a4b:16e4e7cd:f7645509:24ee1bba:c128760b:874f8ea2:19620875:4877f69a:f46195d3:92fe8ea0:fa31b949:490a9d4e:bd8f85e4:58366029:ed16cbdc:2003d0b8:88144a7a:1e771326:aa288e2d:b092aa46:7c926187:8674cba1:ef347303:2e42e72a:b7778d6c:c08e9f8a:e330518b:c8618e41:e4f8f289:1a73651d:16d8629e:59462f1:a51ee93e""",
        'search_block_form':'',
        'ctl00$MainContent$btndata':'Export to Excel',
        'ctl00_MainContent_RadWindow1_C_RadGridVehicles_ClientState':'',
        'ctl00_MainContent_RadWindow1_ClientState':'',
        'ctl00_MainContent_RadWindowManager1_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl00$PageSizeComboBox':'20',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time$dateInput':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_dateInput_ClientState':'{"enabled":true,"emptyMessage":"","validationText":"","valueAsString":"","minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00","lastSetTextBoxValue":""}',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_ClientState':'{"minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00"}',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1address':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1address_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1case_description':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1case_description_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_grid':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1report_number':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1report_number_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_max_date':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_rowcount':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox':'20',
        'ctl00_MainContent_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState':'',
        'ctl00_MainContent_RadGrid1_rfltMenu_ClientState':'',
        'ctl00_MainContent_RadGrid1_gdtcSharedTimeView_ClientState':'',
        'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_SD':'[]',
        'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_AD':'[[1900,1,1],[2099,12,31],[2018,3,29]]',
        'ctl00_MainContent_RadGrid1_ClientState':'',
        }

    # second HTTP request with form data
    r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", data=formData, headers=headers)
    print('received:', r.status_code, len(r.content))
    with open(r"C:\Users\xxx\Desktop\test\test\apps.xls", "wb") as handle:
        for data in tqdm(r.iter_content()):
            handle.write(data)

downloadExcel()
Sphinx
  • 10,519
  • 2
  • 27
  • 45
  • Since i'm on a linux machine I was getting some errors on line 68, specifically when trying to read the specified file where the mode was set to write+binary, but that was the easiest fix of this whole project. +rep to you my friend you've helped me so much @sphinx –  Mar 30 '18 at 14:36
0

Find out the URL you need to fetch as @Sphinx explains, and then simulate it using something similar to:

import urllib.request
import urllib.parse

data = urllib.parse.urlencode({...})
data = data.encode('ascii')

with urllib.request.urlopen("http://...", data) as fd:
    print(fd.read().decode('utf-8'))

Take a look at the documentation of urllib.

Acorn
  • 24,970
  • 5
  • 40
  • 69