
I am trying to scrape a website by submitting its advanced search form with a POST request:

http://www.planning2.cityoflondon.gov.uk/online-applications/search.do?action=advanced

In Python, this goes as follows:

import requests
import webbrowser

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    # Hard-coded session cookie copied from Chrome DevTools; this goes stale.
    'Cookie': 'JSESSIONID=OwXG0Hkxj+X9ELygHZa-aLQ5.undefined; _ga=GA1.3.1911942552.',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'www.planning2.cityoflondon.gov.uk',
    'Origin': 'http://www.planning2.cityoflondon.gov.uk',
    'Referer': 'http://www.planning2.cityoflondon.gov.uk/online-applications/search.do?action=advanced',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

data = {
    'searchCriteria.developmentType': '002',
    'date(applicationReceivedStart)': '01/08/2000',
    'date(applicationReceivedEnd)': '01/08/2018'
}

url = 'http://www.planning2.cityoflondon.gov.uk/online-applications/advancedSearchResults.do?action=firstPage'
test_file = 'planning_app.html'

with requests.Session() as session:
    r = session.post(url, headers=headers, data=data)
    with open(test_file, 'w') as file:
        file.write(r.text)
    webbrowser.open(test_file)

As you can see from the page opened by webbrowser, this returns an error page complaining about an outdated cookie.

For this to work I currently have to go to the webpage manually, perform a query with the Chrome DevTools Network tab open, look at the cookie in the request headers, and copy-paste it into my code. This works until, of course, the cookie expires again.

I tried to automate that retrieval of the cookie by doing the following:

headers_get = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.planning2.cityoflondon.gov.uk',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

with requests.Session() as session:
    c = session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/', headers=headers_get)
    # Splice the fresh JSESSIONID from the GET response into the POST headers
    # (this assumes the first cookie in the jar is JSESSIONID).
    headers['Cookie'] = 'JSESSIONID=' + list(c.cookies.get_dict().values())[0]
    r = session.post(url, headers=headers, data=data)
    with open(test_file, 'w') as file:
        file.write(r.text)
    webbrowser.open(test_file)

I would expect this to work, as it simply automates what I do manually: go to the page with a GET request, read the cookie from the response, and add that cookie to the headers dict of the POST request.

However, I still receive the 'server error' page from the POST request. Can anyone explain why this happens?


1 Answer


requests.post accepts a cookies keyword parameter. Using it instead of sending the cookie directly in the header may fix the problem:

with requests.Session() as session:
    c = session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/', headers=headers_get)
    # Alternatively, you can pass cookies=session.cookies
    r = session.post(url, headers=headers, data=data, cookies=c.cookies)
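
As a side note, a requests.Session already stores cookies received from earlier responses in session.cookies and resends them automatically on later requests, so the Cookie header does not need to be built by hand at all. A minimal sketch, reusing url, data, and headers_get from the question:

with requests.Session() as session:
    # The GET stores the fresh JSESSIONID in session.cookies ...
    session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/', headers=headers_get)
    # ... and the session resends it automatically on the POST. Drop the
    # hard-coded 'Cookie' entry so the stale value cannot interfere.
    headers.pop('Cookie', None)
    r = session.post(url, headers=headers, data=data)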

Basically, I suppose there may be some JavaScript logic on the site which isn't executed when the form is submitted via requests.post. If that's the case, you will have to use Selenium to fill in and submit the form; see the sketch below.
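
A minimal sketch of that approach. The field names come from the data dict in the question; treating developmentType as a <select> and the submit-button selector are assumptions that may need adjusting to the real page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get('http://www.planning2.cityoflondon.gov.uk/online-applications/search.do?action=advanced')

# Field names taken from the form data in the question.
Select(driver.find_element(By.NAME, 'searchCriteria.developmentType')).select_by_value('002')
driver.find_element(By.NAME, 'date(applicationReceivedStart)').send_keys('01/08/2000')
driver.find_element(By.NAME, 'date(applicationReceivedEnd)').send_keys('01/08/2018')

# Assumed selector for the search button -- adjust to the actual page.
driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()

html = driver.page_source  # results HTML, with any JavaScript already executed
driver.quit()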

Please see Dynamic Data Web Scraping with Python, BeautifulSoup, which covers a similar problem: JavaScript not being executed.

Andriy Ivaneyko