I am trying to create a Python script that submits a search on the DOE PAMS website with specific search parameters (Institution Name like: and Most Recent Award Date:) so that I can parse the results later in my script. If the search returns multiple pages, I will need to get the data from each page. Right now I cannot get the site to return any search results at all.
I found this StackOverflow response, which seems like exactly what I need, but when I run the following code:
import requests
from lxml import etree

URL = 'https://pamspublic.science.energy.gov/WebPAMSExternal/Interface/Awards/AwardSearchExternal.aspx'

def get_fields():
    res = requests.get(URL)
    if res.ok:
        page = etree.HTML(res.text)
        fields = page.xpath('//form[@id="aspnetForm"]//input')
        return {e.attrib['name']: e.attrib.get('value', '') for e in fields}

get_fields()
I get this error:
Traceback (most recent call last):
  File "/home/austin/repos/funding-scraper/doe_scraper.py", line 15, in <module>
    get_fields()
  File "/home/austin/repos/funding-scraper/doe_scraper.py", line 13, in get_fields
    return { e.attrib['name']: e.attrib.get('value', '') for e in fields }
  File "/home/austin/repos/funding-scraper/doe_scraper.py", line 13, in <dictcomp>
    return { e.attrib['name']: e.attrib.get('value', '') for e in fields }
  File "src/lxml/etree.pyx", line 2497, in lxml.etree._Attrib.__getitem__
KeyError: 'name'
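(For context: the `KeyError: 'name'` happens because not every `<input>` on an ASP.NET page carries a `name` attribute; submit buttons and decorative inputs often don't. A minimal, self-contained sketch of the same dict comprehension with a guard, using made-up form markup rather than the live page:)

```python
from lxml import etree

# Made-up form resembling the DOE page: note the submit button
# has no name attribute, which is what raises KeyError above.
html = """
<form id="aspnetForm">
  <input name="__VIEWSTATE" value="abc"/>
  <input name="txtInstitutionName" value=""/>
  <input type="submit" value="Search"/>
</form>
"""

page = etree.HTML(html)
fields = page.xpath('//form[@id="aspnetForm"]//input')

# Skip inputs that have no name attribute instead of assuming
# every input carries one.
form_data = {
    e.attrib['name']: e.attrib.get('value', '')
    for e in fields
    if 'name' in e.attrib
}
print(form_data)  # {'__VIEWSTATE': 'abc', 'txtInstitutionName': ''}
```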
EDIT1:
An example query with specific search parameters:
Institution name like: University of Texas
Most Recent Award Date: Between 1/1/2023 and 1/31/2023
I don't know what the exact response would look like, but it should include the results of this search, with multiple HTML/JSON/XML fields for each result entry (e.g. Award Number, Title, Institution, Amount Awarded to Date, etc.)
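(Once the results come back as HTML, extracting those per-row fields is straightforward. A sketch against a hypothetical results table; the real PAMS grid markup will differ, so the class name and cell order here are assumptions, not the site's actual structure:)

```python
from bs4 import BeautifulSoup

# Hypothetical snippet of one results-grid row; the actual PAMS
# markup and CSS classes are assumptions for illustration only.
html = """
<table class="rgMasterTable">
  <tr><th>Award Number</th><th>Title</th><th>Institution</th></tr>
  <tr><td>DE-SC0001234</td><td>Example Project</td><td>University of Texas</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.select('table.rgMasterTable tr')[1:]  # skip the header row
results = [
    [td.get_text(strip=True) for td in row.find_all('td')]
    for row in rows
]
print(results)  # [['DE-SC0001234', 'Example Project', 'University of Texas']]
```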
EDIT2:
After much trial and error, I pieced together a half-solution:
import requests
from bs4 import BeautifulSoup

URL = 'https://pamspublic.science.energy.gov/WebPAMSExternal/Interface/Awards/AwardSearchExternal.aspx'

def get_fields():
    res = requests.get(URL)
    if res.ok:
        soup = BeautifulSoup(res.content, 'html.parser')
        script_manager = soup.find(attrs={"name": "ctl00_REIRadScriptManager1_TSM"})['value']
        viewstate = soup.find(attrs={"name": "__VIEWSTATE"})['value']
        my_dict = {
            "ctl00_REIRadScriptManager1_TSM": script_manager,
            "__VIEWSTATE": viewstate,
            "ctl00$MainContent$pnlSearch$txtInstitutionName": "University of Texas",
            "ctl00$MainContent$pnlSearch$dpAwardDateFrom$dateInput": "1/1/2023",
            "ctl00$MainContent$pnlSearch$dpAwardDateTo$dateInput": "1/31/2023"
        }
        return my_dict

def query():
    formdata = get_fields()
    res = requests.post(URL, formdata)
    if res.ok:
        soup = BeautifulSoup(res.content, 'html.parser')
        with open('results.html', 'w') as results:
            results.write(str(soup))

query()
This creates a parseable document (which I will figure out later) that includes search results. However, the DateFrom and DateTo filters are not being applied, and only the first 15 results are returned. Any help on adding these parameters to my POST request would be appreciated!
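(One untested guess about the date filters: the `dpAwardDateFrom`/`dpAwardDateTo` controls look like Telerik RadDatePickers, which typically post their value through a companion `_ClientState` hidden field containing JSON, not through the visible `dateInput` alone. The exact field names and JSON shape below are assumptions patterned on the `dateInput` names above and on how Telerik controls usually serialize state, so they may need adjusting against the live form:)

```python
import json

def date_picker_state(date_iso: str, display_text: str) -> str:
    """Build the JSON a Telerik RadDatePicker usually posts.

    ASSUMPTION: "valueAsString" uses Telerik's
    "YYYY-MM-DD-00-00-00" convention; verify against the
    hidden fields in the actual page source.
    """
    return json.dumps({
        "enabled": True,
        "emptyMessage": "",
        "validationText": f"{date_iso}-00-00-00",
        "valueAsString": f"{date_iso}-00-00-00",
        "lastSetTextBoxValue": display_text,
    })

# Hypothetical extra entries to merge into my_dict before posting;
# the field names are guesses based on the control IDs above.
extra_fields = {
    "ctl00_MainContent_pnlSearch_dpAwardDateFrom_dateInput_ClientState":
        date_picker_state("2023-01-01", "1/1/2023"),
    "ctl00_MainContent_pnlSearch_dpAwardDateTo_dateInput_ClientState":
        date_picker_state("2023-01-31", "1/31/2023"),
}
```

For the 15-result limit, grids like this usually page via ASP.NET postbacks, so reaching page 2 likely means re-posting with `__EVENTTARGET` set to the pager link's name (again, something to confirm in the page source or browser dev tools).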