The problem
I need to search for information about a company using the following link.
I want to search by entity name, with the search type drop-down set to "begin with". I would also like to select "All items" in the "Display number of items to view" drop-down. For example, if I enter "google" in the "Enter name" text box, the script should return a list of companies whose names start with "google" (though this is just the starting point of what I want to do).
Question: How should I use Python to do this? I found the following thread: Using Python to ask a web page to run a search
I tried the example in the first answer; the code is below:
from bs4 import BeautifulSoup as BS
import requests
protein='Q9D880'
text = requests.get('http://www.uniprot.org/uniprot/' + protein).text
soup = BS(text)
MGI = soup.find(name='a', onclick="UniProt.analytics('DR-lines', 'click', 'DR-MGI');").text
MGI = MGI[4:]
print protein +' - ' + MGI
The above code works because the UniProt website contains analytics, which takes those parameters. However, the website I am using doesn't have that.
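As far as I can tell, the `onclick` argument in that example only serves to pick out one particular link by the literal value of its attribute. Here is a rough sketch of that idea using only the standard library (Python 3). The HTML fragment and the `MGI:0000000` value are made up for illustration, and `LinkByOnclick` is a hypothetical helper, not part of BeautifulSoup:

```python
from html.parser import HTMLParser

# Made-up HTML fragment for illustration; not copied from the real UniProt page.
SAMPLE = """<a href="/x">other link</a><a onclick="UniProt.analytics('DR-lines', 'click', 'DR-MGI');">MGI:0000000</a>"""


class LinkByOnclick(HTMLParser):
    """Hypothetical helper: grab the text of the first <a> whose onclick matches."""

    def __init__(self, onclick):
        super().__init__()
        self.onclick = onclick
        self.inside = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; compare the onclick value literally.
        if tag == 'a' and dict(attrs).get('onclick') == self.onclick:
            self.inside = True

    def handle_data(self, data):
        # Keep only the first piece of text found inside the matching link.
        if self.inside and self.text is None:
            self.text = data

    def handle_endtag(self, tag):
        if tag == 'a':
            self.inside = False


parser = LinkByOnclick("UniProt.analytics('DR-lines', 'click', 'DR-MGI');")
parser.feed(SAMPLE)
mgi = parser.text[4:]  # drop the leading "MGI:" prefix, as in the example above
print(mgi)
```

If that reading is right, a page without such an attribute needs some other way to identify the element or form I am after.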
I also tried to do the same thing as the first answer in this thread: how to submit query to .aspx page in python
However, the example code provided in the first answer does not work on my machine (Ubuntu 12.4 with Python 2.7). I am also not clear about which values should go there, since I am dealing with a different aspx website.
How could I use Python to start a search with certain criteria? (I am not sure this is the proper web terminology; maybe "submit a form"?)
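To make concrete what I mean by "submit a form", here is a minimal sketch of my understanding, using only the Python 3 standard library. The URL and the field names (`entity_name`, `search_type`) are hypothetical placeholders, not the real ones from the site, and the request is only built, never sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical URL and field names, for illustration only.
URL = 'http://example.com/search.aspx'
form = {'entity_name': 'google', 'search_type': 'B'}

# A form submission is a set of key/value pairs encoded into the request body.
body = urlencode(form).encode('ascii')

# Supplying data turns the request into a POST; nothing is sent here.
req = Request(URL, data=body)

print(req.get_method())
print(body)
```

What I do not know is which keys and values the real page expects.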
I come from a C++ background and have not done any web development. I am also learning Python. Any help is greatly appreciated.
First EDIT:
With great help from @Kabie, I collected the following code (trying to understand how it works):
import requests
from lxml import etree

URL = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearch.aspx'

# With get_fields(), we fetch all <input>s from the form.
def get_fields():
    res = requests.get(URL)
    if res.ok:
        page = etree.HTML(res.text)
        fields = page.xpath('//form[@id="Form1"]//input')
        return { e.attrib['name']: e.attrib.get('value', '') for e in fields }

# Hard-code some selects from the form.
def query(data):
    formdata = get_fields()
    formdata.update({
        'ctl00$MainContent$ddRecordsPerPage':'25',
    }) # Hardcode some <select> value
    formdata.update(data)
    res = requests.post(URL, formdata)
    if res.ok:
        page = etree.HTML(res.text)
        return page.xpath('//table[@id="MainContent_SearchControl_grdSearchResultsEntity"]//tr')

def search_by_entity_name(entity_name, entity_search_type='B'):
    return query({
        'ctl00$MainContent$CorpSearch':'rdoByEntityName',
        'ctl00$MainContent$txtEntityName': entity_name,
        'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
    })

result = search_by_entity_name('google')
The above code is put in a script named query.py. I got the following error:
Traceback (most recent call last):
  File "query.py", line 39, in <module>
    result = search_by_entity_name('google')
  File "query.py", line 36, in search_by_entity_name
    'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
  File "query.py", line 21, in query
    formdata.update({
AttributeError: 'NoneType' object has no attribute 'update'
It seems to me that the search is not successful? Why?
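My current guess is that `get_fields()` returned `None`. Its `return` sits inside `if res.ok:`, so if the initial GET came back with an error status, the function falls off the end without returning anything. Below is a small Python 3 sketch of that behaviour. `FakeResponse` is a hypothetical stand-in for the `requests` response so that nothing touches the network, and the `'__VIEWSTATE': 'placeholder'` entry is made-up data:

```python
class FakeResponse:
    """Hypothetical stand-in for a requests response; only models `ok`."""

    def __init__(self, status_code):
        self.status_code = status_code

    @property
    def ok(self):
        # Assumption mirroring requests: status codes below 400 count as ok.
        return self.status_code < 400


def get_fields_like_mine(res):
    # Same shape as get_fields(): the return sits inside the `if`.
    if res.ok:
        return {'__VIEWSTATE': 'placeholder'}
    # No else branch: falling off the end returns None implicitly.


def get_fields_checked(res):
    # Variant that fails loudly instead of silently returning None.
    if not res.ok:
        raise RuntimeError('GET failed with status %d' % res.status_code)
    return {'__VIEWSTATE': 'placeholder'}


good = get_fields_like_mine(FakeResponse(200))
bad = get_fields_like_mine(FakeResponse(500))
print(good)
print(bad)  # None, so calling bad.update(...) would raise AttributeError
```

If this is what happens, printing `res.status_code` inside `get_fields()` should show which status the site actually returned.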