
The problem

I have the following question: I need to search for some information about a company using the following link.

What I need to do is search by entity name, with the search type drop-down set to "begin with". I would also like to select "All items" per page in the "Display number of items to view" part. For example, if I enter "google" in the "Enter name" text box, the script should return a list of companies whose names start with "google" (though this is just the starting point of what I want to do).

Question: How should I use Python to do this? I found the following thread: Using Python to ask a web page to run a search

I tried the example in the first answer; the code is below:

from bs4 import BeautifulSoup as BS
import requests

protein='Q9D880'

text = requests.get('http://www.uniprot.org/uniprot/' + protein).text
soup = BS(text)
MGI = soup.find(name='a', onclick="UniProt.analytics('DR-lines', 'click', 'DR-MGI');").text
MGI = MGI[4:]
print protein +' - ' + MGI

The above code works because the UniProt website contains analytics, which takes those parameters. However, the website I am using doesn't have that.

I also tried to do the same thing as the first answer in this thread: how to submit query to .aspx page in python

However, the example code provided in the first answer does not work on my machine (Ubuntu 12.4 with Python 2.7). I am also not clear about which values should go there, since I am dealing with a different aspx website.

How can I use Python to start a search with certain criteria? (I am not sure this is the proper web terminology; maybe "submit a form"?)

I come from a C++ background and have not done any web development. I am also still learning Python. Any help is greatly appreciated.

First EDIT:
With great help from @Kabie, I collected the following code (trying to understand how it works):

import requests
from lxml import etree

URL = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearch.aspx'

#With get_fields(), we fetched all <input>s from the form.
def get_fields():
    res = requests.get(URL)
    if res.ok:
        page = etree.HTML(res.text)
        fields = page.xpath('//form[@id="Form1"]//input')
        return { e.attrib['name']: e.attrib.get('value', '') for e in fields }

#hard code some selects from the Form
def query(data):
    formdata = get_fields()
    formdata.update({
        'ctl00$MainContent$ddRecordsPerPage':'25',
    }) # Hardcode some <select> value
    formdata.update(data)
    res = requests.post(URL, formdata)
    if res.ok:
        page = etree.HTML(res.text)
        return page.xpath('//table[@id="MainContent_SearchControl_grdSearchResultsEntity"]//tr')


def search_by_entity_name(entity_name, entity_search_type='B'):
    return query({
        'ctl00$MainContent$CorpSearch':'rdoByEntityName',
        'ctl00$MainContent$txtEntityName': entity_name,
        'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
    })

result = search_by_entity_name('google')

I saved the above code in a script named query.py and got the following error:

Traceback (most recent call last):
  File "query.py", line 39, in <module>
    result = search_by_entity_name('google')
  File "query.py", line 36, in search_by_entity_name
    'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
  File "query.py", line 21, in query
    formdata.update({
AttributeError: 'NoneType' object has no attribute 'update'

It seems to me that the search is not successful? Why?

taocp
  • It means `soup.find(name='a', onclick="UniProt.analytics('DR-lines', 'click', 'DR-MGI');")` is returning `None` – karthikr Sep 17 '13 at 01:57
  • @karthikr Thanks for your reply. I updated the question a little bit. I understand a little better why the given code works. However, I don't know how to do similar things with a different website. Would you please point me to the right direction? – taocp Sep 17 '13 at 02:13
  • wow bounty for this question! – justhalf Sep 19 '13 at 02:31
  • Due to my very limited knowledge about web stuff, I think it is worthwhile to offer a bounty. Meanwhile, I have been trying to find a proper solution for a week, it is time to ask for help such that I can learn. – taocp Sep 19 '13 at 03:01
  • @icktoofay I tried that based on the second thread I mentioned, but no luck yet. – taocp Sep 19 '13 at 03:43
  • @taocp: Oops, I guess I didn't read the linked question. Sorry. – icktoofay Sep 19 '13 at 03:46
  • @icktoofay Nothing to be sorry for. Thanks for your comment anyway. – taocp Sep 19 '13 at 03:47

1 Answer


You can inspect the page to find all the fields that need to be posted. There is a nice tutorial for Chrome DevTools. Other tools such as Firebug for Firefox or Dragonfly for Opera also do the job, though I recommend DevTools.

After you post a query, the Network panel shows the form data that was actually sent. In this case:

__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:5UILUho/L3O0HOt9WrIfldHD4Ym6KBWkQYI1GgarbgHeAdzM9zyNbcH0PdP6xtKurlJKneju0/aAJxqKYjiIzo/7h7UhLrfsGul1Wq4T0+BroiT+Y4QVML66jsyaUNaM6KNOAK2CSzaphvSojEe1BV9JVGPYWIhvx0ddgfi7FXKIwdh682cgo4GHmilS7TWcbKxMoQvm9FgKY0NFp7HsggGvG/acqfGUJuw0KaYeWZy0pWKEy+Dntb4Y0TGwLqoJxFNQyOqvKVxnV1MJ0OZ4Nuxo5JHmkeknh4dpjJEwui01zK1WDuBHHsyOmE98t2YMQXXTcE7pnbbZaer2LSFNzCtrjzBmZT8xzCkKHYXI31BxPBEhALcSrbJ/QXeqA7Xrqn9UyCuTcN0Czy0ZRPd2wabNR3DgE+cCYF4KMGUjMUIP+No2nqCvsIAKmg8w6Il8OAEGJMAKA01MTMONKK4BH/OAzLMgH75AdGat2pvp1zHVG6wyA4SqumIH//TqJWFh5+MwNyZxN2zZQ5dBfs3b0hVhq0cL3tvumTfb4lr/xpL3rOvaRiatU+sQqgLUn0/RzeKNefjS3pCwUo8CTbTKaSW1IpWPgP/qmCsuIovXz82EkczLiwhEZsBp3SVdQMqtAVcYJzrcHs0x4jcTAWYZUejvtMXxolAnGLdl/0NJeMgz4WB9tTMeETMJAjKHp2YNhHtFS9/C1o+Hxyex32QxIRKHSBlJ37aisZLxYmxs69squmUlcsHheyI5YMfm0SnS0FwES5JqWGm2f5Bh+1G9fFWmGf2QeA6cX/hdiRTZ7VnuFGrdrJVdbteWwaYQuPdekms2YVapwuoNzkS/A+un14rix4bBULMdzij25BkXpDhm3atovNHzETdvz5FsXjKnPlno0gH7la/tkM8iOdQwqbeh7sG+/wKPqPmUk0Cl0kCHNvMCZhrcgQgpIOOgvI2Fp+PoB7mPdb80T2sTJLlV7Oe2ZqMWsYxphsHMXVlXXeju3kWfpY+Ed/D8VGWniE/eoBhhqyOC2+gaWA2tcOyiDPDCoovazwKGWz5B+FN1OTep5VgoHDqoAm2wk1C3o0zJ9a9IuYoATWI1yd2ffQvx6uvZQXcMvTIbhbVJL+ki4yNRLfVjVnPrpUMjafsnjIw2KLYnR0rio8DWIJhpSm13iDj/KSfAjfk4TMSA6HjhhEBXIDN/ShQAHyrKeFVsXhtH5TXSecY6dxU+Xwk7iNn2dhTILa6S/Gmm06bB4nx5Zw8XhYIEI/eucPOAN3HagCp7KaSdzZvrnjbshmP8hJPhnFhlXdJ+OSYDWuThFUypthTxb5NXH3yQk1+50SN872TtQsKwzhJvSIJExMbpucnVmd+V2c680TD4gIcqWVHLIP3+arrePtg0YQiVTa1TNzNXemDyZzTUBecPynkRnIs0dFLSrz8c6HbIGCrLleWyoB7xicUg39pW7KTsIqWh7P0yOiHgGeHqrN95cRAYcQTOhA==
__SCROLLPOSITIONX:0
__SCROLLPOSITIONY:106
__VIEWSTATEENCRYPTED:
__EVENTVALIDATION:g2V3UVCVCwSFKN2X8P+O2SsBNGyKX00cyeXvPVmP5dZSjIwZephKx8278dZoeJsa1CkMIloC0D51U0i4Ai0xD6TrYCpKluZSRSphPZQtAq17ivJrqP1QDoxPfOhFvrMiMQZZKOea7Gi/pLDHx42wy20UdyzLHJOAmV02MZ2fzami616O0NpOY8GQz1S5IhEKizo+NZPb87FgC5XSZdXCiqqoChoflvt1nfhtXFGmbOQgIP8ud9lQ94w3w2qwKJ3bqN5nRXVf5S53G7Lt+Du78nefwJfKK92BSgtJSCMJ/m39ykr7EuMDjauo2KHIp2N5IVzGPdSsiOZH86EBzmYbEw==
ctl00$MainContent$hdnApplyMasterPageWitoutSidebar:0
ctl00$MainContent$hdn1:0
ctl00$MainContent$CorpSearch:rdoByEntityName
ctl00$MainContent$txtEntityName:GO
ctl00$MainContent$ddBeginsWithEntityName:M
ctl00$MainContent$ddBeginsWithIndividual:B
ctl00$MainContent$txtFirstName:
ctl00$MainContent$txtMiddleName:
ctl00$MainContent$txtLastName:
ctl00$MainContent$txtIdentificationNumber:
ctl00$MainContent$txtFilingNumber:
ctl00$MainContent$ddRecordsPerPage:25
ctl00$MainContent$btnSearch:Search Corporations
ctl00$MainContent$hdnW:1920
ctl00$MainContent$hdnH:1053
ctl00$MainContent$SearchControl$hdnRecordsPerPage:

What I posted is a "begin with" search for 'GO'. This site is built with WebForms, which is why there are these long __VIEWSTATE and __EVENTVALIDATION fields. We need to send them as well.

Now we are ready to make the query. First we need to get a blank form. The following code is written in Python 3.3, though I think it should still work on 2.x.
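If you want to compare what the browser sent with what your script sends, you can paste the name:value lines from the Network panel into a string and load them into a dict. This is only a rough sketch: parse_devtools_form_data is a helper name I made up, and the sample is a shortened copy of the dump above.

```python
def parse_devtools_form_data(text):
    """Turn 'name:value' lines copied from the Network panel into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only; the field names here contain '$' but no ':'.
        name, _, value = line.partition(':')
        fields[name] = value
    return fields

# Shortened sample taken from the form data shown above.
sample = """\
__EVENTTARGET:
ctl00$MainContent$CorpSearch:rdoByEntityName
ctl00$MainContent$txtEntityName:GO
ctl00$MainContent$ddRecordsPerPage:25
ctl00$MainContent$btnSearch:Search Corporations
"""
captured = parse_devtools_form_data(sample)
```

Diffing the keys of this dict against the dict your script posts is a quick way to spot a field you forgot.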

import requests
from lxml import etree

URL = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearch.aspx'

def get_fields():
    res = requests.get(URL)
    if res.ok:
        page = etree.HTML(res.text)
        fields = page.xpath('//form[@id="Form1"]//input')
        return { e.attrib['name']: e.attrib.get('value', '') for e in fields }

With get_fields(), we fetch all <input>s from the form. Note that there are also <select>s; I will just hardcode them.
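If you would rather not hardcode the <select> values, something along these lines could read the defaults from the page instead. Treat it as a sketch: get_select_defaults is a hypothetical helper, and the sample HTML is invented for illustration (the option labels and the "100" value are my guesses, not copied from the real page).

```python
from lxml import etree

def get_select_defaults(html):
    """Return {name: value} for every <select>, using its pre-selected option."""
    page = etree.HTML(html)
    defaults = {}
    for select in page.xpath('//select[@name]'):
        # Prefer the option marked selected; otherwise fall back to the first
        # one, which is what a browser submits for a single-choice <select>.
        options = select.xpath('.//option[@selected]') or select.xpath('.//option')
        if options:
            option = options[0]
            # An <option> without a value attribute submits its text instead.
            defaults[select.attrib['name']] = option.attrib.get('value', option.text or '')
    return defaults

# Invented sample markup, only to show the shape of the result.
sample_html = """
<form id="Form1">
  <select name="ctl00$MainContent$ddBeginsWithEntityName">
    <option value="B" selected="selected">Begins With</option>
    <option value="M">Exact Match</option>
  </select>
  <select name="ctl00$MainContent$ddRecordsPerPage">
    <option value="25">25</option>
    <option value="100">100</option>
  </select>
</form>
"""
select_defaults = get_select_defaults(sample_html)
```

You could merge the returned dict into the form data before applying your own overrides.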

def query(data):
    formdata = get_fields()
    formdata.update({
        'ctl00$MainContent$ddRecordsPerPage':'25',
    }) # Hardcode some <select> value
    formdata.update(data)
    res = requests.post(URL, formdata)
    if res.ok:
        page = etree.HTML(res.text)
        return page.xpath('//table[@id="MainContent_SearchControl_grdSearchResultsEntity"]//tr')

Now we have a generic query function, so let's make a wrapper for a specific search.

def search_by_entity_name(entity_name, entity_search_type='B'):
    return query({
        'ctl00$MainContent$CorpSearch':'rdoByEntityName',
        'ctl00$MainContent$txtEntityName': entity_name,
        'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
    })

This particular site uses a group of radio buttons to decide which fields are used, so 'ctl00$MainContent$CorpSearch':'rdoByEntityName' is required here. You can write similar wrappers, such as search_by_individual_name, yourself.

Sometimes a website needs more information to verify the query. In that case you can add custom headers such as Origin, Referer, and User-Agent to mimic a browser.

And if the website uses JavaScript to generate its forms, you need more than requests. PhantomJS is a good tool for scripting a browser. If you want to do this in Python, you can use PyQt with QtWebKit.
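The query function returns lxml <tr> elements, not text. A small helper like the one below could turn them into plain lists of strings. It is only a sketch: rows_to_text is a name I made up, and the sample table (column names and the row) is invented so the example runs without hitting the site.

```python
from lxml import etree

def rows_to_text(rows):
    """Convert <tr> elements into lists of stripped cell strings."""
    table = []
    for row in rows:
        cells = row.xpath('./th|./td')
        # itertext() also collects text from nested tags such as <a> in a cell.
        table.append([''.join(cell.itertext()).strip() for cell in cells])
    return table

# Invented sample; the real result table will have different columns.
sample_html = """
<table id="MainContent_SearchControl_grdSearchResultsEntity">
  <tr><th>Entity Name</th><th>ID Number</th></tr>
  <tr><td><a href="#">EXAMPLE CORP</a></td><td>000000001</td></tr>
</table>
"""
page = etree.HTML(sample_html)
rows = page.xpath('//table[@id="MainContent_SearchControl_grdSearchResultsEntity"]//tr')
parsed = rows_to_text(rows)
```

With the real site you would pass the list returned by the search wrapper straight into rows_to_text.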

Update: It seems the website started blocking our Python script yesterday, so we have to pose as a browser. As I mentioned above, we can add custom headers. Let's first add a User-Agent field to the headers and see what happens.

res = requests.get(URL, headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
})

And now... res.ok returns True!

So we just need to add this header to both calls: res = requests.get(URL) in get_fields() and res = requests.post(URL, formdata) in query(). Just in case, also add 'Referer':URL to the headers of the latter:

res = requests.post(URL, formdata, headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
    'Referer':URL,
})
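One possible way to avoid repeating the headers, and to avoid the silent None that leads to the AttributeError when a request is blocked, is to use a requests.Session and raise on a bad status. This is a sketch under my own assumptions: make_session and checked are hypothetical helper names, and I have not verified that the site accepts exactly these headers today.

```python
import requests

URL = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearch.aspx'

BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
}

def make_session():
    """Session that sends the browser-like headers on every request."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    session.headers['Referer'] = URL
    return session

def checked(res):
    """Return res unchanged, or raise requests.HTTPError for a 4xx/5xx status."""
    res.raise_for_status()
    return res

session = make_session()
# Usage (needs network access, so it is left commented out here):
# res = checked(session.get(URL))
# res = checked(session.post(URL, data=formdata))
```

With this, a blocked request fails right where it happens with an HTTPError, instead of get_fields() returning None and the error showing up later in query().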
Kabie
  • Thanks a lot for your detailed reply. I will try it tonight once I get home. – taocp Sep 19 '13 at 20:07
  • I met some issues when I tried your code, could you please help? I updated my question. thanks! – taocp Sep 20 '13 at 01:55
  • @taocp: It seems `res = requests.get(URL)` in `get_fields` failed. Maybe it got blocked. I'm updating the answer. – Kabie Sep 20 '13 at 03:03
  • I checked res.text, it says page not found, kind of strange. Thanks for your updates. – taocp Sep 20 '13 at 03:05
  • Yeah, it works. The next step is to pick the company in the returned table and go to the link. That's a separate question. Again, thank you very much! – taocp Sep 20 '13 at 03:28
  • I have a question, using your code, I should be directed to another page that contains the search result, right? However, what I got is the starting page :http://corp.sec.state.ma.us/corpweb/corpsearch/CorpSearch.aspx. When I used the search_by_entity function with given entity name, it should lead me to the result page, right? Like when I put a word in google and click search, it should give me a list of search results, but I did not get it, why? Would you please help? – taocp Sep 21 '13 at 02:01
  • For example, if you try `Google`, you should see a list of entities that starts with `Google` on a result page, but I did not get that. Sorry for duplicated comments, I was trying to make myself clear. – taocp Sep 21 '13 at 02:03
  • @taocp: It should return the result as in a browser. That depend on the websites. In this example, it will redirect to the form if no match is found, with a `span.ErrorMessage` shows `* No records found; try a new search using different criteria`. So if you search with exact match (`entity_search_type='M'`), there is a high chance found nothing. – Kabie Sep 21 '13 at 02:51
  • OK. But if I type `Google` on the entity name box, it does return a list of companies. Why there is a difference using the script and doing that manually? – taocp Sep 21 '13 at 04:00
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/37765/discussion-between-kabie-and-taocp) – Kabie Sep 21 '13 at 04:03