
I understand in general how to make a POST request using urllib2 (encoding the data, etc.), but the problem is that all the tutorials online use made-up example URLs (someserver.com, coolsite.org, etc.) to show how to do it, so I can't see the specific HTML that corresponds to their example code. Even python.org's own tutorial is useless in this regard.

I need to make a POST request to this url:

https://patentscope.wipo.int/search/en/search.jsf

The relevant part of the code is this (I think):

<form id="simpleSearchSearchForm" name="simpleSearchSearchForm" method="post" action="/search/en/search.jsf" enctype="application/x-www-form-urlencoded" style="display:inline">
<input type="hidden" name="simpleSearchSearchForm" value="simpleSearchSearchForm" />
<div class="rf-p " id="simpleSearchSearchForm:sSearchPanel" style="text-align:left;z-index:-1;"><div class="rf-p-hdr " id="simpleSearchSearchForm:sSearchPanel_header">

Or maybe it's this:

<input id="simpleSearchSearchForm:fpSearch" type="text" name="simpleSearchSearchForm:fpSearch" class="formInput" dir="ltr" style="width: 400px; height: 15px; text-align: left; background-image: url(&quot;https://patentscope.wipo.int/search/org.richfaces.resources/javax.faces.resource/org.richfaces.staticResource/4.5.5.Final/PackedCompressed/classic/org.richfaces.images/inputBackgroundImage.png&quot;); background-position: 1px 1px; background-repeat: no-repeat;">

If I want to send JP2014084003 as the search term, which part of the HTML do I use as the corresponding key: the input's id, name, or value?

Addendum: this answer does not answer my question, because it just repeats the information I've already looked at in the Python docs.

UPDATE:

I found this, and tried out the code in there, specifically:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
payload = {'name':'simpleSearchSearchForm:fpSearch','value':'2014084003'}
link    = 'https://patentscope.wipo.int/search/en/search.jsf'
session = requests.Session()
resp    = session.get(link,headers=headers)
cookies = requests.utils.cookiejar_from_dict(requests.utils.dict_from_cookiejar(session.cookies))
resp    = session.post(link,headers=headers,data=payload,cookies =cookies)

r = session.get(link)

f = open('htmltext.txt','w')

f.write(r.content)

f.close()

I get a successful response (200), but the data, once again, is simply the original page. So I don't know whether I'm posting to the form correctly and there's something else I need to do to get the search results page back, or whether I'm still posting the data wrong.

And yes, I realize that this uses requests instead of urllib2, but all I want to be able to do is get the data.

Marc Adler
  • The search is sent as form data, not as part of the URL. The key for the query text is `simpleSearchSearchForm:fpSearch`. Note that you can see this using the developer tools in your browser when you submit a search manually. – jonrsharpe May 15 '16 at 17:43
  • Okay. What does that mean? By which I mean: what should I do to post the data and retrieve the page I need? – Marc Adler May 15 '16 at 17:44
  • https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Forms/Sending_and_retrieving_form_data – jonrsharpe May 15 '16 at 17:45
  • I read the page, but it doesn't contain the information I need. – Marc Adler May 15 '16 at 17:48
  • Then head to google and find somewhere that does - in the tutorial you've already apparently read, see https://docs.python.org/2/howto/urllib2.html#data – jonrsharpe May 15 '16 at 17:49
  • Shall I repeat my question? I don't know which parts of the HTML correspond to the values I need to send. Made-up examples using invented code from nonexistent websites are no help whatsoever. – Marc Adler May 15 '16 at 18:02
  • @MarcAdler: If I understand your question, in an HTML form the values of the "name" and "value" attributes are what you want. For your HTML above, the POST data (assuming %-encoding) will be sent as: `simpleSearchSearchForm:fpSearch=TheQueryYouTypedIntoTextbox&simpleSearchSearchForm=simpleSearchSearchForm`. Does that answer your question? (I am guessing `TheQueryYouTypedIntoTextbox` corresponds to `JP2014084003`) – UltraInstinct May 15 '16 at 18:38
  • It might. Are you saying that I need to do this? `url = 'https://patentscope.wipo.int/search/en/search.jsf' values = {'name' : 'JP2014084003', 'value' : 'simpleSearchSearchForm'}` – Marc Adler May 15 '16 at 18:42
  • No, that didn't work. It just returned the exact same page. If I reverse them (`name` = `simpleSearchSearchForm` etc.) then I do get a different result, but the info is all `NUL` and then only a tiny bit of html at the end. Either way, not what I need. – Marc Adler May 15 '16 at 18:49
  • Try `payload = {'simpleSearchSearchForm:fpSearch':'2014084003'}`; name refers to the key and value to the value associated with that key, so you don't include those words literally in the payload (see the sketch after these comments). – jonrsharpe May 15 '16 at 21:32
  • Thanks for that. That doesn't work, but that gives me a big clue about what I wasn't understanding about how the keys relate to the values. – Marc Adler May 16 '16 at 02:34
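
As the comments explain, the payload key is the input's name attribute and the payload value is the text you would have typed into the box; the literal strings 'name' and 'value' never appear. A minimal sketch of that mapping with requests (this alone still returns the original page, because the JSF ViewState token covered in the answer below is missing):

import requests

url = 'https://patentscope.wipo.int/search/en/search.jsf'
headers = {'User-Agent': 'Mozilla/5.0'}

# Keys are the inputs' name attributes from the HTML above; values are what
# you would have typed in (or, for the hidden input, its value attribute).
payload = {
    'simpleSearchSearchForm': 'simpleSearchSearchForm',  # hidden input
    'simpleSearchSearchForm:fpSearch': 'JP2014084003',   # the search box
}

resp = requests.post(url, headers=headers, data=payload)
print(resp.status_code)  # 200, but still the original page without the ViewState token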

1 Answer


This is not the most straightforward POST request. If you look in the developer tools or Firebug, you can see the form data from a successful browser POST:

[screenshot: the form data of a successful browser POST, as shown in the browser's developer tools]

All of that is pretty straightforward, bar the fact that you see some : embedded in the keys, which may be a bit confusing: simpleSearchSearchForm:commandSimpleFPSearch is the key and Search is its value.
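
To see what those keys look like on the wire: the colons are just ordinary characters in the JSF field names, and they get percent-encoded along with everything else when the form body is built. A quick standard-library check (field names taken from the form data above):

from urllib.parse import urlencode  # on Python 2: from urllib import urlencode

# The colons in the keys are nothing special; they come out as %3A in the body.
body = urlencode({
    'simpleSearchSearchForm:fpSearch': 'automata',
    'simpleSearchSearchForm:commandSimpleFPSearch': 'Search',
})
print(body)
# simpleSearchSearchForm%3AfpSearch=automata&simpleSearchSearchForm%3AcommandSimpleFPSearch=Search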

The only thing that you cannot hard-code is javax.faces.ViewState; we need to make a request to the site and then parse that value, which we can do with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = "https://patentscope.wipo.int/search/en/search.jsf"

data = {"simpleSearchSearchForm": "simpleSearchSearchForm",
        "simpleSearchSearchForm:j_idt341": "EN_ALLTXT",
        "simpleSearchSearchForm:fpSearch": "automata",
        "simpleSearchSearchForm:commandSimpleFPSearch": "Search",
        "simpleSearchSearchForm:j_idt406": "workaround"}
head = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

with requests.Session() as s:
    # Get the cookies and the source to parse the Viewstate token
    init = s.get(url)
    soup = BeautifulSoup(init.text, "lxml")
    val = soup.select_one("#j_id1:javax.faces.ViewState:0")["value"]
    # update post data dict
    data["javax.faces.ViewState"] = val
    r = s.post(url, data=data, headers=head)
    print(r.text)

If we run the code above:

In [13]: import requests

In [14]: from bs4 import BeautifulSoup

In [15]: url = "https://patentscope.wipo.int/search/en/search.jsf"

In [16]: data = {"simpleSearchSearchForm": "simpleSearchSearchForm",
   ....:         "simpleSearchSearchForm:j_idt341": "EN_ALLTXT",
   ....:         "simpleSearchSearchForm:fpSearch": "automata",
   ....:         "simpleSearchSearchForm:commandSimpleFPSearch": "Search",
   ....:         "simpleSearchSearchForm:j_idt406": "workaround"}

In [17]: head = {
   ....:     "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

In [18]: with requests.Session() as s:
   ....:         init = s.get(url)
   ....:         soup = BeautifulSoup(init.text, "lxml")
   ....:         val = soup.select_one("#j_id1:javax.faces.ViewState:0")["value"]
   ....:         data["javax.faces.ViewState"] = val
   ....:         r = s.post(url, data=data, headers=head)
   ....:         print("\n".join([s.text.strip() for s in BeautifulSoup(r.text,"lxml").select("span.trans-section")]))
   ....:     

Fuzzy genetic learning automata classifier
Fuzzy genetic learning automata classifier
FINITE AUTOMATA MANAGER
CELLULAR AUTOMATA MUSIC GENERATOR
CELLULAR AUTOMATA MUSIC GENERATOR
ANALOG LOGIC AUTOMATA
Incremental automata verification
Cellular automata music generator
Analog logic automata
Symbolic finite automata

You will see it matches the webpage. If you want to scrape sites, you need to get familiar with the developer tools/Firebug etc. to watch how the requests are made, and then try to mimic them. To open Firebug, right-click on the page and select Inspect Element, then click the Network tab and submit your request. You just have to select the request from the list, then select whichever tab you want info from, i.e. Params for our POST request:

[screenshot: the Network tab showing the parameters of the POST request]

You may also find this answer useful on how to approach posting to a site.
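
Finally, to tie this back to the question: searching for JP2014084003 instead of "automata" only means changing the fpSearch value, and the result page can then be written to a file the way the question's snippet attempted. A small self-contained variation on the code above (same field names and ViewState handling):

import requests
from bs4 import BeautifulSoup

url = "https://patentscope.wipo.int/search/en/search.jsf"
head = {"User-Agent": "Mozilla/5.0"}
data = {"simpleSearchSearchForm": "simpleSearchSearchForm",
        "simpleSearchSearchForm:j_idt341": "EN_ALLTXT",
        "simpleSearchSearchForm:fpSearch": "JP2014084003",  # the number from the question
        "simpleSearchSearchForm:commandSimpleFPSearch": "Search",
        "simpleSearchSearchForm:j_idt406": "workaround"}

with requests.Session() as s:
    # Get the cookies and parse the ViewState token, exactly as above
    init = s.get(url, headers=head)
    soup = BeautifulSoup(init.text, "lxml")
    data["javax.faces.ViewState"] = soup.select_one("#j_id1:javax.faces.ViewState:0")["value"]
    r = s.post(url, data=data, headers=head)

# Save the result page, as the question's snippet tried to do
# (use "wb" and r.content instead if you prefer to write raw bytes).
with open("htmltext.txt", "w") as f:
    f.write(r.text)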

Padraic Cunningham