
I have been trying to search a website and collect the URLs of the articles it returns, but I have run into problems I don't understand. From what I have read about web forms, I believe my problem arises because I am requesting a prebuilt link to the search results rather than generating the results by submitting the web form.

There are several similar questions on here already, but none that I've found deals with how to deconstruct the HTML and what steps are needed to find and submit a web form.

When researching how to submit web forms, I found out about Selenium, but I can't find working Python 3 examples of it, nor good documentation. In another SO question I found a CodeProject link which has gotten me the most progress so far (code featured below): https://www.codeproject.com/Articles/873060/Python-Search-Youtube-for-Video

That said, I don't yet understand why it works or what variables I will need to change in order to harvest results from another website. Namely, the website I'm looking to search is: https://globenewswire.com/Search

So my question is this: whether by web form submission or by proper URL formation, how can I obtain the HTML of the search results?

Here is the code I had been using to build the search-results URL:

name=input()
name=name.replace(' ','%20')
url='https://globenewswire.com/Search/NewsSearch?keyword='+name+'#'
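For comparison, here is a minimal sketch of building the same URL with `urllib.parse.urlencode`, which escapes spaces and other reserved characters instead of relying on a manual `replace`. The `keyword` parameter name is taken from the URL above; `build_search_url` is a hypothetical helper name, and whether the site returns results for a plain GET to this address is an assumption that is not confirmed here.

```python
from urllib.parse import urlencode

BASE_URL = "https://globenewswire.com/Search/NewsSearch"


def build_search_url(keyword):
    """Return the search URL with the keyword safely percent-encoded."""
    # urlencode escapes spaces (as '+'), '&', '#', and non-ASCII characters.
    return BASE_URL + "?" + urlencode({"keyword": keyword})


print(build_search_url("Abeona Therapeutics"))
# https://globenewswire.com/Search/NewsSearch?keyword=Abeona+Therapeutics
```

The trailing `#` from the hand-built URL is dropped here because a fragment is not sent to the server anyway.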

Here is the code featured in the CodeProject link:

import urllib.request
import urllib.parse
import re

query_string = urllib.parse.urlencode({"search_query" : input()})
html_content = urllib.request.urlopen("http://www.youtube.com/results?" + query_string)
search_results = re.findall(r'href=\"\/watch\?v=(.{11})', html_content.read().decode())
print("http://www.youtube.com/watch?v=" + search_results[0])
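To see what each step of that snippet does without touching the network, the two key calls can be run on their own. This is only an illustrative sketch: `sample_html` is a made-up stand-in for the downloaded page, and the fixed search term replaces `input()`.

```python
import re
from urllib.parse import urlencode

# Step 1: urlencode turns a dict of parameters into a query string.
# The dict key ("search_query") is the parameter name the site expects.
query_string = urlencode({"search_query": "python tutorial"})
print(query_string)  # search_query=python+tutorial

# Step 2: the regex captures the 11 characters that follow 'href="/watch?v='.
# sample_html is a made-up stand-in for the page that urlopen would return.
sample_html = (
    '<a href="/watch?v=abcdefghijk">one</a> '
    '<a href="/watch?v=ABCDEFGHIJK">two</a>'
)
video_ids = re.findall(r'href="/watch\?v=(.{11})', sample_html)
print(video_ids)  # ['abcdefghijk', 'ABCDEFGHIJK']
```

So the parts that would presumably have to change for another site are the base URL, the parameter name in the dict, and the regular expression that picks links out of the returned HTML.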

Edit:

After capturing my request with Chrome's dev tools, I now have the response headers and the following curl command:

curl "https://globenewswire.com/Search" -H "cookie:somecookie" -H "Origin: https://globenewswire.com" -H "Accept-Encoding: gzip, deflate, br" -H "Accept-Language: en-US,en;q=0.9" -H "Upgrade-Insecure-Requests: 1" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" -H "Cache-Control: max-age=0" -H "Referer: https://globenewswire.com/Search" -H "Connection: keep-alive" -H "DNT: 1" --data "__RequestVerificationToken=xY^%^2BkRoeEL8DswUTDlUVEWUCSxnRzX5Ax2Z^%^2FNCTa0lBNfqOFaU2eb^%^2FTD8XqENnf8d2Ghtm1taW8Cu0BvWrC1dh^%^2BdKZVgHyC6HM0EEm7mupQe1UZ7pHrF9GhnpwwcXR0dyJ^%^2B91Ng^%^3D^%^3D^&quicksearch-textbox=Abeona+Therapeutics" --compressed

As well as the request headers:

POST /Search HTTP/1.1
Host: globenewswire.com
Connection: keep-alive
Content-Length: 217
Cache-Control: max-age=0
Origin: https://globenewswire.com
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
DNT: 1
Referer: https://globenewswire.com/Search
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
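A minimal sketch of how the captured request might be rebuilt with `urllib`, assuming the two form fields in the curl `--data` string are all the server needs. `build_search_request` is a hypothetical helper, the token value is a placeholder, and the set of headers kept is a guess. The request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

SEARCH_URL = "https://globenewswire.com/Search"


def build_search_request(token, query):
    """Build (but do not send) a POST request mirroring the captured one."""
    form_fields = {
        "__RequestVerificationToken": token,
        "quicksearch-textbox": query,
    }
    # POST bodies must be bytes; urlencode produces the
    # application/x-www-form-urlencoded format seen in the capture.
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Referer": SEARCH_URL,
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(
        SEARCH_URL, data=body, headers=headers, method="POST"
    )


# PLACEHOLDER_TOKEN stands in for a real token taken from the search page.
request = build_search_request("PLACEHOLDER_TOKEN", "Abeona Therapeutics")
print(request.get_method(), request.full_url)
print(request.data)
```

Sending it would be `urllib.request.urlopen(request)`. The token should be passed in its decoded form (with `+`, `/`, and `=` rather than `%2B`, `%2F`, and `%3D`), since `urlencode` does the escaping itself.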
suscat
  • Try capturing the exact request your browser makes (e.g. in Chrome you can use dev tools for that), then replicate it. – Norrius Feb 20 '18 at 19:14
  • Thanks for your input. You've revealed a lot to me that I'd never seen before. I've dug through the dev tools and I think I've found it, though I'm not sure how to replicate it. Would you mind taking a look at my edit and seeing if I'm on the right track? – suscat Feb 20 '18 at 20:43
  • Looks good (I'd be cautious posting your actual cookies though!), now you need to `POST` your form data and you should have the results. The headers might be unnecessary, but some websites check them to see that you're not a bot (ha!). While searching for posting in `urllib` I actually found [an answer](https://stackoverflow.com/a/36485152/1983772) that does pretty much what you want, check that out. – Norrius Feb 20 '18 at 20:51
  • Thanks my dude, you've really given me a lot to go off of. Now all I've got to do is get around this 500 error. I'm guessing it's because I'm not handling the `__RequestVerificationToken` properly. – suscat Feb 20 '18 at 21:47
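
On the `__RequestVerificationToken` guess in the last comment: one possible approach is to read a fresh token out of the search page's HTML before each POST instead of reusing one copied from the browser. The sketch below assumes the token is delivered as an `<input>` whose `name` is `__RequestVerificationToken`, which the captured form data suggests but does not prove. `TokenFinder`, `extract_token`, and `sample_page` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser

TOKEN_FIELD = "__RequestVerificationToken"


class TokenFinder(HTMLParser):
    """Remember the value of the first <input> named TOKEN_FIELD."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_token_input = tag == "input" and attributes.get("name") == TOKEN_FIELD
        if is_token_input and self.token is None:
            self.token = attributes.get("value")


def extract_token(html_text):
    """Return the token value found in html_text, or None if absent."""
    finder = TokenFinder()
    finder.feed(html_text)
    return finder.token


# Made-up stand-in for the HTML of the search page.
sample_page = (
    '<form action="/Search" method="post">'
    '<input name="__RequestVerificationToken" type="hidden" value="abc123+/==" />'
    '<input name="quicksearch-textbox" type="text" />'
    "</form>"
)
print(extract_token(sample_page))  # abc123+/==
```

In a live run, the page would probably need to be fetched and the form posted through the same opener, for example `urllib.request.build_opener(urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()))`, so that any cookie set alongside the token is sent back with the POST. A token that does not match the session cookie is one plausible cause of the 500 error, though that is not confirmed here.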

0 Answers