I have been trying to search a website and collect the URLs of the articles it returns, but I have run into problems I don't understand. I have read about web forms, and I believe my problem arises because I am requesting a prebuilt link to the search results rather than generating the results by submitting the web form.
There are several similar questions on here already, but none that I've found deals with how to deconstruct the HTML and what steps are needed to find and submit a web form.
When researching how to submit web forms, I found out about Selenium, but I can't find working Python 3 examples or good documentation for it. In another SO question I found a CodeProject link whose code (featured below) has gotten me the furthest so far: https://www.codeproject.com/Articles/873060/Python-Search-Youtube-for-Video
That said, I don't yet understand why it works or what variables I will need to change in order to harvest results from another website. Namely, the website I'm looking to search is: https://globenewswire.com/Search
So my question is this: whether by web form submission or by proper URL formation, how can I obtain the search results HTML?
Here is the code I had been using to build the search URL:
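name = input()  # search phrase typed by the user
name = name.replace(' ', '%20')  # escapes spaces only; other reserved characters are left as-is
url = 'https://globenewswire.com/Search/NewsSearch?keyword=' + name + '#'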
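For comparison, here is a minimal sketch of building the same URL with urllib.parse.urlencode, which escapes every reserved character rather than only spaces. The helper name build_search_url is hypothetical, and I have not verified that the site serves results for this URL without the form being submitted.

```python
from urllib.parse import urlencode

BASE_URL = "https://globenewswire.com/Search/NewsSearch"


def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded."""
    # urlencode writes spaces as '+' and also escapes '&', '#', '=' and so on.
    return BASE_URL + "?" + urlencode({"keyword": keyword})


print(build_search_url("Abeona Therapeutics"))
```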
Here is the code featured on the CodeProject link:
import urllib.request
import urllib.parse
import re
query_string = urllib.parse.urlencode({"search_query" : input()})
html_content = urllib.request.urlopen("http://www.youtube.com/results?" + query_string)
search_results = re.findall(r'href=\"\/watch\?v=(.{11})', html_content.read().decode())
print("http://www.youtube.com/watch?v=" + search_results[0])
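As far as I can tell, that snippet does two things: urlencode builds the query string, and re.findall pulls the 11-character video IDs out of the returned HTML. Here is a sketch of both steps without any network access; the sample markup and IDs are invented for illustration and are not real YouTube output.

```python
import re
from urllib.parse import urlencode

# Step 1: turn a dict of form fields into a query string.
query_string = urlencode({"search_query": "python tutorial"})

# Step 2: capture the 11 characters that follow 'watch?v=' in each link.
sample_html = (
    '<a href="/watch?v=abcdefghijk">One</a>'
    '<a href="/watch?v=ABCDEFGHIJK">Two</a>'
)
video_ids = re.findall(r'href="/watch\?v=(.{11})', sample_html)

print(query_string)
print(video_ids)
```

If that reading is right, the parts I would need to change for another site are the field name in the dict and the regular expression.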
Edit:
After capturing my request with Chrome's dev tools, I now have the response headers and the following curl command:
curl "https://globenewswire.com/Search" -H "cookie:somecookie" -H "Origin: https://globenewswire.com" -H "Accept-Encoding: gzip, deflate, br" -H "Accept-Language: en-US,en;q=0.9" -H "Upgrade-Insecure-Requests: 1" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" -H "Cache-Control: max-age=0" -H "Referer: https://globenewswire.com/Search" -H "Connection: keep-alive" -H "DNT: 1" --data "__RequestVerificationToken=xY^%^2BkRoeEL8DswUTDlUVEWUCSxnRzX5Ax2Z^%^2FNCTa0lBNfqOFaU2eb^%^2FTD8XqENnf8d2Ghtm1taW8Cu0BvWrC1dh^%^2BdKZVgHyC6HM0EEm7mupQe1UZ7pHrF9GhnpwwcXR0dyJ^%^2B91Ng^%^3D^%^3D^&quicksearch-textbox=Abeona+Therapeutics" --compressed
As well as the request headers:
POST /Search HTTP/1.1
Host: globenewswire.com
Connection: keep-alive
Content-Length: 217
Cache-Control: max-age=0
Origin: https://globenewswire.com
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
DNT: 1
Referer: https://globenewswire.com/Search
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
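Based on the captured request, my guess is that the form posts two fields: a hidden __RequestVerificationToken and quicksearch-textbox. Below is a rough sketch of reproducing that with the standard library. The helper names are hypothetical. It assumes the token appears in the search page as a hidden input named __RequestVerificationToken, and that the site expects the cookie set by the first GET. I have confirmed neither.

```python
import re
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

SEARCH_URL = "https://globenewswire.com/Search"


def extract_token(html):
    """Return the hidden token value, or None if it is not in the page."""
    # Assumption: the name attribute comes before value in the input tag.
    match = re.search(
        r'name="__RequestVerificationToken"[^>]*value="([^"]+)"', html)
    return match.group(1) if match else None


def build_form_body(token, query):
    """URL-encode the two form fields seen in the captured request."""
    fields = {"__RequestVerificationToken": token,
              "quicksearch-textbox": query}
    return urllib.parse.urlencode(fields).encode("ascii")


def search(query):
    """GET the form page, then POST the search (makes network calls)."""
    # The opener keeps cookies between the GET and the POST.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    page = opener.open(SEARCH_URL).read().decode("utf-8", "replace")
    token = extract_token(page)
    if token is None:
        raise ValueError("token not found in page")
    # Passing data makes this a POST with a form-urlencoded content type.
    request = urllib.request.Request(
        SEARCH_URL, data=build_form_body(token, query),
        headers={"Referer": SEARCH_URL})
    return opener.open(request).read().decode("utf-8", "replace")
```

If those assumptions hold, calling search("Abeona Therapeutics") should return the results HTML, which could then be searched for article links.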