I am trying to scrape the content of this website with rvest (not the linked papers/abstracts, just the number, title, authors, etc.).
By default, the page displays only the 2016 papers, and scraping the 2016 data was no problem. I was hoping the URL would change after switching "2016" to "all years", but it stays the same, so I resorted to html_form. Inspecting the "resources" of the webpage, I found that the relevant input name is filteryear.
R code:
library(rvest)
rdc <- html_session("https://sfb649.wiwi.hu-berlin.de/fedc/discussionPapers_formular_content.php")
form <- html_form(rdc)
form <- set_values(form, filteryear = "all years")
#Error: Unknown field names: filteryear
So apparently, filteryear is not part of the parsed form. With my limited HTML knowledge, I am fairly sure the markup below shows that the form consists of three inputs: filterTypeName, filterName and filteryear.
HTML from resource:
<form action='discussionPapers_formular_content.php' method='post'>
<select name='filterTypeName'>
<option value='AUTHORS'>Author</option>
<option value='PROJECT'>Project Code</option>
...
<option value='JEL'>JEL</option
</select> </td> # Is this </td> the problem?!
<td valign='baseline'>
<input type='text' size='35' name='filterName' >
</td>
<td valign='baseline'>
<select name='filteryear'>
<option value='2005'>2005</option>
...
<option value='2016'>2016</option>
<option value='all'>all years</option>
</select>
</td>
<td valign='baseline'>
<INPUT type='submit' value='Search' name='B1'></INPUT>
</td></tr>
</form>
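One way to check which fields html_form actually picked up is to compare the parsed field names with the select names found in the markup. The following is only a minimal sketch: it runs against a hypothetical, cleaned-up copy of the form (not the live page), it assumes the form of interest is the first one returned, and the variable names are mine. Note that html_form returns a list of forms, so a single form has to be extracted before its fields can be inspected.

```r
library(rvest)

# Hypothetical, simplified and well-formed copy of the form markup
# (an assumption for illustration; the live page may parse differently).
html <- "<html><body>
<form action='discussionPapers_formular_content.php' method='post'>
  <select name='filterTypeName'>
    <option value='AUTHORS'>Author</option>
    <option value='JEL'>JEL</option>
  </select>
  <input type='text' size='35' name='filterName'>
  <select name='filteryear'>
    <option value='2016'>2016</option>
    <option value='all'>all years</option>
  </select>
  <input type='submit' value='Search' name='B1'>
</form>
</body></html>"

page <- read_html(html)

# html_form() returns a list with one element per <form> on the page
forms <- html_form(page)
form <- forms[[1]]  # assumption: the search form is the first one

# Field names that rvest attached to this form
parsed_names <- names(form$fields)

# Names of all <select> elements present anywhere in the document
select_names <- html_attr(html_nodes(page, "select"), "name")

# Selects that exist in the markup but were not attached to the form
missing_from_form <- setdiff(select_names, parsed_names)

print(parsed_names)
print(missing_from_form)
```

If the same comparison on the real page lists filteryear as missing, the select was parsed outside the form element, which would point to the surrounding table markup rather than to the field name.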
Why is html_form not recognising this form completely? And, more importantly, is there a way to solve this problem?