I want to count words in HTML articles using R. Scraping data like titles works nicely, and I was able to download the articles (code below). Now I want to count words in all of those articles, for example the word "Merkel".
It seems to be a bit complicated. I was able to make it work with the headlines (throw all headlines into one vector and count the words), but that was too detailed and too much code, because I had to combine the headlines for each month manually whenever the search returned more than one page of results, so I won't post all of that code here (I'm sure it can be done more easily, but that's another problem).
I think I messed something up, which is why I couldn't do the same with the HTML articles. The difference is that I scraped the titles directly, whereas the HTML files I had to download first.
So how can I go through my 10,000 (here only 45) HTML pages and look for certain keywords? Example for January: I download the articles with this code:
library(xml2)
library(rvest)
# search-result page for January 2015 (search term "Flüchtlinge", politics articles)
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")

# pull the article URLs out of the result list
link_nodes <- html_nodes(url_parsed1, css = ".entrylist__link")
html_links <- html_attr(link_nodes, "href")

# check the current working directory, then create a subfolder for the articles
getwd()
dir.create("html_articles")
setwd("html_articles")

# save each article as its own .html file
for (url in html_links) {
  newName <- paste0(basename(url), ".html")  # paste0 avoids the stray space that paste() adds
  download.file(url, destfile = newName)
}
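To illustrate what I'm after, here is a rough sketch of the counting step I have in mind: read each downloaded file back in, strip the tags with html_text(), and count the keyword with stringr. The use of stringr and counting on the text of the whole page are just my assumptions, not working code from my project:

library(xml2)
library(rvest)
library(stringr)

# all downloaded articles (assuming the working directory is still "html_articles")
files <- list.files(pattern = "\\.html$")

keyword <- "Merkel"
counts <- sapply(files, function(f) {
  page <- read_html(f)
  text <- html_text(page)            # plain text of the whole page, tags stripped
  str_count(text, fixed(keyword))    # occurrences of the keyword in this article
})

sum(counts)  # total occurrences across all articles

Counting on html_text() of the whole page would also include navigation and teaser text, so a CSS selector for the article body would probably be more accurate, but I don't know the right class for it.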
Thanks a lot for your help!