
I want to count words in HTML articles using R. Scraping data such as titles works nicely, and I was able to download the articles (code below). Now I want to count words in all of those articles, for example the word "Merkel".

It seems to be a bit complicated. I was able to make it work with the headlines (put all headlines into one vector and count the words), but that approach took too much code, because I had to combine the headlines for each month manually whenever the search returned more than one page of results. That's why I won't post all of that code here (I'm sure it can be done more simply, but that's a separate problem).

I think I messed something up, which is why I couldn't do the same with the HTML articles. The difference is that I scraped the titles directly, whereas the HTML files had to be downloaded first.

So how can I go through my 10,000 (here only 45) HTML pages and look for certain keywords? As an example for January, I download the articles with this code:

library(xml2)
library(rvest)

# Search results page (January 2015 in this example)
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
link_nodes <- html_nodes(url_parsed1, css = ".entrylist__link")
html_links <- html_attr(link_nodes, "href")

getwd()                      # check the current working directory
dir.create("html_articles")
setwd("html_articles")

# Download every article into the html_articles folder
for (url in html_links) {
  newName <- paste0(basename(url), ".html")
  download.file(url, destfile = newName)
}
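
A quick way to verify the downloads, assuming the working directory is still the html_articles folder, is to list the saved files:

list.files(pattern = "\\.html$")   # should show one .html file per article link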

Thanks a lot for your help!

1 Answer

I hope I understood your question correctly:

library(xml2)
library(rvest)
library(XML)
library(stringr)   # for str_count()

url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
link_nodes <- html_nodes(url_parsed1, css = ".entrylist__link")
html_links <- html_attr(link_nodes, "href")

dir.create("html_articles")
setwd("html_articles")

merkel_counts <- numeric(0)   # one count per article

for (url_org in html_links) {
  # url_org <- html_links[1]   # handy for testing a single link
  newName <- paste0(basename(url_org), ".html")
  download.file(url_org, destfile = newName)

  # Read and parse the HTML page
  doc.html <- htmlTreeParse(url_org, useInternal = TRUE)

  # Extract all paragraphs (<p> tags, starting at the root of the
  # document); unlist() flattens the result into a character vector
  doc.text <- unlist(xpathApply(doc.html, '//p', xmlValue))

  # Replace newlines with spaces and join all elements into a
  # single character string
  doc.text <- gsub('\\n', ' ', doc.text)
  doc.text <- paste(doc.text, collapse = ' ')

  # Count the occurrences of the word "Merkel" in that article
  merkel_counts <- c(merkel_counts, str_count(doc.text, "Merkel"))
}
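
With the per-article counts collected in merkel_counts above, the total for the month can then be obtained with, for example:

sum(merkel_counts)   # total number of "Merkel" occurrences in January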

Credit for this approach goes to here and here.

  • Thanks a lot, that is already quite helpful! But why do you put the counting inside the for loop? That would require downloading everything again and again whenever we want to check other words. The biggest problem now is navigating through the already downloaded articles. Also: does the code work for you? It counts the occurrences of the word "Merkel" in the title (= 1), not in the whole HTML, and url_org only contains one article, so we would need another loop to iterate through all articles in the folder. I know how to do this in Java, but in R... – matt Jan 03 '18 at 16:50
  • length(grep("Merkel", doc.text)) does not seem to work correctly; str_count(doc.text, "Merkel") solves that problem. Now a loop is still missing that goes through a folder, looks in each HTML file, and adds the counts up to a sum. – matt Jan 03 '18 at 17:11