
I found this post over here that shows how to save the text from a website. Is there a simple way in R to extract only the text elements of an HTML page?

I tried one of the answers provided here and it seems to be working quite well! For example:

library(htm2txt)
url_1 <- 'https://en.wikipedia.org/wiki/Alan_Turing'
text_1 <- gettxt(url_1)

url_2 <- 'https://www.bbc.com/future/article/20220823-how-auckland-worlds-most-spongy-city-tackles-floods'
text_2 <- gettxt(url_2)

All the text from the article appears, but so does a lot of "extra text" that has no meaning. For example:

p. 40/03B\n• ^ a or identifiers\n• Articles with GND identifiers\n• Articles with ICCU identifiers\n•

  • Is there some standard way to keep only the actual text from these articles? Or does this depend too much on the individual structure of each website, so that no "one size fits all" solution exists for such a problem? (A structure-based sketch follows this list.)

  • Or is there perhaps some method in R that recognizes only the "actual text"?
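
One approach that does depend on page structure, but often works well enough, is to parse the page and keep only the text inside paragraph nodes. A minimal sketch, assuming the rvest package and assuming the article body actually sits in <p> tags (which varies from site to site):

library(rvest)

# Keep only the text inside <p> (paragraph) nodes of the page.
# Assumes the article body lives in <p> tags, which is site-dependent.
url_1 <- 'https://en.wikipedia.org/wiki/Alan_Turing'
paragraphs <- read_html(url_1) %>% html_elements("p") %>% html_text2()
article_text <- paste(paragraphs, collapse = "\n")

On a page like the Wikipedia article this should drop most of the navigation and identifier clutter, although any boilerplate that also sits in <p> tags will still come through.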

Thank you!

stats_noob

1 Answer


You can cross-reference the words from the HTML page against a dictionary from qdapDictionaries, so that only real English words are kept. Note that this still keeps words that aren't actually from the article (e.g., the word "jump" from "Jump to navigation").

library(tidyverse)
library(htm2txt)
library(quanteda)
library(qdapDictionaries)
data(DICTIONARY)

# Download the page, strip the HTML, and build a quanteda corpus
text <- 'https://en.wikipedia.org/wiki/Alan_Turing' %>% gettxt() %>% corpus()

# Tokenise, drop punctuation and numbers, then keep only dictionary words
text <- tokens(text, remove_punct = TRUE, remove_numbers = TRUE)
text <- tokens_select(text, DICTIONARY$word)

# Count how often each (lower-cased) word occurs
text <- data.frame(text = unlist(as.list(text)), stringsAsFactors = FALSE) %>%
  mutate(word = tolower(text)) %>%
  count(word, name = "frequency")
head(text)
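
For reference, the same steps can be pointed at the second URL from the question. This is just a re-run of the pipeline above on the BBC article (the counting step is identical), not anything new:

# Re-run the same filtering on the BBC article from the question
text_2 <- 'https://www.bbc.com/future/article/20220823-how-auckland-worlds-most-spongy-city-tackles-floods' %>%
  gettxt() %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_select(DICTIONARY$word)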
jrcalabrese