Information of a webpage using R

Question

How can I save the text of a webpage in a variable, and then find the words in that text that have more than 9 letters? I read the page like this:

web_page <- readLines("en.neyshabur.ac.ir/en/119-about-city-of-neyshabur/1232-city-of-neyshabur")

With this code, I count the number of words in the text of the webpage:

sum(sapply(strsplit(web_page," "),length))

But I do not know how to find the words in the text that have more than 9 letters.

How can I provide my submitted question as a minimal reproducible example? — Rojer, Apr 27 '20 at 17:24
Here is my code. With this code I save a webpage in a variable: web_page <- readLines("http://en.neyshabur.ac.ir/en/119-about-city-of-neyshabur/1232-city-of-neyshabur") With this code, I count the numbers of words inside the text of the webpage: sum(sapply(strsplit(web_page," "),length)) but I do not know how to find words (of the text) that have more than 9 letters?? — Rojer, Apr 27 '20 at 17:36
en.neyshabur.ac.ir/en/119-about-city-of-neyshabur/1232-city-of-neyshabur — Rojer, Apr 27 '20 at 18:36
A guide to asking good questions can be found here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 — MrGumble, Apr 28 '20 at 08:47

score 1 · Answer 1 · answered Apr 28 '20 at 08:56

You have fallen into an entirely new world of things that are not what you think they are. Welcome to R.

Firstly, when you run the line

web_page <- readLines("en.neyshabur.ac.ir/en/119-about-city-of-neyshabur/…)

(verbatim from your question), you will get an error. There are three reasons:

a) you did not paste the entire line into the question, so part of the URL is truncated and replaced with an ellipsis ("..."),

b) the closing quotation mark (") is missing,

c) readLines treats the string as the path of a local file.

You and I know that it is in fact a URL, but you have to tell readLines explicitly to use the HTTP protocol. You do so by using the full URL, which starts with http://.
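As a minimal sketch of that point, the snippet below prepends http:// to a string that has no protocol. The helper name ensure_scheme is hypothetical (it is not part of base R or of this answer), and the readLines call is left commented out because it needs network access.

```r
# Hypothetical helper: prepend "http://" when the string does not
# already start with a protocol such as http:// or https://.
ensure_scheme <- function(url) {
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", url)
  ifelse(has_scheme, url, paste0("http://", url))
}

url_bare <- "en.neyshabur.ac.ir/en/119-about-city-of-neyshabur/1232-city-of-neyshabur"
url_full <- ensure_scheme(url_bare)

# With the protocol in place, readLines opens a URL connection
# instead of looking for a local file (requires network access):
# web_page <- readLines(url_full, warn = FALSE)
```

A string that already carries a protocol is returned unchanged, so the helper can be applied to either form.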

The next obstacle is the content of the strings. Go ahead and try print(head(web_page)): what you see is the HTML structure of the page, which includes the text you want to work with. You can view it in full if you open the URL in a browser, right-click somewhere and select "View source" (or similar). You now need to extract the relevant text from all that HTML structure.
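To illustrate what "extracting the text" means, here is a rough base R sketch that strips tags with a regular expression. The HTML lines are made up for the example and stand in for the output of readLines. This is a crude approach under simple assumptions: it would also keep the contents of script and style elements, which is one reason a real HTML parser is usually the better tool.

```r
# Made-up sample standing in for the lines returned by readLines()
html_lines <- c(
  "<html><head><title>Sample</title></head>",
  "<body><p>Neyshabur is a <b>historical</b> city.</p></body></html>"
)

# Join the lines, replace every tag with a space, then tidy the whitespace
html <- paste(html_lines, collapse = " ")
no_tags <- gsub("<[^>]+>", " ", html)
plain <- trimws(gsub("[[:space:]]+", " ", no_tags))

print(plain)
```

Note that the page title ends up in the result as well, because the regular expression knows nothing about the document structure.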

I suggest you google "web scraping r" and read up on some of the tutorials on how to deal with extracting information from webpages.

score 0 · Answer 2 · answered Apr 28 '20 at 09:24

Here is a first draft that works for me. It is probably not 100% what you need, but it should show how to get the desired results.

This shows what your code does:

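url_name <- "http://en.neyshabur.ac.ir/en/119-about-city-of-neyshabur/1232-city-of-neyshabur"

# Read the raw HTML source, one element per line
web_page <- readLines(url_name)
# Count the space-separated tokens (HTML markup included)
sum(sapply(strsplit(web_page, " "), length))
# Take the first two lines and split them into tokens
first_lines <- web_page[1:2]
tokens <- unlist(strsplit(first_lines, " "))
# Keep the tokens with more than 9 characters
tokens[nchar(tokens) > 9]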

If you replace the last part with this

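library(RCurl)
library(htmltidy)
library(XML)
# Download the page as a single string
doc.raw <- RCurl::getURL(url_name)
# Clean up malformed HTML before parsing
doc <- htmltidy::tidy_html(doc.raw)
# Parse the string into an internal document tree
html <- XML::htmlTreeParse(doc, asText = TRUE, useInternalNodes = TRUE)
# Select the text nodes of the body, skipping script, style and noscript content
txt <- XML::xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", XML::xmlValue)
# Print one text node; the index depends on the structure of the page
print(unlist(txt[[282]]))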

it should get close. Hope that helps.
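Once the plain text is available, the original question (words with more than 9 letters) could be handled along these lines. This is a sketch in base R on made-up sample text, not on the actual page. It assumes that a "word" is a run of alphabetic characters, so punctuation and digits act as separators.

```r
# Made-up sample text standing in for the text extracted from the page
txt <- c(
  "Neyshabur is a city in northeastern Iran.",
  "It is known for its turquoise and its agricultural heritage."
)

# Split on anything that is not a letter, then drop empty strings
words <- unlist(strsplit(txt, "[^[:alpha:]]+"))
words <- words[nzchar(words)]

# Keep each word with more than 9 letters once
long_words <- unique(words[nchar(words) > 9])
print(long_words)
```

Splitting on a single space, as in the question, would leave punctuation attached to the words and inflate their length, which is why the sketch splits on non-letters instead.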
