How can we extract information from subdomain using Rcrawler in R?

Question

I want to extract the content of a page on a website, starting from the site's main URL.

I tried using Rcrawler:

library(Rcrawler)

Rcrawler(Website = "http://www.xbyte-technolabs.com/", no_cores = 4, no_conn = 4, ExtractCSSPat = c(".address"))

After running this code I get the default INDEX variable, which lists all the URLs of the website. One of them is "http://xbyte-technolabs.com/contact_us.php", and I want to extract the contact details from it.

Can someone please guide me on how to get to this particular URL from the main URL "http://xbyte-technolabs.com/" using Rcrawler in R?

Premal · Answer 1 · 2017-12-22T07:55:13.093

library(Rcrawler)

Rcrawler("http://www.xbyte-technolabs.com/",no_cores = 4,no_conn = 4)

for (i in length(INDEX)) {
  for (j in nrow(INDEX)) {

    Rcrawler(Website = INDEX[[i]][j], no_cores = 4, no_conn = 4, ExtractCSSPat = c(".address"))

  }

}
#Rcrawler(Website = INDEX[[i]][23], no_cores = 4, no_conn = 4, ExtractCSSPat = c(".address"))
class(DATA)
head(DATA)

ad <- DATA[[1]]
ad <- as.character(ad)
cat(ad)

Sorry, I think something is wrong with this code. Running it gives the following error:

Error in strsplit(gsub("http://|https://|www\.", "", Website), "/")[[c(1, : subscript out of bounds
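
One possible cause, assuming INDEX is a data frame as ?Rcrawler describes: `for (i in length(INDEX))` does not loop over a range. It visits only the single value `length(INDEX)`, which for a data frame is the number of columns, and `nrow(INDEX)` likewise yields only the last row. The value passed as Website is then not a URL, which would be consistent with the error above. The sketch below illustrates this with a made-up stand-in for INDEX (the URLs and the Level column are hypothetical) and does not call Rcrawler at all:

```r
# Hypothetical stand-in for INDEX (made-up URLs, three rows and three columns)
INDEX <- data.frame(
  Id    = c("1", "2", "3"),
  Url   = c("http://example.com/",
            "http://example.com/about.php",
            "http://example.com/contact_us.php"),
  Level = c(0, 1, 1),
  stringsAsFactors = FALSE
)

# `for (i in length(INDEX))` visits only the single value length(INDEX),
# which for a data frame is the number of columns
visited <- c()
for (i in length(INDEX)) {
  for (j in nrow(INDEX)) {
    visited <- c(visited, INDEX[[i]][j])
  }
}
print(visited)  # one value from the last column and last row, not a URL

# Iterating over the Url column visits every URL once
urls <- character(0)
for (j in seq_len(nrow(INDEX))) {
  urls <- c(urls, INDEX$Url[j])
}
print(urls)
```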

Otto Kässi · Accepted Answer · 2017-12-22T15:07:12.010

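library(Rcrawler)

# Crawl the site and extract the content of every element matching ".address"
Rcrawler(Website = "http://www.xbyte-technolabs.com/", no_cores = 1, no_conn = 1, ExtractCSSPat = c(".address"))

# Find the Id of the contact page in INDEX, then use it to index into DATA
pageid <- as.numeric(INDEX$Id[INDEX$Url == "http://xbyte-technolabs.com/contact_us.php"])
DATA[pageid]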

According to ?Rcrawler, Rcrawler creates two global variables:

INDEX: A data frame in the global environment representing the generic URL index, including the list of fetched URLs and page details (content type, HTTP state, number of out-links and in-links, encoding type, and level), and

DATA: A list of lists in the global environment holding the scraped contents.

The Id variable in INDEX corresponds to the position of the list element in DATA. The code snippet above looks up the Id of the URL you are interested in and uses it to select the matching element of DATA.
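
The Id-to-position mapping can be tried without running a crawl. The following is a minimal sketch that uses made-up stand-ins for INDEX and DATA; the example.com URLs, the scraped strings, and the assumption that Id is stored as character are all hypothetical:

```r
# Hypothetical stand-ins for the INDEX and DATA objects that Rcrawler creates;
# the URLs and scraped text below are made up for illustration.
INDEX <- data.frame(
  Id  = c("1", "2", "3"),
  Url = c("http://example.com/",
          "http://example.com/about.php",
          "http://example.com/contact_us.php"),
  stringsAsFactors = FALSE
)
DATA <- list(
  list("home address"),
  list("about address"),
  list("contact address")
)

# Look up the Id of the page of interest and convert it to a numeric position
pageid <- as.numeric(INDEX$Id[INDEX$Url == "http://example.com/contact_us.php"])

# DATA[pageid] keeps the outer list wrapper; DATA[[pageid]] returns the element itself
contact <- DATA[[pageid]]
print(contact)
```

If the URL does not appear in INDEX, `pageid` has length zero, so it is worth checking `length(pageid)` before indexing.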

Sidenote: since you already know the URL you are looking for, crawling the whole website seems like overkill.
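
One way to skip the crawl, sketched here under the assumption that the installed Rcrawler version provides `ContentScraper()` with `Url` and `CssPatterns` arguments, is to scrape the known page directly. `scrape_address()` is a hypothetical helper name, not part of the package:

```r
# A sketch that scrapes one known page instead of crawling the whole site.
# scrape_address() is a hypothetical helper; ContentScraper() comes from Rcrawler.
scrape_address <- function(url, css = ".address") {
  # Fetch the page at `url` and return the content matching the CSS selector
  Rcrawler::ContentScraper(Url = url, CssPatterns = css)
}

# Example call (needs the Rcrawler package and network access):
# contact <- scrape_address("http://xbyte-technolabs.com/contact_us.php")
# print(contact)
```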

edited Dec 22 '17 at 15:07

answered Dec 22 '17 at 07:40

While this code snippet may be the solution, [including an explanation](//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – yivi Dec 22 '17 at 14:13
@yivi I added a bit of explanation to my response. Cheers! – Otto Kässi Dec 22 '17 at 15:07
@OttoKässi Thank you so much for your answer. It will help me extract data starting from the main URL. – Premal Dec 23 '17 at 05:40
@OttoKässi I want to do this in general: if I input a main URL, the code should run and give me the client's contact details. That is what I am trying to do in R. – Premal Dec 23 '17 at 05:59
