Good afternoon,
Thanks for helping me out with this question.
I have a set of >5000 URLs within a list that I am interested in scraping. I have used lapply and readLines to extract the text for these webpages using the sample code below:
multipleURL <- c(
  "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all",
  "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1407&start=1&labeltype=all",
  "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1975&start=1&labeltype=all"
)
multipleText <- lapply(multipleURL, readLines)
Now I would like to search each of these texts for the word "radioactive". I only need to know whether the term appears anywhere in each page's text, so I have been using grepl, the logical version of grep:
radioactive <- grepl("radioactive" , multipleText, ignore.case = TRUE)
When I count the number of items in our list that contain the word "radioactive" it returns a count of 0:
library(plyr)  # count() here is plyr::count
count(radioactive)
      x freq
1 FALSE    3
However, a cursory review of the webpages for these URLs reveals that the first link (http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all) DOES in fact contain the word "radioactive". The text stored in "multipleText" also includes the word, yet the grepl call above doesn't seem to pick it up.
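For comparison, here is a minimal sketch of a per-page check. It is only an illustration, not a confirmed diagnosis. grepl is documented to take a character vector (or something coercible to one by as.character), whereas "multipleText" is a list holding one character vector per page, so the sketch searches each page's lines separately and collapses the result with any(). The names sampleText and hasTerm, and the page contents, are made up so the example runs without network access.

```r
# Hypothetical stand-in for the result of lapply(multipleURL, readLines):
# a list holding one character vector (one string per line) for each page.
sampleText <- list(
  c("<html>", "This product is Radioactive.", "</html>"),
  c("<html>", "No match on this page.", "</html>"),
  c("<html>", "Nothing here either.", "</html>")
)

# Search the lines of each page separately, then reduce to one logical per page.
hasTerm <- vapply(
  sampleText,
  function(pageLines) any(grepl("radioactive", pageLines, ignore.case = TRUE)),
  logical(1)
)

# Tally pages with and without the term using base R only.
print(table(hasTerm))
print(sum(hasTerm))
```

If this pattern fits the real data, the same vapply call could be applied to "multipleText" directly. table() and sum() are used here in place of count() so that no extra package is needed.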
Any thoughts on what I am doing wrong would be greatly appreciated.
Many thanks,
Chris