
Good afternoon,

Thanks for helping me out with this question.

I have a set of >5000 URLs within a list that I am interested in scraping. I have used lapply and readLines to extract the text for these webpages using the sample code below:

multipleURL <- c("http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all",
                 "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1407&start=1&labeltype=all",
                 "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1975&start=1&labeltype=all")
multipleText <- lapply(multipleURL, readLines)

Now I would like to query each of these texts for the word "radioactive". I am simply interested in whether this term is mentioned in the text and have been using grepl, the logical form of grep:

radioactive <- grepl("radioactive" , multipleText, ignore.case = TRUE)

When I count the number of items in our list that contain the word "radioactive", it returns zero matches:

count(radioactive)
      x freq
1 FALSE    3
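
For reference, count() above is plyr's count(); base R gives the same tally with:

table(radioactive)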

However, a cursory review of the webpages for each of these URLs reveals that the first link (http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all) DOES in fact contain the word radioactive. Our "multipleText" list even includes the word radioactive, although our grepl command doesn't seem to pick it up.
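
For what it's worth, the behaviour I expected is a per-page test; a minimal sketch of that, searching each page's lines separately:

## test each page's lines separately, then ask whether any line matched
radioactive <- sapply(multipleText, function(page)
                      any(grepl("radioactive", page, ignore.case = TRUE)))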

Any thoughts on what I am doing wrong would be greatly appreciated.

Many thanks,

Chris

Entropy
  • Are you trying to parse HTML with regular expressions? Maybe you should read [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – agstudy Jun 25 '13 at 23:28

1 Answer


I think you should parse your document with an HTML parser. Here I am using the XML package: I convert the document to an R list and then apply grep to it.

library(XML)
multipleText <- lapply(multipleURL, function(x) {
  ## parse the HTML and convert it to a nested R list
  y <- xmlToList(htmlParse(x))
  ## flatten the list so every text node becomes a character element
  y.flat <- unlist(y, recursive = TRUE)
  ## count matches in both the values and the names of the flattened list
  length(grep('radioactive', c(y.flat, names(y.flat))))
})

multipleText
[[1]]
[1] 8

[[2]]
[1] 0

[[3]]
[1] 0

EDIT: to search for multiple words:

## define your words here
WORDS <- c('CLINICAL ','solution','Action','radioactive','Effects')
library(XML)
multipleText <- lapply(multipleURL, function(x) {
  ## parse once per URL, then search the flattened text for every word
  y <- xmlToList(htmlParse(x))
  y.flat <- unlist(y, recursive = TRUE)
  sapply(WORDS, function(w)
    length(grep(w, c(y.flat, names(y.flat)))))
})
do.call(rbind,multipleText)

     CLINICAL  solution Action radioactive Effects
[1,]         6       10      2           8       2
[2,]         1        3      1           0       3
[3,]         6       22      2           0       6

PS: maybe you should use ignore.case = TRUE for the grep command.
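
For example, the counting step with case folded (same y.flat as above) would be:

sapply(WORDS, function(w)
  length(grep(w, c(y.flat, names(y.flat)), ignore.case = TRUE)))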

agstudy
  • Thank you -- this works brilliantly! Are you aware of a way to adapt this code to perform multiple grep queries without re-parsing the document? I have about a dozen words to look for, and running the above code once per word turned out to be quite time-consuming, since each URL gets re-parsed every time. Is there a way to "nest" multiple grep queries and assign the outputs to different variables in the data frame? – Entropy Jun 26 '13 at 14:40
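
The EDIT above already does this: each document is parsed once and all of WORDS are searched against the same flattened text. Making the separation explicit (a sketch; parsedText is just an illustrative name):

## parse each URL once and keep only the flattened text
parsedText <- lapply(multipleURL, function(x) {
  y.flat <- unlist(xmlToList(htmlParse(x)), recursive = TRUE)
  c(y.flat, names(y.flat))
})
## any number of grep queries can now run without re-parsing
do.call(rbind, lapply(parsedText, function(txt)
  sapply(WORDS, function(w) length(grep(w, txt)))))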