I am using rvest to scrape some data. If I print my URLs variable, I get:

[32] "soccerstats.com/matches.asp?matchday=6"                                     

[33] "soccerstats.com/pmatch.asp?league=argentina&matchid=422&t1=5&t2=14&ly=2017" 

[34] "soccerstats.com/pmatch.asp?league=argentina&matchid=432&t1=23&t2=26&ly=2017"

[35] "soccerstats.com/pmatch.asp?league=argentina&matchid=425&t1=11&t2=10&ly=2017"

There is a mixture of URLs in the data set but I am only interested in the URLs that contain:

soccerstats.com/pmatch.asp?league=

I am trying to filter them by:

oversdf <- data.frame(URLs=URLs)

rownames(oversdf) # This returns 1,2,3,4 etc as expected

grep("^soccerstats.com/pmatch.asp?league",rownames(oversdf)) # This then doesn't return any results

Any ideas what I am doing wrong? I just want to return only the URLs that contain a certain string.

Cheers


library(rvest)

URL <- "http://www.soccerstats.com/matches.asp" #Feed page

WS <- read_html (URL) #reads webpage into WS variable

URLs <- WS %>% html_nodes ("a:nth-child(1)") %>% html_attr("href") %>% as.character() # Get the CSS nodes & extract the URLs

URLs <- paste0("http://www.soccerstats.com/",URLs)

oversdf <- data.frame(URLs=URLs)

rownames(oversdf) #returns a vector of row names in the overs data.frame:

grep("^pmatch.asp?league",oversdf$URLs)

Pete Stilgoe
  • It would be easier if we could see your code properly (see https://stackoverflow.com/editing-help ). Meanwhile, have you seen https://stackoverflow.com/questions/13043928/selecting-rows-where-a-column-has-a-string-like-hsa-partial-string-match – anotherfred Jun 22 '17 at 14:55
  • `grep("^soccerstats.com/pmatch.asp?league",rownames(oversdf))` is searching for the string in the rownames. Change `rownames(oversdf)` to something like `oversdf$URLs`. The idea of the `grep` is to *return* the indices of the matching rows, which you then use to subset your data frame. – anotherfred Jun 22 '17 at 15:24
  • Hi @anotherfred, thanks for looking. I have tried the change above but it's still returning no results; see full code: – Pete Stilgoe Jun 23 '17 at 09:48
  • library(rvest)
    URL <- "http://www.soccerstats.com/matches.asp" #Feed page
    WS <- read_html (URL) #reads webpage into WS variable
    URLs <- WS %>% html_nodes ("a:nth-child(1)") %>% html_attr("href") %>% as.character() # Get the CSS nodes & extract the URLs
    URLs <- paste0("http://www.soccerstats.com/",URLs)
    oversdf <- data.frame(URLs=URLs)
    rownames(oversdf) #returns a vector of row names in the overs data.frame:
    grep("^http://www.soccerstats.com/pmatch.asp?league",oversdf$URLs)
    – Pete Stilgoe Jun 23 '17 at 09:48
  • @anotherfred I have posted the code in the original question above as I can't seem to format it in the comment section. – Pete Stilgoe Jun 23 '17 at 10:00
  • Looking at other forum posts, this line should work, but it doesn't in this case for some reason: grep("^pmatch.asp?league", oversdf$URLs) – Pete Stilgoe Jun 23 '17 at 10:27
  • You need to escape the '?' in your string. It has a special meaning in regular expressions. Put '\\' before it (one '\' for R and one for grep; this is not the normal escape, see https://stackoverflow.com/questions/10602433/how-to-escape-a-question-mark-in-r ). – anotherfred Jun 23 '17 at 13:59
  • That works, thanks :) – Pete Stilgoe Jun 25 '17 at 13:03
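The two points from the comments (search the URL column rather than the row names, and escape the '?') can be put together in a minimal sketch. It assumes a small hand-typed vector in place of the scraped URLs, using values from the printed output in the question; the names `unescaped`, `escaped`, `fixed_match` and `match_urls` are only illustrative. Note that `^` anchors the pattern to the start of each string, so if the URLs carry the "http://www.soccerstats.com/" prefix, as in the full code, the prefix has to be part of the pattern or the `^` has to be dropped.

```r
# Stand-in for the scraped URLs (assumption: values copied from the question's output)
URLs <- c(
  "soccerstats.com/matches.asp?matchday=6",
  "soccerstats.com/pmatch.asp?league=argentina&matchid=422&t1=5&t2=14&ly=2017",
  "soccerstats.com/pmatch.asp?league=argentina&matchid=432&t1=23&t2=26&ly=2017"
)
oversdf <- data.frame(URLs = URLs, stringsAsFactors = FALSE)

# Unescaped: "p?" means "an optional p", so the literal "?" in the URL is never matched
unescaped <- grep("^soccerstats.com/pmatch.asp?league", oversdf$URLs)

# Escaped: "\\?" matches a literal question mark (and "\\." a literal dot)
escaped <- grep("^soccerstats\\.com/pmatch\\.asp\\?league", oversdf$URLs)

# Alternative: fixed = TRUE treats the pattern as a plain string,
# so nothing needs escaping (but "^" can no longer be used as an anchor)
fixed_match <- grep("soccerstats.com/pmatch.asp?league", oversdf$URLs, fixed = TRUE)

# grep() returns the indices of matching elements; use them to subset the data frame
match_urls <- oversdf[escaped, , drop = FALSE]
print(match_urls)
```

`grepl()` with the same pattern returns a logical vector instead of indices, which can be used for the subsetting step in the same way.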

0 Answers