I am using rvest to scrape some data. If I print my URLs variable I get:
[32] "soccerstats.com/matches.asp?matchday=6"
[33] "soccerstats.com/pmatch.asp?league=argentina&matchid=422&t1=5&t2=14&ly=2017"
[34] "soccerstats.com/pmatch.asp?league=argentina&matchid=432&t1=23&t2=26&ly=2017"
[35] "soccerstats.com/pmatch.asp?league=argentina&matchid=425&t1=11&t2=10&ly=2017"
There is a mixture of URLs in the data set but I am only interested in the URLs that contain:
soccerstats.com/pmatch.asp?league=
I am trying to filter them by:
oversdf <- data.frame(URLs=URLs)
rownames(oversdf) # This returns 1,2,3,4 etc as expected
grep("^soccerstats.com/pmatch.asp?league",rownames(oversdf)) # This then doesn't return any results
Any ideas what I am doing wrong? I just want to return only the URLs that contain a certain string.
Cheers
library(rvest)
URL <- "http://www.soccerstats.com/matches.asp" #Feed page
WS <- read_html (URL) #reads webpage into WS variable
URLs <- WS %>% html_nodes ("a:nth-child(1)") %>% html_attr("href") %>% as.character() # Get the CSS nodes & extract the URLs
URLs <- paste0("http://www.soccerstats.com/",URLs)
oversdf <- data.frame(URLs=URLs)
rownames(oversdf) #returns a vector of row names in the overs data.frame:
grep("^pmatch.asp?league",oversdf$URLs)
URL <- "http://www.soccerstats.com/matches.asp" #Feed page
WS <- read_html (URL) #reads webpage into WS variable
URLs <- WS %>% html_nodes ("a:nth-child(1)") %>% html_attr("href") %>% as.character() # Get the CSS nodes & extract the URLs
URLs <- paste0("http://www.soccerstats.com/",URLs)
oversdf <- data.frame(URLs=URLs)
rownames(oversdf) #returns a vector of row names in the overs data.frame:
grep("^http://www.soccerstats.com/pmatch.asp?league",oversdf$URLs)
– Pete Stilgoe Jun 23 '17 at 09:48
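For what it's worth, here is a minimal sketch of what may be going wrong, not a definitive fix. Two things stand out in the attempts above: the first grep() searches rownames(oversdf), which are just "1", "2", "3" and so on, rather than the URLs themselves; and in a regular expression "?" is a quantifier (it makes the preceding "p" optional) while "." matches any character, so "asp?league" never matches a literal question mark. The sample vector below is an assumption copied from the printed output at the top of the question; the real scraped vector would take its place.

```r
# Sample vector standing in for the scraped URLs
# (assumed values, copied from the printed output in the question)
URLs <- c(
  "soccerstats.com/matches.asp?matchday=6",
  "soccerstats.com/pmatch.asp?league=argentina&matchid=422&t1=5&t2=14&ly=2017",
  "soccerstats.com/pmatch.asp?league=argentina&matchid=432&t1=23&t2=26&ly=2017",
  "soccerstats.com/pmatch.asp?league=argentina&matchid=425&t1=11&t2=10&ly=2017"
)

# fixed = TRUE treats the pattern as a literal string, so "?" and "." lose
# their regex meaning; value = TRUE returns the matching strings, not indices
matches <- grep("soccerstats.com/pmatch.asp?league=", URLs,
                fixed = TRUE, value = TRUE)

# Regex alternative: anchor at the start and escape the metacharacters
matches_regex <- grep("^soccerstats\\.com/pmatch\\.asp\\?league=", URLs,
                      value = TRUE)
```

If the URLs have been prefixed with "http://www.soccerstats.com/" via paste0(), the anchored regex would need that prefix too. The unanchored fixed = TRUE form only needs it if the pattern keeps its leading "soccerstats.com/", because the prefixed URLs contain "www.soccerstats.com//pmatch.asp?league=" rather than that exact string, so starting the pattern at "pmatch.asp?league=" avoids the issue.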