
I am scraping the Cricbuzz live scores page to get match details, and I am using SelectorGadget to find the CSS selectors. This is what I have so far:

library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
crickbuzz <- read_html(httr::GET("http://www.cricbuzz.com/cricket-match/live-scores"))
matches_dates <- crickbuzz %>%
    html_nodes(".schedule-date:nth-child(1)") %>%
    html_text()

I have fetched the matches, scores and venues, but I am having difficulty fetching the dates. The code above gives the following result:

> matches_dates
     "   -     " "   -     " "   "       "   "       "   "       "   "   "  "      
    "   "       "   "       "   "       "   -     " "   -     " "   -     "

So I get 21 elements, which is right because there are currently 21 matches, but the text is missing.

Then I looked at what html_nodes() returns, and it gives:

{xml_nodeset (21)}
1 <span class="schedule-date" timestamp="1452132000000" format="MMM dd'">
  </span>
2 <span class="schedule-date" timestamp="1452132000000" format="MMM dd'">
  </span>
3 <span class="schedule-date" timestamp="1452132000000" format="MMM dd'">
  </span>
and so on...

This means the tags contain no text for me to extract. How can I get the dates?

KrunalParmar

1 Answer


The spans have no text content; the date is stored in the timestamp attribute, so you need to extract it with html_attr():

library(rvest)
crickbuzz <- read_html(httr::GET("http://www.cricbuzz.com/cricket-match/live-scores"))
matches_dates <- crickbuzz %>%
    html_nodes(".schedule-date:nth-child(1)") %>%
    html_attr("timestamp")

matches_dates
 [1] "1452268800000" "1452132000000" "1452247200000" "1452242400000" "1452327000000" "1452290400000" "1452310200000" "1452310200000" "1452310200000"
[10] "1452310200000" "1452324600000" "1452324600000" "1452324600000" "1452324600000" "1452324600000" "1452150000000" "1452153600000" "1452153600000"

These values are Unix epoch timestamps in milliseconds, returned as character strings. If you need to convert them to a date-time format, divide by 1000 to get seconds and follow the answer to this question:
http://stackoverflow.com/questions/13456241/convert-unix-epoch-to-date-object-in-r
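
As a rough sketch of that conversion, using only base R: the three sample values below are copied from the output above, and the UTC time zone and the name match_times are assumptions made for illustration (the site may display dates in a different zone).

```r
# Sample values copied from the output above (milliseconds since the Unix epoch).
matches_dates <- c("1452268800000", "1452132000000", "1452247200000")

# html_attr() returns character strings, so convert to numeric first,
# then scale milliseconds to seconds before building the date-times.
# tz = "UTC" is an assumption; pick the zone you actually need.
match_times <- as.POSIXct(as.numeric(matches_dates) / 1000,
                          origin = "1970-01-01", tz = "UTC")

# Full date-times
match_times

# Month and day only, similar to the page's "MMM dd" format attribute.
# The month abbreviation produced by %b depends on your locale.
format(match_times, "%b %d")
```

Replacing the sample vector with the scraped matches_dates should work the same way, since as.numeric() and as.POSIXct() are both vectorised.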
user227710