
I'm trying to build a calendar (a data frame) of upcoming soccer matches. I'm scraping the columns one by one because I don't need all of them. When I scrape the column with the kickoff time (HORA), I get values that are incorrect, and I don't know why. I don't think it has to do with the time zone, because it's just text.

library(rvest)
url <- "https://www.cruzados.cl/competitions/campeonato-nacional"
page <- read_html(url)

hora_inicio <- page %>% html_nodes("td.team-schedule__time") %>% html_text()

> hora_inicio
[1] "21:00" "22:30" "23:15" "22:30" "00:30" "00:00" "02:00" "02:00" "19:00" "22:00" "19:00" "22:15" "19:00" "02:00"
[15] "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00"
[29] "19:00" "19:00" "19:00" "19:00" "19:00" "20:00" "20:00" "20:00" "20:00" "20:00" "20:00"

The correct values are: 18:00, 19:30, 19:15, 18:30, 20:30, 20:00, 18:00, ...

Phil

1 Answer


In fact, the times in the HTML returned by the server are in UTC. JavaScript on the page then updates them to your local time zone, which is why the scraped text differs from what you see in the browser.
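As a quick sanity check of that explanation, the sketch below converts the first three UTC values from the page to Chilean local time. The zone name "America/Santiago" is an assumption (the question does not say where the asker is), and the dates are the ones paired with those times in the scraped HTML.

```r
# First three kickoff times as they appear in the raw HTML (UTC)
utc <- as.POSIXct(
  c("2021-03-21 21:00", "2021-03-28 22:30", "2021-04-04 23:15"),
  format = "%Y-%m-%d %H:%M", tz = "UTC"
)

# Assumption: the viewer's browser is in the America/Santiago zone
santiago <- format(utc, "%H:%M", tz = "America/Santiago")
print(santiago)
```

If the assumption holds, these should match the first three times the asker expected (18:00, 19:30, 19:15), including the switch of UTC offset in early April.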

The following extracts the dates and times, combines them, and converts the UTC datetimes to your current time zone:

library(rvest)

url <- "https://www.cruzados.cl/competitions/campeonato-nacional"
page <- read_html(url)

Sys.setlocale(locale="es_ES.UTF-8")

date <- page %>% html_nodes("td.team-schedule__date") %>% html_text()
time <- page %>% html_nodes("td.team-schedule__time") %>% html_text()

dates <- as.Date(gsub("sept", "sep", date), format="%a. %d / %b. / %Y") #dom. 21 / mar. / 2021

# Combine each date with its time as UTC, then convert to the local time zone
tzDates <- list()
for (i in seq_along(dates)) {
  utcDate <- as.POSIXct(paste0(format(dates[i], "%Y-%m-%d"), " ", time[i]), format = "%Y-%m-%d %H:%M", tz = "UTC")
  tzDates[[i]] <- as.POSIXlt(utcDate, tz = Sys.timezone())
}
print(tzDates)
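The same conversion can also be done without a loop, since `as.POSIXct` is vectorized. This is only a sketch: `to_local` is a hypothetical helper name, and the sample vectors stand in for the scraped `dates` and `time` so the block runs without network access.

```r
# Hypothetical helper: combine Date and "HH:MM" vectors as UTC,
# then display them in another time zone.
to_local <- function(dates, times, tz = Sys.timezone()) {
  utc <- as.POSIXct(
    paste(format(dates, "%Y-%m-%d"), times),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  )
  # Changing the tzone attribute changes how the instants are printed,
  # not the instants themselves.
  attr(utc, "tzone") <- tz
  utc
}

# Sample values taken from the first three rows of the page
sample_dates <- as.Date(c("2021-03-21", "2021-03-28", "2021-04-04"))
sample_times <- c("21:00", "22:30", "23:15")

local_times <- to_local(sample_dates, sample_times, tz = "Europe/Paris")
print(local_times)
```

The result is a single POSIXct vector rather than a list, which is usually easier to put into a data frame column.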

The es_ES.UTF-8 or es_CL.UTF-8 locale must be installed so that the abbreviated month and weekday names in Spanish can be parsed.
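If neither locale is available, or the abbreviations on the page do not match what the locale expects, one alternative is to parse the date without relying on the locale at all. The sketch below is an assumption-laden example, not part of the original answer: `parse_fecha` and the `meses` lookup are hypothetical names, and it assumes the strings look like "dom. 21 / mar. / 2021".

```r
# Hypothetical helper: parse strings such as "dom. 21 / mar. / 2021"
# without depending on the system locale.
parse_fecha <- function(x) {
  # First three letters of each Spanish month name, mapped to 1..12
  meses <- setNames(1:12, c("ene", "feb", "mar", "abr", "may", "jun",
                            "jul", "ago", "sep", "oct", "nov", "dic"))
  # Capture day, first three letters of the month, and year;
  # extra letters ("sept") and a trailing dot are tolerated.
  pattern <- paste0("([0-9]{1,2})[[:space:]]*/[[:space:]]*",
                    "([[:alpha:]]{3})[[:alpha:]]*[.]?[[:space:]]*/[[:space:]]*",
                    "([0-9]{4})")
  parts <- regmatches(x, regexec(pattern, x))
  iso <- vapply(parts, function(p) {
    if (length(p) != 4) return(NA_character_)  # string did not match
    month <- unname(meses[tolower(p[3])])      # NA if the month is unknown
    sprintf("%s-%02d-%02d", p[4], month, as.integer(p[2]))
  }, character(1))
  as.Date(iso, format = "%Y-%m-%d")
}

fechas <- parse_fecha(c("dom. 21 / mar. / 2021", "dom. 5 / sept. / 2021"))
print(fechas)
```

The weekday is ignored on purpose, since the day, month, and year already identify the date; strings that do not match the pattern come back as `NA`.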


In my case, I'm located in France, so you can see the clock change on 28 March from UTC+1 to UTC+2:

[1] "2021-03-21 22:00:00 CET"
[2] "2021-03-29 00:30:00 CEST"
[3] "2021-04-05 01:15:00 CEST"

And the HTML returned is (UTC):

[1] "2021-03-21 21:00"
[2] "2021-03-28 22:30"
[3] "2021-04-04 23:15"
Bertrand Martel
Thank you! Worked just fine. I hadn't realized that JS was updating the time zone. The line as.Date(gsub("sept", "sep",.....) didn't work for me because it didn't match the format, but I figured it out. Again, thanks! Really nice and clear explanation – vinsfontecilla Apr 22 '21 at 04:37