
I'm trying to open a GTFS file that is encoded in UTF-8, but even though I changed my project's encoding in R to UTF-8, the characters are still truncated. The problem can be seen in the "stop_name" column. I'm using Windows 10 and I know R has some encoding issues on that platform, but I can't tell which one is causing this.

Reproducible example:

install.packages('gtfstools')
library(gtfstools)

# GTFS file directory
data_path <- system.file("extdata", package = "gtfstools")
spo_path <- file.path(data_path, "spo_gtfs.zip")

# read the file
spo_gtfs <- read_gtfs(spo_path)

# Show the stops (problem with encoding)
head(spo_gtfs$stops)


Session info:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                       LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.1.2 tools_4.1.2   
rafa.pereira
Igor
  • Not able to reproduce on an Ubuntu machine. I'm getting what I assume to be the correct accents and descenders. Are you sure this isn't a locale and font problem? (You have not offered the generally needed `sessionInfo()` output.) – IRTFM Jan 21 '22 at 02:01
  • For `stop_name` I get: `"Clínicas" "Vila Madalena" "Consolação" "Conceição" "Jabaquara" "São Judas"`. If you look carefully at the first item you see that the first i is different from the second one. (I'm assuming this is Romanian or Castellano or some other Romance language with an alphabet that has non-English descenders.) These look pretty much the same to me in Chrome with US locale, but it's hard to know what you expected **or** what you are seeing. – IRTFM Jan 21 '22 at 02:59
  • @IRTFM Just updated the post with the sessionInfo() output. – Igor Jan 21 '22 at 12:17
  • @IRTFM Those names you showed are correct. The words are in Portuguese. Do you have any idea why I don't see them the same way? I already changed the project's encoding in RStudio to UTF-8 and also saved the file with UTF-8 encoding. – Igor Jan 21 '22 at 12:39
  • So this seems to be a Windows problem. The answer seems to be that your installation's locale has a default Windows font with its 1252 codepage mappings. It's ironic but perhaps unavoidable that a protocol that is touted as being "generalized" is not able to handle this issue. – IRTFM Jan 21 '22 at 16:18

1 Answer


You just need to use the `encoding` parameter of `read_gtfs()`:

library(gtfstools)

# GTFS file directory
data_path <- system.file("extdata", package = "gtfstools")
spo_path <- file.path(data_path, "spo_gtfs.zip")

# read the file
spo_gtfs <- read_gtfs(spo_path, encoding = "UTF-8")

# Show the stops (accented characters now display correctly)
head(spo_gtfs$stops)
#>    stop_id     stop_name stop_desc  stop_lat  stop_lon
#> 1:   18848      Clínicas           -23.55402 -46.67111
#> 2:   18849 Vila Madalena           -23.54650 -46.69114
#> 3:   18850    Consolação           -23.55809 -46.66020
#> 4:   18851     Conceição           -23.63504 -46.64124
#> 5:   18852     Jabaquara           -23.64600 -46.64103
#> 6:   18853     São Judas           -23.62588 -46.64094
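
For intuition about why telling R the encoding matters, here is a minimal base-R sketch. It is only an illustration, not what `read_gtfs()` does internally, and the variable names are made up for the example. It builds the UTF-8 bytes of "São" by hand, shows that R does not know their encoding until they are marked, and then shows the kind of garbling you get when the same bytes are treated as a single-byte Windows-style encoding (approximated here with `latin1`, which maps these particular bytes the same way as code page 1252).

```r
# UTF-8 bytes for "São", written as raw values so the example does not
# depend on how this script file itself is encoded.
sao_bytes <- as.raw(c(0x53, 0xc3, 0xa3, 0x6f))

# rawToChar() gives a string with no declared encoding; R will then
# interpret it using the session's native encoding (CP1252 in the question).
stop_name <- rawToChar(sao_bytes)
enc_before <- Encoding(stop_name)

# Declaring the encoding changes no bytes, only how R interprets them.
Encoding(stop_name) <- "UTF-8"
enc_after <- Encoding(stop_name)
n_bytes <- nchar(stop_name, type = "bytes")
n_chars <- nchar(stop_name, type = "chars")

# Treating the same bytes as latin1 turns the two-byte "ã" into two
# separate characters, which is the typical garbled result.
misread <- iconv(rawToChar(sao_bytes), from = "latin1", to = "UTF-8")
n_chars_misread <- nchar(misread, type = "chars")

print(c(enc_before, enc_after))
print(c(bytes = n_bytes, chars = n_chars, chars_misread = n_chars_misread))
```

Once the string is marked as UTF-8 it has three characters stored in four bytes, while the misread version has four characters. If you see that kind of extra character in `stop_name`, the bytes are most likely fine and only the declared encoding is wrong.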
dhersz