I am using the osmdata package to extract data from OpenStreetMap (OSM) and turn it into an sf object. Unfortunately, I have not found a way to get the encoding right using the functions of the osmdata and sf packages. Currently, I fix the encoding afterwards with the Encoding function, which is cumbersome because it requires a nested loop (over all data frames contained in the query result, and over all character columns within these data frames).

Is there a more generic, nicer way to get the encoding right?

The following code shows the problem. It extracts OSM data on pharmacies in the German city of Neumünster:

library(osmdata)
library(sf) 
library(purrr)

results <- opq(bbox = "Neumünster, Germany") %>%
   add_osm_feature(key = "amenity", value = "pharmacy") %>% 
   osmdata_sf()
pharmacy_points <- results$osm_points
head(pharmacy_points$addr.city)


My locale and encoding are set as follows: [screenshot of the locale and encoding settings]

My current, but unsatisfactory, solution is the following:

encode_osm <- function(osm) {
  # For all data frames in the query result
  for (df in names(osm)[map_lgl(osm, is.data.frame)]) {
    # For all columns of that data frame
    for (col in names(osm[[df]])) {
      # Only character columns can carry an encoding mark;
      # this also skips the "geometry" list column
      if (is.character(osm[[df]][[col]])) {
        # Declare the strings as UTF-8 (the bytes stay unchanged)
        Encoding(osm[[df]][[col]]) <- "UTF-8"
      }
    }
  }
  osm
}

results_encoded <- encode_osm(results)
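
To make clear what the workaround does, here is a minimal sketch on a toy data frame (not real OSM data, and no network access needed). It assumes the strings arrive as valid UTF-8 bytes that merely lack an encoding mark; `mark_utf8` and `toy` are hypothetical names used only for illustration.

```r
# "Neumünster" as raw UTF-8 bytes, turned into a string without an encoding mark.
city_bytes <- as.raw(c(0x4e, 0x65, 0x75, 0x6d, 0xc3, 0xbc,
                       0x6e, 0x73, 0x74, 0x65, 0x72))
city <- rawToChar(city_bytes)
Encoding(city)  # "unknown": R treats the bytes as native encoding

# Hypothetical helper: mark character vectors as UTF-8, pass anything else through.
mark_utf8 <- function(x) {
  if (is.character(x)) Encoding(x) <- "UTF-8"
  x
}

# Toy stand-in for one table of the query result.
toy <- data.frame(addr.city = c(city, NA), id = 1:2, stringsAsFactors = FALSE)
for (col in names(toy)) toy[[col]] <- mark_utf8(toy[[col]])

Encoding(toy$addr.city)[1]                          # "UTF-8"
identical(charToRaw(toy$addr.city[1]), city_bytes)  # TRUE: bytes are untouched
```

So `Encoding<-` only declares how the existing bytes should be read; it does not convert anything. That is presumably why it helps here, but it would be the wrong tool if the bytes were not UTF-8 in the first place.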
Till
  • Try with `UTF-8` as follows: `Sys.setlocale(category = "LC_CTYPE", locale="German_Germany.65001")` (and maybe switch to `UTF-8` in all locale categories). – JosefZ Nov 23 '20 at 09:50
  • Thanks for the hint. Your first suggestion produces the warning "OS reports request to set the locale ... cannot be honoured". I tried some other locale settings, but so far I have not been successful. – Till Nov 23 '20 at 10:28
  • To better understand the problem, it would be interesting to know whether other people (different countries, different locale settings) have the same encoding problem when using my code from above. – Till Nov 23 '20 at 11:15
  • I can confirm the same encoding problem with `LC_CTYPE=Czech_Czechia.1250`; I see `"NeumĂĽnster"` due to [mojibake](https://en.wikipedia.org/wiki/Mojibake) (`CP1252` => `CP1250`). See also [UTF-8 support in R on Windows](https://stackoverflow.com/a/62732990/3439404); I'm confused forever… – JosefZ Nov 23 '20 at 13:44
  • Take a look at [Windows/UTF-8 Build of R and CRAN Packages](https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/). – JosefZ Nov 23 '20 at 15:06
  • I can confirm that this is likely a platform issue - on Linux I get the expected results `[1] "Neumünster" "Neumünster" "Neumünster" "Neumünster" NA "Neumünster"`. When I was using Windows and running into similar issues - see https://stackoverflow.com/questions/46946483/czech-encoding-in-r - I resolved them by setting a sort of Frankenstein locale, `Sys.setlocale(category = 'LC_ALL','English_United States.1250')`; it might work if you replace 1250 (Czech encoding) with 65001 (ümlaut friendly). – Jindra Lacko Nov 25 '20 at 11:42
  • Thanks @JindraLacko, but this also won't work, unfortunately. – Till Nov 26 '20 at 09:50
  • @Till sorry to hear that... I don't have R on a Windows machine on hand, so I can't help you further (except in extending sympathy, platform issues are a nightmare and the English speaking folks with ASCII only are having it too easy) – Jindra Lacko Nov 26 '20 at 11:04
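
The mojibake reported in the comments can be reproduced without osmdata. The sketch below assumes that the garbling comes from UTF-8 bytes being read as a single-byte Windows code page, and that the local `iconv` build knows the names `CP1250` and `CP1252`; it is an illustration of the mechanism, not a diagnosis of the package.

```r
# "Neumünster" as raw UTF-8 bytes (ü is the two bytes c3 bc).
utf8_bytes <- as.raw(c(0x4e, 0x65, 0x75, 0x6d, 0xc3, 0xbc,
                       0x6e, 0x73, 0x74, 0x65, 0x72))
unmarked <- rawToChar(utf8_bytes)

# Misread the same bytes as single-byte code pages: each byte of "ü"
# becomes a separate character.
garbled_1250 <- iconv(unmarked, from = "CP1250", to = "UTF-8")  # "NeumĂĽnster"
garbled_1252 <- iconv(unmarked, from = "CP1252", to = "UTF-8")  # "NeumÃ¼nster"

# Reading the bytes as what they are gives the intended name.
correct <- iconv(unmarked, from = "UTF-8", to = "UTF-8")
nchar(correct)       # 10 characters
nchar(garbled_1250)  # 11 characters: the "ü" was split in two
```

Under this assumption, the Czech locale result `"NeumĂĽnster"` is what UTF-8 bytes look like when interpreted as CP1250.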
