5

I'm using dbplyr to get data from SQL-Server into R, but Chinese, Japanese and other non-Latin characters are appearing as "?". I'm using a windows machine.

I've read through the following threads:

These provide some useful ideas, but nothing has worked so far. I have tried:

  • Setting encoding = 'UTF-8' within the dbConnect function. Characters still show as question-marks.

  • Setting encoding = 'UTF-16' within the dbConnect function. R returns an error: # Error in iconv(x[current], from = enc, to = to, ...)

  • Changing the global character encoding to UTF-8 with: Sys.setenv(LANG = "UTF-8") and options(encoding = "UTF-8")

  • Checking if the characters display when plotting (which would indicate that they are being stored correctly). This wasn't the case.

I was able to get the characters to display correctly by using RJDBC, however this is not compatible with dbplyr, according to this GitHub issue.

Here is my session info:

> sessionInfo()
# R version 3.5.0 (2018-04-23)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)

# Matrix products: default

# locale:
# [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252

My code looks like this:

> con <- dbConnect(odbc(),
                   Driver   = "SQL Server",
                   Server   = "server name",
                   Database = "database name",
                   user     = "my username",
                   password = "my password",
                   encoding = "UTF-8")

odbc/dbplyr sure handles these character types on Windows, so what am I missing here?

Any help would be much appreciated!

MaxL
  • 101
  • 1
  • 4
  • 1
    How are you passing these characters to SQL Server? SQL Server 2017 and prior don't support UTF-8 either. – Thom A Nov 19 '19 at 16:40
  • Are you sure the encoding is UTF-8? Windows typically uses Latin-1 encoding. – MrFlick Nov 19 '19 at 16:44
  • @larnu The characters appear properly when I query the database directly in SSMS (i'm not sure if i can fully answer your question as I'll have to ask around in my team!) – MaxL Nov 19 '19 at 17:15
  • @MrFlick This has been a confusing point for me. Some places suggest that the default windows encoding is Latin1, others say GB 1252, or similar. I'm new to character encodings, so I probably have some misconceptions . From the threads I've read it sounds like UTF-8 is what I need to see Chinese/Japanese characters properly – MaxL Nov 19 '19 at 17:15
  • Certainly UTF-8 is the wrong choice. 2019 was the first version to support UTF-8. – Thom A Nov 19 '19 at 17:56
  • I suggest to load sample of data from SQL to csv file (check encoding of csv file and if data is not corrupted with ? symbols). If everything is good, try to load data in R from this file. – gofr1 Nov 19 '19 at 18:53
  • @gofr1 this would be a good option although in the long term I'll really need to pull data from the sql server directly – MaxL Nov 21 '19 at 17:30
  • @larnu I'm told the SQL server uses 'Latin1_General_CI_AS' – MaxL Nov 21 '19 at 17:33
  • @MaxL Just check it. – gofr1 Nov 21 '19 at 19:10

1 Answers1

0

Check the list of encodings available with iconvlist(). I used encoding = "windows-1252" to be able to work correctly with Nordic characters using ODBC version 1.2.2.

Although I have not used Chinese or Japanese characters, the encoding values "GB18030", "gb2312" and "GBK" could be used for Chinese Guobiao for example.

wikipedia has a helpful page (scroll to the bottom for the list).

Zaki
  • 131
  • 3