Non-Latin characters show as question marks when using rodbc/odbc/dbplyr with SQL-Server

Question

I'm using dbplyr to get data from SQL-Server into R, but Chinese, Japanese and other non-Latin characters are appearing as "?". I'm using a windows machine.

I've read through the following threads:

These provide some useful ideas, but nothing has worked so far. I have tried:

Setting encoding = 'UTF-8' within the dbConnect function. Characters still show as question-marks.
Setting encoding = 'UTF-16' within the dbConnect function. R returns an error: # Error in iconv(x[current], from = enc, to = to, ...)
Changing the global character encoding to UTF-8 with: Sys.setenv(LANG = "UTF-8") and options(encoding = "UTF-8")
Checking if the characters display when plotting (which would indicate that they are being stored correctly). This wasn't the case.

I was able to get the characters to display correctly by using RJDBC, however this is not compatible with dbplyr, according to this GitHub issue.

Here is my session info:

> sessionInfo()
# R version 3.5.0 (2018-04-23)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)

# Matrix products: default

# locale:
# [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252

My code looks like this:

> con <- dbConnect(odbc(),
                   Driver   = "SQL Server",
                   Server   = "server name",
                   Database = "database name",
                   user     = "my username",
                   password = "my password",
                   encoding = "UTF-8")

odbc/dbplyr sure handles these character types on Windows, so what am I missing here?

Any help would be much appreciated!

How are you passing these characters to SQL Server? SQL Server 2017 and prior don't support UTF-8 either. — Thom A, Nov 19 '19 at 16:40
Are you sure the encoding is UTF-8? Windows typically uses Latin-1 encoding. — MrFlick, Nov 19 '19 at 16:44
@larnu The characters appear properly when I query the database directly in SSMS (i'm not sure if i can fully answer your question as I'll have to ask around in my team!) — MaxL, Nov 19 '19 at 17:15
@MrFlick This has been a confusing point for me. Some places suggest that the default windows encoding is Latin1, others say GB 1252, or similar. I'm new to character encodings, so I probably have some misconceptions . From the threads I've read it sounds like UTF-8 is what I need to see Chinese/Japanese characters properly — MaxL, Nov 19 '19 at 17:15
Certainly UTF-8 is the wrong choice. 2019 was the first version to support UTF-8. — Thom A, Nov 19 '19 at 17:56
I suggest to load sample of data from SQL to csv file (check encoding of csv file and if data is not corrupted with ? symbols). If everything is good, try to load data in R from this file. — gofr1, Nov 19 '19 at 18:53
@gofr1 this would be a good option although in the long term I'll really need to pull data from the sql server directly — MaxL, Nov 21 '19 at 17:30

score 0 · Answer 1 · answered Nov 20 '20 at 11:01

Check the list of encodings available with iconvlist(). I used encoding = "windows-1252" to be able to work correctly with Nordic characters using ODBC version 1.2.2.

Although I have not used Chinese or Japanese characters, the encoding values "GB18030", "gb2312" and "GBK" could be used for Chinese Guobiao for example.

wikipedia has a helpful page (scroll to the bottom for the list).

Non-Latin characters show as question marks when using rodbc/odbc/dbplyr with SQL-Server

1 Answers1

Linked