
When querying a table containing Arabic data from an Oracle database, R fails to decode the characters correctly.

The following is an example of what I am trying to do:

library(DBI)
library(dplyr)
#My local environment is set to Arabic
Sys.setlocale("LC_CTYPE","arabic")

# My connection string to database
db_conn <- dbConnect(odbc::odbc(), "my_db", encoding = "utf-8")

# Extracting data
my_table <- tbl(db_conn, "my_table")
result <-
      my_table %>%
      filter(ID == 100010456) %>%
      select(ID, ADDRESS_1) %>%
      collect()

This is the result I am getting:

[screenshot: the query result, with the ADDRESS_1 value rendered as replacement characters instead of Arabic text]

However this is the actual data in the database:

[screenshot: the same row in the database, with ADDRESS_1 showing the actual Arabic text]

Some further notes:

  • The NLS_CHARACTERSET of the database is AR8ISO8859P6
  • I have tried `dbConnect` with different encoding options (windows-1252, utf-8)
  • I have tried adding a charset option in `dbConnect`
  • I have tried using `dbGetQuery` to retrieve the results explicitly
Nareman Darwish
  • You need to make sure you are importing with the same encoding as stored in the database. You can see all the known encodings on your computer with `stringi::stri_enc_list()`. Given the value "AR8ISO8859P6", maybe `latin6` is a good match? Try `db_conn <- dbConnect(odbc::odbc(), "my_db", encoding = "latin6")`. – MrFlick May 01 '21 at 00:49
  • @MrFlick The encoding list contains 232 entries. I tried latin6 as you suggested, but I got ¿¿¿¿ as a result instead of the previous triangles. – Nareman Darwish May 01 '21 at 01:09
  • Well, it's hard to see what's going on without a reproducible example to work with. Perhaps you can ask your database administrator for advice. They will likely have experience with encodings from working with other tools. – MrFlick May 01 '21 at 01:11

1 Answer


Here are several different approaches to try. Please comment on whether they help.

1. Query the database as to what its encoding is

This question gives the following SQL snippet for checking the client encoding in a PostgreSQL database (it also has several answers that might help):

dbGetQuery(con, "SHOW CLIENT_ENCODING")
#   client_encoding
# 1            UTF8

If you are not using a PostgreSQL database there will likely be an equivalent command for the database you are using.
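Since your database is Oracle, a sketch of the equivalent check might look like the following (this assumes your account can read the `nls_database_parameters` view, and that `db_conn` is the connection from the question):

```r
library(DBI)

# Ask Oracle for the database character set and the national character set
# (db_conn is assumed to be an existing connection, as in the question)
dbGetQuery(db_conn, "
  SELECT parameter, value
  FROM   nls_database_parameters
  WHERE  parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET')
")
```

For your database this should report AR8ISO8859P6, matching the note in the question.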

Reading the documentation at `?odbc::dbConnect`, it appears `encoding` is the text encoding of your database (if it is not UTF-8), and that strings are always returned UTF-8 encoded.

Given your note that the character set of the database is AR8ISO8859P6 (Oracle's name for ISO 8859-6, the Latin/Arabic character set), I would guess "ISO-8859-6" is the term to put into the connection. E.g.:

db_conn <- dbConnect(odbc::odbc(), "my_db", encoding = "ISO-8859-6")

2. Test every available encoding

You mentioned in a comment there are 232 possible encodings. This link shows a function for testing how outputs differ with two different encodings.

If you cannot get the encoding from the database, then iterating through and testing every encoding might be the best option. Perhaps something like this:

testEncoding <- function(encoding){
  # Connect using the candidate encoding
  db_conn <- dbConnect(odbc::odbc(), "my_db", encoding = encoding)
  # Extract the test record
  result <- tbl(db_conn, "my_table") %>%
      filter(ID == 100010456) %>%
      select(ADDRESS_1) %>%
      collect()
  # Disconnect & return the single retrieved value
  dbDisconnect(db_conn)
  return(result[[1]][1])
}

# stri_enc_list() returns a list of alias vectors; use the first alias of each.
# Some encodings will error on connection, so wrap the test in tryCatch.
list_of_encodings <- stringi::stri_enc_list()
for(encoding in list_of_encodings){
  value <- tryCatch(testEncoding(encoding[1]), error = function(e) NA)
  print(paste(encoding[1], " | ", value))
}
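As a quicker offline check (no database connection needed), you can verify which encoding name turns ISO 8859-6 bytes into readable Arabic. The sample bytes below are hypothetical — they spell "مرحبا" under ISO 8859-6 — substitute raw bytes pulled from your own table if you can obtain them:

```r
library(stringi)

# Raw bytes of "مرحبا" ("hello") as they would be stored under ISO 8859-6
raw_bytes <- as.raw(c(0xE5, 0xD1, 0xCD, 0xC8, 0xC7))

# Decode the bytes with a candidate encoding; ICU accepts many aliases
decoded <- stri_encode(raw_bytes, from = "ISO-8859-6", to = "UTF-8")
decoded  # should print the Arabic string if the candidate encoding is right
```

If the decoded string displays correctly, that encoding name is the one to pass to `dbConnect`.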

3. Run a test writing out an intermediate text file

The database should allow you to export data as a csv or equivalent file. This file can then be read into R using standard approaches:

  1. Export subset of data containing Arabic characters to plain text file (most likely UTF-8).
  2. Examine contents of file to confirm Arabic characters appear.
  3. Import plain text file into R. Examine imported data frame.
  4. Export imported data frame from R back to plain text (UTF-8).
  5. Examine contents of second file to confirm Arabic characters appear.

This approach should give you some additional insight into where the communication between R and the database is failing. Do you require specific settings to export Arabic characters from the database? To import them into R? Does R read and write the Arabic characters correctly, but fail to display them?
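Steps 3–5 above can be sketched without the database by round-tripping a known Arabic string through a UTF-8 text file (the column name is a placeholder to mirror your table):

```r
# A known Arabic string to round-trip ("مرحبا")
sample_text <- "\u0645\u0631\u062D\u0628\u0627"
tmp <- tempfile(fileext = ".csv")

# Write the string out from R as a UTF-8 CSV
write.csv(data.frame(ADDRESS_1 = sample_text), tmp,
          fileEncoding = "UTF-8", row.names = FALSE)

# Read the file back in and compare to the original
df <- read.csv(tmp, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
identical(df$ADDRESS_1, sample_text)  # TRUE if R handles the characters correctly
```

If this round-trip succeeds but the database import does not, the problem lies in the R–database connection rather than in R's handling of Arabic text.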

Simon.S.A.
  • Thank you so much Simon for the complete reply. The encoding from the 1st option is not available on my machine, and the loop was a very smart idea, but no option on my machine was able to decode it correctly. I exported the data and re-imported it into R and it worked fine. I think there is an encoding issue on my machine, so I will try to fix that. – Nareman Darwish May 03 '21 at 10:53
  • Glad you found it helpful. Once you get it working, consider posting an answer to your/this question. I'd be keen to learn from it, and it will help others with the same challenge. – Simon.S.A. May 03 '21 at 21:00