6

I am having some trouble importing the following file: http://www.kuleuven.be/bio/ento/temp/test.xlsx into R in the correct encoding. In particular,

library("xlsx")
read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8")

gives me

                                             X1
1                                     a-cadinol
2                                  a-calacorene
3                       a-caryophyllene alcohol
4                                   a-curcumene
5                                      a-elemol
6                                   a-muurolene
7                           a-terpineol acetate
8  ß-4-dimethyl-3-cyclohexane-1-ethanol acetate
9                                  ß-bisabolene
10                                  ß-bisabolol
11                                 ß-bourbonene
12                      ß-caryophyllene alcohol
13                                ß-cyclocitral
14                                   ß-farnesol
15                                   ß-selinene
16                         ß-sesquiphellandrene
17                            <U+03B3>-cadinene
18  <U+03B3>-Carboethoxy-<U+03B3>-butyrolactone
19        <U+03B3>-ethyl-<U+03B3>-butyrolactone
20                            <U+03B3>-eudesmol
21                           <U+03B3>-muurolene
22                         <U+03B3>-nonalactone
23                         <U+03B3>-octalactone
24                            <U+03B3>-selinene
25                       <U+03B3>-undecalactone
26                                   d-cadinene
27                                    d-cadinol
28                                  d-muurolene
29                              d-undecalactone

but the a-, <U+03B3>- and d- should be alpha-, gamma- and delta-

Any thoughts on how I could import my file in the correct encoding?

I am working on Windows, and iconvlist() gives me

  [1] "437"                     "850"                     "852"                     "855"                     "857"                    
  [6] "860"                     "861"                     "862"                     "863"                     "865"                    
 [11] "866"                     "869"                     "ANSI_X3.4-1968"          "ANSI_X3.4-1986"          "ASCII"                  
 [16] "ASMO-708"                "BIG-5"                   "BIG-FIVE"                "big5"                    "BIG5"                   
 [21] "big5-hkscs"              "BIG5-HKSCS"              "big5hkscs"               "BIG5HKSCS"               "CP-GR"                  
 [26] "CP-IS"                   "cp1025"                  "CP1125"                  "CP1133"                  "CP1200"                 
 [31] "CP12000"                 "CP12001"                 "CP1201"                  "CP1250"                  "CP1251"                 
 [36] "CP1252"                  "CP1253"                  "CP1254"                  "CP1255"                  "CP1256"                 
 [41] "CP1257"                  "CP1258"                  "CP1361"                  "CP154"                   "CP367"                  
 [46] "CP437"                   "CP50221"                 "CP51932"                 "CP65001"                 "CP737"                  
 [51] "CP775"                   "CP819"                   "CP850"                   "CP852"                   "CP853"                  
 [56] "CP855"                   "CP857"                   "CP858"                   "CP860"                   "CP861"                  
 [61] "CP862"                   "CP863"                   "CP864"                   "CP865"                   "cp866"                  
 [66] "CP866"                   "CP869"                   "CP874"                   "cp875"                   "CP932"                  
 [71] "CP936"                   "CP949"                   "CP950"                   "CSASCII"                 "CSIBM855"               
 [76] "CSIBM857"                "CSIBM860"                "CSIBM861"                "CSIBM863"                "CSIBM864"               
 [81] "CSIBM865"                "CSIBM866"                "CSIBM869"                "csISO2022JP"             "CSISOLATIN1"            
 [86] "CSPC775BALTIC"           "CSPC850MULTILINGUAL"     "CSPC862LATINHEBREW"      "CSPC8CODEPAGE437"        "CSPCP852"               
 [91] "CSPTCP154"               "CSWINDOWS31J"            "CYRILLIC-ASIAN"          "DOS-720"                 "DOS-862"                
 [96] "EUC-CN"                  "euc-jp"                  "euc-kr"                  "EUC-KR"                  "EUCCN"                  
[101] "eucjp"                   "euckr"                   "GB18030"                 "gb2312"                  "GBK"                    
[106] "hz-gb-2312"              "IBM-CP1133"              "IBM-Thai"                "IBM00858"                "IBM00924"               
[111] "IBM01047"                "IBM01140"                "IBM01141"                "IBM01142"                "IBM01143"               
[116] "IBM01144"                "IBM01145"                "IBM01146"                "IBM01147"                "IBM01148"               
[121] "IBM01149"                "IBM037"                  "IBM1026"                 "IBM273"                  "IBM277"                 
[126] "IBM278"                  "IBM280"                  "IBM284"                  "IBM285"                  "IBM290"                 
[131] "IBM297"                  "IBM367"                  "IBM420"                  "IBM423"                  "IBM424"                 
[136] "IBM437"                  "IBM437"                  "IBM500"                  "ibm737"                  "ibm775"                 
[141] "IBM775"                  "IBM819"                  "ibm850"                  "IBM850"                  "ibm852"                 
[146] "IBM852"                  "IBM855"                  "IBM855"                  "ibm857"                  "IBM857"                 
[151] "IBM860"                  "IBM860"                  "ibm861"                  "IBM861"                  "IBM862"                 
[156] "IBM863"                  "IBM863"                  "IBM864"                  "IBM864"                  "IBM865"                 
[161] "IBM865"                  "IBM866"                  "ibm869"                  "IBM869"                  "IBM870"                 
[166] "IBM871"                  "IBM880"                  "IBM905"                  "iso-2022-jp"             "iso-2022-jp"            
[171] "ISO-2022-JP"             "ISO-2022-JP-MS"          "iso-2022-kr"             "ISO-8859-1"              "iso-8859-13"            
[176] "iso-8859-15"             "iso-8859-2"              "iso-8859-3"              "iso-8859-4"              "iso-8859-5"             
[181] "iso-8859-6"              "iso-8859-7"              "iso-8859-8"              "iso-8859-8-i"            "iso-8859-9"             
[186] "ISO-IR-100"              "ISO-IR-6"                "ISO_646.IRV:1991"        "ISO_8859-1"              "ISO_8859-1:1987"        
[191] "ISO2022-JP"              "ISO2022-JP-MS"           "iso2022-kr"              "ISO646-US"               "iso8859-1"              
[196] "ISO8859-1"               "iso8859-13"              "iso8859-15"              "iso8859-2"               "iso8859-3"              
[201] "iso8859-4"               "iso8859-5"               "iso8859-6"               "iso8859-7"               "iso8859-8"              
[206] "iso8859-8-i"             "iso8859-9"               "Johab"                   "JOHAB"                   "koi8-r"                 
[211] "koi8-u"                  "ks_c_5601-1987"          "L1"                      "latin-9"                 "LATIN1"                 
[216] "latin2"                  "latin3"                  "latin4"                  "latin5"                  "latin7"                 
[221] "latin9"                  "mac"                     "mac-centraleurope"       "mac-is"                  "macarabic"              
[226] "maccentraleurope"        "maccroatian"             "maccyrillic"             "macgreek"                "machebrew"              
[231] "maciceland"              "macintosh"               "macis"                   "macroman"                "macromania"             
[236] "macthai"                 "macturkish"              "macukraine"              "macukrainian"            "MS-ANSI"                
[241] "MS-ARAB"                 "MS-CYRL"                 "MS-EE"                   "MS-GREEK"                "MS-HEBR"                
[246] "MS-TURK"                 "MS50221"                 "MS51932"                 "MS932"                   "MS936"                  
[251] "PT154"                   "PTCP154"                 "SHIFFT_JIS"              "SHIFFT_JIS-MS"           "shift-jis"              
[256] "shift_jis"               "SJIS"                    "SJIS-MS"                 "SJIS-OPEN"               "SJIS-WIN"               
[261] "UCS-2"                   "UCS-2BE"                 "UCS-2LE"                 "UCS-4"                   "UCS-4BE"                
[266] "UCS-4BE"                 "UCS-4LE"                 "UCS-4LE"                 "UCS2"                    "UCS2BE"                 
[271] "UCS2LE"                  "UCS4"                    "UCS4BE"                  "UCS4LE"                  "UHC"                    
[276] "unicodeFFFE"             "US"                      "US-ASCII"                "UTF-16"                  "UTF-16BE"               
[281] "UTF-16LE"                "UTF-32"                  "UTF-32BE"                "UTF-32LE"                "UTF-8"                  
[286] "UTF16"                   "UTF16BE"                 "UTF16LE"                 "UTF32"                   "UTF32BE"                
[291] "UTF32LE"                 "UTF8"                    "WINBALTRIM"              "windows-1250"            "windows-1251"           
[296] "windows-1252"            "windows-1253"            "windows-1254"            "windows-1255"            "windows-1256"           
[301] "windows-1257"            "windows-1258"            "WINDOWS-31J"             "WINDOWS-50221"           "WINDOWS-51932"          
[306] "windows-874"             "WINDOWS-932"             "WINDOWS-936"             "x-Chinese_CNS"           "x-cp20001"              
[311] "x-cp20003"               "x-cp20004"               "x-cp20005"               "x-cp20261"               "x-cp20269"              
[316] "x-cp20936"               "x-cp20949"               "x-cp50227"               "x-EBCDIC-KoreanExtended" "x-Europa"               
[321] "x-IA5"                   "x-IA5-German"            "x-IA5-Norwegian"         "x-IA5-Swedish"           "x-iscii-as"             
[326] "x-iscii-be"              "x-iscii-de"              "x-iscii-gu"              "x-iscii-ka"              "x-iscii-ma"             
[331] "x-iscii-or"              "x-iscii-pa"              "x-iscii-ta"              "x-iscii-te"              "x-mac-arabic"           
[336] "x-mac-ce"                "x-mac-chinesesimp"       "x-mac-chinesetrad"       "x-mac-croatian"          "x-mac-cyrillic"         
[341] "x-mac-greek"             "x-mac-hebrew"            "x-mac-icelandic"         "x-mac-japanese"          "x-mac-korean"           
[346] "x-mac-romanian"          "x-mac-thai"              "x-mac-turkish"           "x-mac-ukrainian"         "x_Chinese-Eten"   

I tried with many of these, to no avail... Unfortunately, I also don't know what encoding Excel saved my file in...

Also, is there any easy function in R that would allow me to convert all Greek alpha, beta, gamma and delta's (in original encoding) to "alpha", "beta", "gamma" and "delta" (ie written out in full)? Or to do the reverse, ie convert "alpha", "beta", "gamma" etc written out in full to single Greek characters?

EDIT: regarding my last question I tried

togreek=function(compname) {
  n=as.character(compname,encoding="UTF-8")
  n=gsub("alpha","\u03B1",n)
  n=gsub("beta","\u03B2",n)
  n=gsub("gamma","\u03B3",n)
  n=gsub("delta","\u03B4",n)
  n=gsub("epsilon","\u03B5",n)
  n
}

tolatin=function(compname) {
  n=as.character(compname,encoding="UTF-8")
  n=gsub("\u03B1","alpha",n)
  n=gsub("\u03B2","beta",n)
  n=gsub("\u03B3","gamma",n)
  n=gsub("\u03B4","delta",n)
  n=gsub("\u03B5","epsilon",n)
  n
}

The function tolatin seems to work:

library("xlsx")
test=read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8")
tolatin(test$X1)
 [1] "alpha-cadinol"                                   "alpha-calacorene"                                "alpha-caryophyllene alcohol"                    
 [4] "alpha-curcumene"                                 "alpha-elemol"                                    "alpha-muurolene"                                
 [7] "alpha-terpineol acetate"                         "beta-4-dimethyl-3-cyclohexane-1-ethanol acetate" "beta-bisabolene"                                
[10] "beta-bisabolol"                                  "beta-bourbonene"                                 "beta-caryophyllene alcohol"                     
[13] "beta-cyclocitral"                                "beta-farnesol"                                   "beta-selinene"                                  
[16] "beta-sesquiphellandrene"                         "gamma-cadinene"                                  "gamma-Carboethoxy-gamma-butyrolactone"          
[19] "gamma-ethyl-gamma-butyrolactone"                 "gamma-eudesmol"                                  "gamma-muurolene"                                
[22] "gamma-nonalactone"                               "gamma-octalactone"                               "gamma-selinene"                                 
[25] "gamma-undecalactone"                             "delta-cadinene"                                  "delta-cadinol"                                  
[28] "delta-muurolene"                                 "delta-undecalactone"  

But if I then convert back to Greek characters I again run into problems:

togreek(tolatin(test$X1))

 [1] "α-cadinol"                                   "α-calacorene"                                "α-caryophyllene alcohol"                    
 [4] "α-curcumene"                                 "α-elemol"                                    "α-muurolene"                                
 [7] "α-terpineol acetate"                         "ß-4-dimethyl-3-cyclohexane-1-ethanol acetate" "ß-bisabolene"                                
[10] "ß-bisabolol"                                  "ß-bourbonene"                                 "ß-caryophyllene alcohol"                     
[13] "ß-cyclocitral"                                "ß-farnesol"                                   "ß-selinene"                                  
[16] "ß-sesquiphellandrene"                         "<U+03B3>-cadinene"                            "<U+03B3>-Carboethoxy-<U+03B3>-butyrolactone" 
[19] "<U+03B3>-ethyl-<U+03B3>-butyrolactone"        "<U+03B3>-eudesmol"                            "<U+03B3>-muurolene"                          
[22] "<U+03B3>-nonalactone"                         "<U+03B3>-octalactone"                         "<U+03B3>-selinene"                           
[25] "<U+03B3>-undecalactone"                       "d-cadinene"                                   "d-cadinol"                                   
[28] "d-muurolene"                                  "d-undecalactone"  

Any thoughts what I am doing wrong?

Tom Wenseleers
  • 7,535
  • 7
  • 63
  • 103
  • 1
    Running those commands you provided works out perfectly fine for me, those Greek character display fine. My locale is `Japanese_Japan.932`, so I expect that `encoding="UTF-8"` is correct. This might have something to do with your environment settings. What does your `sessionInfo()` say? – buruzaemon Jan 04 '14 at 02:42
  • Ha it says locale: [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 [4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252 What should I change it to then? – Tom Wenseleers Jan 04 '14 at 10:52
  • Or could it be some problem with R studio perhaps? – Tom Wenseleers Jan 04 '14 at 11:18
  • 1
    It seems there is some problem with the way the output is displayed. If I type test=read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8") test$X1 it shows OK. So maybe I should change my locale. If I try Sys.setlocale("LC_ALL", 'en_US.UTF-8') I get the error Warning message: In Sys.setlocale("LC_ALL", "en_US.UTF-8") : OS reports request to set locale to "en_US.UTF-8" cannot be honored though.... Others have reported the error too, http://stackoverflow.com/questions/5881731/strange-characters-interaction-of-r-and-windows-locale – Tom Wenseleers Jan 04 '14 at 11:25

2 Answers2

5

Try this:
Sys.setlocale(category = "LC_ALL", locale = "Greek")

dimitris_ps
  • 5,849
  • 3
  • 29
  • 55
0

Try this:

library(stringi)

?stringi

Apply this to see your file codification

stri_enc_detect("path-to-your-file/your-file.csv", filter_angle_brackets = T)

stri_enc_detect2("path-to-your-file/your-file.csv", locale = NULL)

The first one works for me.

Then apply the result in the read.csv(), as the example bellow:

    df <- read.csv("path-to-your-file/your-file.csv",header = TRUE, sep = ";", 
quote = "\"", na.strings = "", dec = ".", skip = 2, check.names = T, 
fileEncoding = "YOUR RESULT OF stri_enc_detect", encoding = "UTF-8")