I have a .csv file that displays correctly on a webpage, but when I read it into R, some of the data is not shown correctly. The data is available here: home.ustc.edu.cn/~lanrr/data.csv

mydata = read.csv("http://home.ustc.edu.cn/~lanrr/data.csv", header = TRUE)
View(mydata)  # show something like this:
# 9:39:37   665 600160  �޻��ɷ�  ����    ����    8.050   100 805.00  ��ȯ �ɽ�        
  ��ȯ����   E004017669  665
  2 9:39:38 697 930 ��������    ����    ����    4.360   283 1233.88    
  ����  �ɽ� ����Ʒ����   680001369   697

The data contains some Chinese words, but I don't know whether I need to change the encoding or do something else. Has anyone met this problem before?

# Second attempt, adding encoding = "UTF-8":
mydata = read.csv("http://home.ustc.edu.cn/~lanrr/data.csv", 
                   encoding = "UTF-8", header = TRUE, stringsAsFactors = FALSE)
View(mydata)
# 9:39:37   665 600160  <U+00BE><U+07BB><U+00AF><U+00B9><U+0277><dd>    <c2><f4>  
  <U+00B3><f6>  <c2><f2><c2><f4>    8.050   100 805.00  <c8><da><U+022F>     
  <U+00B3><U+027D><U+00BB>  <c8><da><U+022F><c2><f4><U+00B3><f6>    E004017669  665
  2 9:39:38 697 930 <d6><d0><U+0078><c9><fa><U+00BB><U+00AF>    <c2><f4>
  <U+00B3><f6>  <c2><f2><c2><f4>    4.360   283 1233.88 <d0><c5><d3><c3>    
  <U+00B3><U+027D><U+00BB>  <U+00B5><U+00A3><U+00B1><U+00A3><U+01B7><c2><f4><U+00B3> 
  <f6>  680001369   697

sessionInfo()
# R version 2.15.2 (2012-10-26)
  Platform: x86_64-redhat-linux-gnu (64-bit)

  locale:
   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8              
   LC_COLLATE=en_US.UTF-8    
   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                   
   LC_NAME=C                 
   [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8  
   LC_IDENTIFICATION=C       

   attached base packages:
   [1] compiler  stats     graphics  grDevices utils     datasets  methods   base     

   other attached packages:
   [1] data.table_1.8.8 TTR_0.22-0       xts_0.9-3        zoo_1.7-9           
   timeDate_2160.97 Matrix_1.0-9     lattice_0.20-10 

   loaded via a namespace (and not attached):
   [1] grid_2.15.2  tools_2.15.2

In the end, I solved it this way:

Sys.setlocale("LC_COLLATE", "Chinese")
Sys.setlocale("LC_CTYPE", "Chinese")
Sys.setlocale("LC_MONETARY", "Chinese")
Sys.setlocale("LC_TIME", "Chinese")
Sys.setlocale("LC_MESSAGES", "Chinese")
Sys.setlocale("LC_MEASUREMENT", "Chinese")
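Locale names are platform-dependent, so a name that works on one system may be rejected on another. A minimal sketch, assuming a Linux-style locale name `zh_CN.UTF-8` (an assumption, not taken from the post), that changes only `LC_CTYPE`, checks whether the request was honoured, and restores the previous setting afterwards:

```r
# Remember the current character-type locale so it can be restored later.
old_ctype <- Sys.getlocale("LC_CTYPE")

# "zh_CN.UTF-8" is an assumed Linux-style name; other platforms use other
# names. Sys.setlocale() returns "" (with a warning) when the requested
# locale cannot be set.
new_ctype <- suppressWarnings(Sys.setlocale("LC_CTYPE", "zh_CN.UTF-8"))
locale_changed <- nzchar(new_ctype)

if (!locale_changed) {
  message("Requested locale is not available; LC_CTYPE left unchanged.")
}

# ... read and display the data here ...

# Restore the original setting.
restored <- Sys.setlocale("LC_CTYPE", old_ctype)
```

Changing a single category and restoring it keeps the side effects on the rest of the session small.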
PepsiCo

3 Answers

You could utilize read.csv with encoding UTF-8:

df <- read.csv("data.csv", encoding = "UTF-8", stringsAsFactors = FALSE)

Setting stringsAsFactors = FALSE keeps the Chinese text as character strings rather than factors.

Note: I don't have the Chinese language pack installed in my environment, so I cannot determine whether the garbled characters in the .csv you provided are corrupted or simply unrecognized.

KLDavenport
  • I tried adding `encoding = "UTF-8"` (shown above). The data is displayed, but not the Chinese words. Does that mean I need to install a Chinese language pack? – PepsiCo May 22 '13 at 01:43
  • @PepsiCo are you able to see the Chinese characters in Excel or anywhere else in your operating system? – KLDavenport May 22 '13 at 03:21
  • Yes, I can view the data when I open the file with Excel. Could you view the file at `http://home.ustc.edu.cn/~lanrr/data.csv`? – PepsiCo May 22 '13 at 04:24
  • I can see the file at that URL, but I do not see the Chinese characters. Do you see the Chinese characters? – KLDavenport May 22 '13 at 05:31
  • I tried the function `Sys.setlocale` and it seems to help. – PepsiCo May 22 '13 at 06:22
  • I'm guessing you used something like `Sys.setlocale(category = "LC_ALL", locale = "")`? I'm glad you figured it out; sorry I couldn't get you a quicker solution. – KLDavenport May 24 '13 at 21:45

First, that CSV file is encoded in GBK, not UTF-8, so the code should be:

mydata <- read.csv("http://home.ustc.edu.cn/~lanrr/data.csv", 
                    encoding = "GBK", 
                    header = TRUE, 
                    stringsAsFactors = FALSE)

Second, if your environment's locale is not Chinese (Simplified), you should set it, for example (my OS is Windows 7):

Sys.setlocale(category = "LC_ALL", locale = "Chinese (Simplified)")

and then show the table with:

fix(mydata)
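In `read.csv`, the `encoding` argument only marks strings as Latin-1 or UTF-8 and does not convert anything, whereas `fileEncoding` re-encodes the input while it is read. A minimal sketch of the second route, assuming the file really is GBK as stated above and that the session runs in a UTF-8 locale (as in the question's `sessionInfo()`); the temporary file and its contents are hypothetical stand-ins for data.csv:

```r
# Hypothetical sample standing in for data.csv: one row whose name column
# holds two Chinese characters ("\u4e2d\u6587"), written to disk as GBK bytes.
utf8_text <- "code,name\n600160,\u4e2d\u6587\n"
gbk_bytes <- iconv(utf8_text, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
tmp <- tempfile(fileext = ".csv")
writeBin(gbk_bytes, tmp)

# fileEncoding converts the GBK input to the session's native encoding.
mydata <- read.csv(tmp, fileEncoding = "GBK", stringsAsFactors = FALSE)

# Compare as UTF-8 so the check does not depend on how the string is marked.
name_ok <- identical(enc2utf8(mydata$name), enc2utf8("\u4e2d\u6587"))
print(name_ok)

unlink(tmp)
```

The same `fileEncoding = "GBK"` argument can be passed when reading directly from the URL; with it, changing the session locale may not be needed on a system that already uses UTF-8.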
Milton Wong

  1. Download the file:

    wget -O weirdo.csv http://home.ustc.edu.cn/~lanrr/data.csv
    
  2. In bash you can retrieve the file encoding with:

    $ file -i ./weirdo.csv
    
  3. Tell R how the file is encoded by passing the value reported after charset= (for example, charset=iso-8859-1) as fileEncoding:

    read.csv("weirdo.csv", fileEncoding = "iso-8859-1")
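When the `file` utility is not at hand, a rough check can be done from within R. A minimal sketch, using base R's `validUTF8()` (available since R 3.3.0) on a hypothetical temporary file written as GBK; it can only rule UTF-8 out, it does not identify the actual encoding:

```r
# Hypothetical sample file written as GBK bytes, standing in for weirdo.csv.
gbk_bytes <- iconv("name\n\u4e2d\u6587\n", from = "UTF-8", to = "GBK",
                   toRaw = TRUE)[[1]]
tmp <- tempfile(fileext = ".csv")
writeBin(gbk_bytes, tmp)

# readLines() returns the bytes unconverted; validUTF8() then reports,
# line by line, whether those bytes form valid UTF-8.
raw_lines <- readLines(tmp, warn = FALSE)
is_utf8 <- validUTF8(raw_lines)
print(is_utf8)

# Converting from the correct source encoding yields valid UTF-8 strings.
converted <- iconv(raw_lines, from = "GBK", to = "UTF-8")
print(validUTF8(converted))

unlink(tmp)
```

A line that fails the check is a hint that `encoding = "UTF-8"` is the wrong choice and that a `fileEncoding` value should be looked for instead.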
    
Mat D.