Please can someone suggest the best way to import data with Vietnamese characters into an R data frame so that the characters display correctly? The data I need to import are a longer version of the column below:

Student_name

PHẠM THANH

PHẠM VĂN

NGUYỄN TUẤN

NGUYỄN VĂN

VŨ NGỌC

I tried many options, including saving the data as a Unicode .txt file and importing it into R with encoding = "UTF-8" specified.

With read.csv or read.table, I get the following message:

In read.table("Stu.txt", header = TRUE, encoding = "UTF-8") : line 1 appears to contain embedded nulls

If I save the data as an MS Excel file and import it with read.xlsx (package xlsx), the data are read, but without specifying an encoding I get garbled output, as shown:

 Student_name

1 PHẠM THANH

2 PHẠM VĂN

3 NGUYỄN TUẤN

4 NGUYỄN VĂN

5 NGUYỄN VĂN

6 VŨ NGỌC

With read.xlsx and encoding = "UTF-8", the text is decoded as UTF-8, but the accented characters are not rendered: the output shows their code points enclosed in less-than and greater-than signs, e.g. PH<U+1EA0>M THANH, and so on.

I am running R through RStudio, version 0.99.467, on Windows 7.

Thank you.

Suhas
  • So you are saying that your .xlsx file is XML, which is usually implicitly UTF-8 encoded without a BOM, and that you looked inside the .xlsx and know what XML charset is specified? Are you saying your problem is an R problem, or an Excel problem? You should say exactly what is in the .xlsx file and what leads you to believe you have an R problem. – Warren P Dec 12 '15 at 15:22
  • The .xlsx file is a default .xlsx. I also saved the column separately as a Unicode .txt file and as a .csv file. I can open this file in Notepad or MS Excel with the characters displayed correctly. I can even copy and paste the column into my RStudio script window and read it with textConnection, but the problem persists, so it is surely a problem with the commands I need to give R to decode the characters properly. – Suhas Dec 12 '15 at 16:41
  • Do you understand what I said? Do you know how to examine a file and determine its actual encoding? – Warren P Dec 12 '15 at 16:47
  • Please see the following: http://stackoverflow.com/questions/22876746/how-to-read-data-in-utf-8-format-in-r – Raad Dec 12 '15 at 17:03
  • I am afraid I did not understand what you said, or your question. Perhaps the example below will explain: library(data.table) toread <- "PHẠM THANH PHẠM VĂN NGUYỄN TUẤN NGUYỄN VĂN NGUYỄN VĂN VŨ NGỌC" – Suhas Dec 12 '15 at 17:14
  • Thanks, I did read question 22876746 on Chinese characters, but it did not help. I don't understand @WarrenP's question about the file either, because if I try to read the data inside the script, with DT <- read.table(textConnection(toread), header = FALSE, encoding = "UTF-8"), the name PHẠM reads as PHM. – Suhas Dec 12 '15 at 17:25

1 Answer

I used the stri_trans_general function from the stringi package:

library(dplyr)
library(stringi)
data <- read.table("Stu.txt", header = TRUE, encoding = "UTF-8") %>%
         mutate(Student_name = stri_trans_general(Student_name, "Latin-ASCII"))
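
Note that the "Latin-ASCII" transform removes the diacritics rather than displaying them, so PHẠM becomes PHAM. The sketch below illustrates this on a few of the names from the question. It also shows one possible way to keep the diacritics, under the assumption (not confirmed in the question) that the "Unicode" .txt export is UTF-16LE and tab-separated, which would be consistent with the "embedded nulls" message. The temp file and the names_vi, ascii_names and df variables are made up for illustration.

```r
library(stringi)

# Names written with \u escapes so the script does not depend on its own encoding.
names_vi <- c("PH\u1EA0M THANH", "NGUY\u1EC4N TU\u1EA4N", "V\u0168 NG\u1ECCC")

# "Latin-ASCII" transliterates to plain ASCII, i.e. it drops the diacritics.
ascii_names <- stri_trans_general(names_vi, "Latin-ASCII")
print(ascii_names)  # "PHAM THANH", "NGUYEN TUAN", "VU NGOC"

# Assumption: the "Unicode" text export is UTF-16LE and tab-separated.
# Write a small stand-in file in that encoding (hypothetical temp file).
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "w", encoding = "UTF-16LE")
writeLines(c("Student_name", names_vi), con)
close(con)

# fileEncoding re-encodes the file while reading it;
# encoding = "UTF-8" only marks strings and does not convert the input.
df <- read.table(tmp, header = TRUE, sep = "\t",
                 fileEncoding = "UTF-16LE", stringsAsFactors = FALSE)

# Compare the values read back with the originals.
print(df$Student_name == names_vi)
```

Whether the names then print as glyphs or as <U+1EA0>-style escapes still depends on the R version and the Windows locale, so this is a starting point rather than a guaranteed fix.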
drhnis