Importing data with Vietnamese characters into R

Question

Can someone suggest the best way to import data with Vietnamese characters into an R data frame so that the characters are displayed correctly? The data I need to import are a longer version of the column below:

Student_name

PHẠM THANH

PHẠM VĂN

NGUYỄN TUẤN

NGUYỄN VĂN

VŨ NGỌC

I tried many options, including saving the data as a Unicode .txt file and importing it into R with encoding = "UTF-8" specified.

With read.csv or read.table, I get the error message

In read.table("Stu.txt", header = TRUE, encoding = "UTF-8") : line 1 appears to contain embedded nulls
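One possible cause, assuming the file was saved with the Windows or Excel "Unicode Text" option: such files are UTF-16LE, so every ASCII character is followed by a zero byte, and the encoding argument of read.table only marks strings after reading rather than re-encoding the file. The sketch below builds a small UTF-16LE file and reads it back with fileEncoding instead. The temporary file and the student_names vector are illustrative stand-ins for Stu.txt, and the sketch assumes a session whose native encoding can represent Vietnamese (for example a UTF-8 locale), which may not hold on older Windows builds of R.

```r
# Sample names from the question, written with \u escapes so the script stays ASCII-only.
student_names <- c("PH\u1ea0M THANH", "PH\u1ea0M V\u0102N", "NGUY\u1ec4N TU\u1ea4N")

# Simulate a Windows "Unicode" text file: UTF-16LE bytes preceded by a BOM (FF FE).
tmp <- tempfile(fileext = ".txt")
txt <- paste0(paste(c("Student_name", student_names), collapse = "\n"), "\n")
utf16 <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xff, 0xfe)), utf16), tmp)

# fileEncoding declares how the bytes on disk are encoded, so R re-encodes them on input.
# sep = "\t" keeps each full name in one field instead of splitting it at the space.
stu <- read.table(tmp, header = TRUE, sep = "\t", fileEncoding = "UTF-16LE",
                  stringsAsFactors = FALSE)
print(stu)
```

If the real file is tab-delimited with more columns, the same call should apply unchanged; for a comma-separated export, sep = "," would be the assumption to adjust.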

If I save the data as an MS Excel file and import it with read.xlsx (package xlsx), I can read the data, but without specifying an encoding I get garbled output, as shown:

 Student_name

1 PHáº M THANH

2 PHáº M VÄ‚N

3 NGUYá»„N TUáº¤N

4 NGUYá»„N VÄ‚N

5 NGUYá»„N VÄ‚N

6 VÅ¨ NGá»ŒC
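Output of this kind is typically what UTF-8 bytes look like when they are interpreted as a single-byte encoding such as Latin-1 or Windows-1252. The base-R sketch below reproduces the effect on one name and shows that, provided the bytes are intact, re-declaring the encoding recovers the original text. The variable names are illustrative only.

```r
# One name from the question, with the dotted A written as a Unicode escape.
original <- "PH\u1ea0M"
raw_bytes <- charToRaw(original)   # the UTF-8 bytes of the string

# Misinterpret the same bytes as Latin-1: every byte now counts as one character.
misread <- rawToChar(raw_bytes)
Encoding(misread) <- "latin1"
print(nchar(original))   # number of characters in the real string
print(nchar(misread))    # number of characters after the misreading

# The bytes were never changed, so declaring them as UTF-8 again repairs the string.
repaired <- misread
Encoding(repaired) <- "UTF-8"
print(identical(repaired, original))
```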

With read.xlsx and encoding = "UTF-8", the text is decoded as UTF-8, but the Vietnamese characters are printed as code points enclosed in less-than and greater-than signs, for example PH<U+1EA0>M THANH.
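The <U+1EA0> form is how R prints a character that the session's native encoding cannot represent, so the underlying string may well be intact even though it prints this way. A small base-R check, offered as a sketch with a hypothetical variable x standing in for one imported name:

```r
# Hypothetical stand-in for one value of the imported Student_name column.
x <- "PH\u1ea0M THANH"

print(Encoding(x))   # how R has marked the string
print(nchar(x))      # character count, not byte count

# Inspect the third character directly instead of relying on how it prints.
code_point <- utf8ToInt(substr(x, 3, 3))
print(sprintf("U+%04X", code_point))
```

If these checks give the expected encoding mark and code point for the real data, the import itself worked and only the console display is affected.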

I am running R through RStudio version 0.99.467 on Windows 7.

Thank you.

So you are saying that your .xlsx file is XML, which is usually implicitly UTF-8 encoded without a BOM, and that you looked inside the .xlsx and know which XML charset is specified? Are you saying your problem is an R problem, or an Excel problem? You should say exactly what is in the .xlsx file and what leads you to believe you have an R problem. — Warren P, Dec 12 '15 at 15:22
The .xlsx file is a default .xlsx. I also saved the column separately as a Unicode .txt file and as a .csv file. I can open this file in Notepad or MS Excel with the characters displayed correctly. I can even copy and paste the column into my RStudio script window and read it with textConnection, but the problem persists, so it is surely a problem with the commands I need to give R to decode the characters properly. — Suhas, Dec 12 '15 at 16:41
Do you understand what I said? Do you know how to examine a file and determine its actual encoding? — Warren P, Dec 12 '15 at 16:47
Please see the following: http://stackoverflow.com/questions/22876746/how-to-read-data-in-utf-8-format-in-r — Raad, Dec 12 '15 at 17:03
I am afraid I did not understand what you said, or your question either. Perhaps the example below will explain. library(data.table) toread <- "PHẠM THANH PHẠM VĂN NGUYỄN TUẤN NGUYỄN VĂN NGUYỄN VĂN VŨ NGỌC" — Suhas, Dec 12 '15 at 17:14
Thanks, I did read question 22876746 on Chinese characters, but it did not help. I don't understand @Warren P's question about the file either, because if I try to read the data inside the script with DT <- read.table(textConnection(toread), header = FALSE, encoding = "UTF-8"), the name PHẠM reads as PHM. — Suhas, Dec 12 '15 at 17:25

score 0 · Answer 1 · edited Aug 24 '20 at 13:14


I used the stri_trans_general function from the stringi package:

library(dplyr)    # provides %>% and mutate()
library(stringi)  # provides stri_trans_general()
data <- read.table("Stu.txt", header = TRUE, encoding = "UTF-8") %>%
  mutate(Student_name = stri_trans_general(Student_name, "Latin-ASCII"))
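Note that the "Latin-ASCII" transform transliterates rather than preserves: it replaces accented Latin letters with their closest ASCII equivalents, so the diacritics are dropped. A minimal sketch of that effect on a few names from the question, assuming the stringi package is installed; the vector names are illustrative only.

```r
library(stringi)

# Names from the question, written with \u escapes so the script stays ASCII-only.
student_names <- c("PH\u1ea0M THANH", "NGUY\u1ec4N TU\u1ea4N", "V\u0168 NG\u1eccC")

# Transliterate to plain ASCII; accents and tone marks are removed in the process.
ascii_names <- stri_trans_general(student_names, "Latin-ASCII")
print(ascii_names)

# Confirm that nothing outside ASCII is left.
print(all(stri_enc_isascii(ascii_names)))
```

This is a reasonable workaround when unaccented names are acceptable downstream; it does not help if the Vietnamese spelling itself has to be kept.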

answered Aug 24 '20 at 11:52 by drhnis; edited Aug 24 '20 at 13:14 by Adriaan
