Non-English characters in the R input

Question

My train data looks like this:

Hi, I am trying to predict the language of a set of names written in English and non-English characters.

When I try to read the input using the R command:

 data = read.table("C:\\Users\\Sneha\\Documents\\study materials\\Independent Study\\train.txt",stringsAsFactors=FALSE,fileEncoding = "UTF-8")

I get the below error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 3 did not have 4 elements
In addition: Warning messages:
1: In read.table("C:\\Users\\Sneha\\Documents\\study materials\\Independent Study\\train.txt",  :
  invalid input found on input connection 'C:\Users\Sneha\Documents\study materials\Independent Study\train.txt'
2: In read.table("C:\\Users\\Sneha\\Documents\\study materials\\Independent Study\\train.txt",  :
  incomplete final line found by readTableHeader on 'C:\Users\Sneha\Documents\study materials\Independent Study\train.txt'

Can anyone suggest a better R command for reading input of this sort?

Well, what encoding does your input file use? R can handle UTF-8 if your file is encoded in UTF-8 but the error message implies that's not the case. There is no way to tell for sure how the file is encoded by looking at it. That is something that the file creator must know. — MrFlick, Mar 23 '17 at 02:13
My notepad GUI indicates UTF-8 as the encoding type. I am not sure if this can be trusted. — codingyo, Mar 23 '17 at 02:23
Is your data delimited? Or do you have one record per row? `read.table` will separate based on white space by default. A [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would be helpful so we know what your input looks like. — MrFlick, Mar 23 '17 at 02:28
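To illustrate the whitespace point in the last comment: if a name contains a space, the default whitespace splitting yields a different number of fields on that line, which is one plausible cause of "line 3 did not have 4 elements". Below is a minimal sketch in base R, assuming a tab-delimited file; the temporary file and its contents are made up for illustration and are not the asker's data.

```r
# Hypothetical three-line sample; the name on the last line contains a space.
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\tlang", "Anna\ten", "Bill Gortney\ten"), tmp)

# Count the fields on each line under the two splitting rules.
n_ws  <- count.fields(tmp)              # default sep = "": split on any whitespace
n_tab <- count.fields(tmp, sep = "\t")  # split on tabs only

print(n_ws)   # the last line has one extra field
print(n_tab)  # every line has the same number of fields

# With an explicit separator the rows line up.
dat <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                  stringsAsFactors = FALSE)
print(dat)
```

Running `count.fields()` on the real file first is a cheap way to see which lines disagree before calling `read.table()`.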

score 0 · Answer 1 · answered Mar 23 '17 at 15:40

This worked for me.

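library(readr)  # provides read_delim()

rm(list = ls())
# Set the working directory to the folder of the script open in RStudio
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

# filename: path to the tab-delimited input file (must be defined first)
entities <- read_delim(filename,
    "\t", escape_double = FALSE, trim_ws = TRUE)

# Inspect individual cells once the data is loaded
entities[4, 4]
as.character(entities[22, 4])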

When I viewed the data, the non-English characters were printed as Unicode escapes:

entities[4,4]
# A tibble: 1 × 1
                                                                             IGNORE
                                                                              <chr>
1 <U+0411><U+0438><U+043B><U+043B>+<U+0413><U+043E><U+0440><U+0442><U+043D><U+0438>
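A note on the output above: as far as I know, the `<U+0411>` sequences are how R (notably on Windows) prints characters that the console's native encoding cannot display, so the strings themselves are probably intact. Here is a minimal base R sketch for checking that, assuming a tab-delimited UTF-8 file. The temporary file and the column names are invented for illustration. It uses `encoding = "UTF-8"`, which marks the strings as UTF-8 without re-encoding them, rather than `fileEncoding`, which converts to the native encoding and can fail when that encoding has no Cyrillic.

```r
# Hypothetical sample: a header plus two tab-delimited rows. The \u escapes
# are the first four code points from the output above.
lines <- c("name\tlang", "\u0411\u0438\u043b\u043b\tru", "Anna\ten")

# Write the UTF-8 bytes to a temporary file without re-encoding them.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(lines), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" only declares how the bytes should be interpreted.
entities <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                       stringsAsFactors = FALSE, encoding = "UTF-8")

# Check the declared encoding and the actual code points of the first name.
enc <- Encoding(entities$name[1])
cps <- utf8ToInt(entities$name[1])
print(enc)
print(sprintf("U+%04X", cps))
```

If the code points match what you expect, the data was read correctly and only the console display is affected.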
