
I am trying to read a CSV file containing text in several different scripts with the function read.csv. This is a sample of the file content:

device,country_code,keyword,indexed_clicks,indexed_cost
Mobile,JP,お金 借りる,5.913037843442198,103.05985173478956
Desktop,US,email,82.450427682737157,81.871030974598241
Desktop,US,news,414.14755054432345,66.502397615344861
Mobile,JP,ヤフートラベル,450.9622861586314,55.733902871922957

If I read the data with the following call:

texts <- read.csv("text.csv", sep = ",", header = TRUE)

The data frame is imported into R, but the non-ASCII characters come out garbled:

   device country_code               keyword indexed_clicks indexed_cost
1  Mobile           JP ã\u0081Šé‡‘ 借りる       5.913038    103.05985
2 Desktop           US                 email      82.450428     81.87103
3 Desktop           US                  news     414.147551     66.50240
4  Mobile           JP ヤフートラベル     450.962286     55.73390

If I use the same call with fileEncoding = "utf-8" added:

texts <- read.csv("text.csv", sep = ",", header = TRUE, fileEncoding = "utf-8")

I get the following warning messages:

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  entrada inválida encontrada en la conexión de entrada 'text.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'text.csv'

Does anyone know how to read this file properly?

Dharman
  • The warnings are just warnings: your `read.csv` call succeeded. Did it read the Japanese characters correctly? If not, you need to choose a different `fileEncoding` setting. If so, some lines in your file are not entered properly. – user2554330 Oct 21 '21 at 10:09
  • Even though it's in Spanish, the warning message seems to suggest the file's last line is incomplete. `ヤフ..` is how a UTF8 file would appear if you tried to read it with a single-byte encoding like Latin1. – Panagiotis Kanavos Oct 21 '21 at 10:09
  • What does the data frame look like after you load it with UTF8? Are there any missing or extra columns? The only thing that's certain is that this is a UTF8 file – Panagiotis Kanavos Oct 21 '21 at 10:12
  • @PanagiotisKanavos: I think the two warning messages are different. The Spanish one is a translation of `invalid input found on input connection`, not the incomplete line warning in the second one. It's probably a character that's not legal in UTF-8. – user2554330 Oct 21 '21 at 11:27
  • It may be a UTF8 file with a BOM. This can be read with `UTF-8-BOM`. – Panagiotis Kanavos Oct 21 '21 at 11:34
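
To try the `UTF-8-BOM` suggestion from the last comment, something like the following might work. It is only a sketch: the temporary file is a hypothetical stand-in for text.csv, and the presence of a BOM in the real file is an assumption that has not been confirmed. It also assumes the R session runs in a UTF-8 locale, otherwise the Japanese text may still fail to convert.

```r
# Hypothetical stand-in for text.csv: a UTF-8 file that starts with a BOM.
path <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
rows <- paste0(
  "device,country_code,keyword,indexed_clicks,indexed_cost\n",
  "Mobile,JP,\u304a\u91d1 \u501f\u308a\u308b,5.913037843442198,103.05985173478956\n",
  "Desktop,US,email,82.450427682737157,81.871030974598241\n"
)
writeBin(c(bom, charToRaw(enc2utf8(rows))), path)

# "UTF-8-BOM" asks the connection to drop the byte order mark while reading,
# so the first column name is not polluted by it.
texts <- read.csv(path, header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(texts))
```

For the real file, the same call would be `read.csv("text.csv", fileEncoding = "UTF-8-BOM")`.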

1 Answer


I tried to reproduce your problem with both:

texts <- read.csv("text.csv", sep = ",", header = TRUE)

and

texts_ <- read.csv("text.csv", sep = ",", header = TRUE, encoding = "UTF-8")

and both work perfectly fine (RStudio 1.4.1717, Ubuntu 20.04.3 LTS). Some possibilities I can think of:

  1. The CSV file wasn't saved properly as UTF-8, or it is corrupted. Have you checked the file again?
  2. If you are using Windows, try using encoding instead of fileEncoding. These problems happen with non-standard characters (Windows Encoding Hell).
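
One way to check the first possibility is to read the raw lines and ask base R which of them are not valid UTF-8. The sketch below builds a small hypothetical file containing one deliberately bad line (a Latin-1 byte) so that it runs on its own. The file contents are made up for illustration, and for the real data `path` would be "text.csv".

```r
# Hypothetical test file: one valid UTF-8 line and one line with a Latin-1 byte.
path <- tempfile(fileext = ".csv")
header <- charToRaw("device,country_code,keyword\n")
good <- charToRaw("Mobile,JP,\u30e4\u30d5\u30fc\n")
bad <- c(charToRaw("Desktop,US,caf"), as.raw(0xE9), charToRaw("\n"))
writeBin(c(header, good, bad), path)

# readLines() keeps the bytes as they are; validUTF8() tests each line.
lines <- readLines(path, warn = FALSE)
invalid_lines <- which(!validUTF8(lines))
print(invalid_lines)
```

Any line numbers reported here point at rows worth inspecting in an editor or hex viewer before blaming `read.csv`.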
Dharman
Fariz Awi
  • The file *is* UTF8. `ヤフーã` is how UTF8 characters appear if they're read or displayed using a single-byte codepage. There are no "non-standard encodings". Windows is a Unicode operating system anyway - all strings are Unicode. The system locale is used by *non*-Unicode applications. It's *R* or, more precisely, some R packages that have trouble with Unicode because they weren't compiled for Unicode. Setting `LC_ALL` was enough if you only had to process files from your country. Once US/UK data scientists had to process *multiple* encodings though, e.g. Cyrillic, Chinese .... Ooops! – Panagiotis Kanavos Oct 21 '21 at 10:37
  • R fixed the problem years ago by allowing you to specify the encoding in any locale-sensitive method. RStudio took a bit longer. Some packages, though, still assume there's only one encoding in the world, that only a single encoding will ever be used on a machine, or that it's acceptable to modify `LC_ALL` every time you want to load a file with a different encoding – Panagiotis Kanavos Oct 21 '21 at 10:39