
I have about 1000 CSV files which contain Hebrew.

I'm trying to import them into R but there is a problem reading Hebrew into the program.

When using this, around 80% of the files come through with correct Hebrew but the other 20% do not:

data_lst <- lapply(files_to_read,function(i){
  read.csv(i, encoding = "UTF-8")
})

When using this, the other 20% come out right but the 80% that worked before do not:

data_lst <- lapply(files_to_read,function(i){
  read.csv(i, encoding = 'utf-8-sig')
})

I'm unable to use read_csv from library(readr) and have to stick with read.csv.

Thank you for your help!

Ido

1 Answer


It sounds like you have two different file encodings, UTF-8 and UTF-8-SIG. The latter has a byte order mark (BOM) of 0xEF, 0xBB, 0xBF at the start indicating the encoding.
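As a quick check, you can look at the first three bytes of each file directly and see which ones carry the BOM. This is just a sketch, with files_to_read assumed to be your vector of file paths:

# Compare the first three bytes of each file to the BOM bytes 0xEF 0xBB 0xBF.
has_bom <- vapply(
    files_to_read,
    \(file) identical(readBin(file, what = "raw", n = 3L),
                      as.raw(c(0xef, 0xbb, 0xbf))),
    logical(1)
)
table(has_bom)  # how many files start with a BOM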

I wrote the iris dataset to CSV in both encodings; the only difference is the first line of each file.

UTF-8:
sepal.length,sepal.width,petal.length,petal.width,species
UTF-8-SIG:
ï»¿sepal.length,sepal.width,petal.length,petal.width,species

(The leading ï»¿ is how the three BOM bytes render when the file is viewed as Latin-1; read as UTF-8 they form the single, invisible character U+FEFF.)
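For reference, here is one way such a pair of test files could be produced from R (file names are illustrative; write.csv emits plain UTF-8, and the BOM is written by hand for the second file):

# Plain UTF-8 file
write.csv(iris, "iris_utf8.csv", row.names = FALSE)

# UTF-8-SIG file: write the three BOM bytes first, then append the CSV body
# (write.table warns about appending column names, which is harmless here).
con <- file("iris_utf8_sig.csv", open = "wb")
writeBin(as.raw(c(0xef, 0xbb, 0xbf)), con)
close(con)
write.table(iris, "iris_utf8_sig.csv", sep = ",", row.names = FALSE,
            append = TRUE, fileEncoding = "UTF-8")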

In your case, it sounds like R is not detecting the encodings correctly, but encoding="utf-8" works for some files and encoding="utf-8-sig" works for the others. The natural course of action seems to be to read the first line of each file and check whether it starts with the BOM:

# Read as UTF-8, the BOM is the single character U+FEFF,
# so the pattern is "start of string, then U+FEFF".
BOM_pattern <- "^\ufeff"

encodings <- vapply(
    files_to_read,
    \(file) {
        # Only the first line is needed to detect the BOM
        line <- readLines(file, n = 1L, encoding = "UTF-8")
        ifelse(grepl(BOM_pattern, line), "utf-8-sig", "utf-8")
    },
    character(1)
)

This will return a (named) character vector with values "utf-8" and "utf-8-sig" as appropriate. You can then supply the matching encoding to read.csv:

data_lst <- Map(
    \(file, encoding) read.csv(file, encoding = encoding),
    files_to_read,
    encodings
)

This should read in each data frame with the correct encoding and store them in the list data_lst.
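A quick sanity check afterwards might look something like this (illustrative only):

length(data_lst)        # should equal length(files_to_read)
head(data_lst[[1]])     # peek at the first file to confirm the Hebrew is intact
sapply(data_lst, nrow)  # row counts per file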

Edit

There may be extra considerations owing to the right-to-left reading order. It seems that in R, whether the caret (^) behaves as a start-of-string anchor depends on whether all the letters in the string are Hebrew letters or not. For example:

pattern  <- "^ז"
grepl(pattern, "זה משפט בעברית") # TRUE
grepl(pattern, "AAA זה משפט בעברית") # FALSE

This may be obvious to regular users of Hebrew but is news to me and could cause additional complications. If the pattern is not always recognised when you think it should be, you can just remove the caret from the pattern:

BOM_pattern <- "\ufeff"

The only exception to this would be if you expect to see this string of characters in one of your column names.
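If you want to avoid regular expressions (and the right-to-left quirks above) altogether, another option is to compare the start of the line with startsWith. This is just a sketch of the same BOM check:

encodings <- vapply(
    files_to_read,
    \(file) {
        line <- readLines(file, n = 1L, encoding = "UTF-8")
        # "\ufeff" is the BOM read as a single UTF-8 character
        if (startsWith(line, "\ufeff")) "utf-8-sig" else "utf-8"
    },
    character(1)
)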

SamR