It sounds like you have two different file encodings, utf-8 and utf-8-sig. The latter has a Byte Order Mark of 0xef, 0xbb, 0xbf at the start indicating the encoding. I wrote the iris dataset to csv in both encodings - the only difference is the first line.
UTF-8:
sepal.length,sepal.width,petal.length,petal.width,species
UTF-8-SIG (the leading BOM decodes to the invisible character U+FEFF, shown here as <U+FEFF>):
<U+FEFF>sepal.length,sepal.width,petal.length,petal.width,species
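Because the BOM is invisible in most viewers, one way to confirm which files actually carry it is to inspect the first three bytes directly. A minimal sketch (temp files stand in for your real files):

```r
header <- "sepal.length,sepal.width,petal.length,petal.width,species"

f_plain <- tempfile(fileext = ".csv")
f_bom   <- tempfile(fileext = ".csv")

# Write the header as-is, and once more with a leading BOM character.
writeLines(header, f_plain, useBytes = TRUE)
writeLines(paste0("\ufeff", header), f_bom, useBytes = TRUE)

readBin(f_plain, "raw", n = 3)  # 73 65 70 ("sep")
readBin(f_bom,   "raw", n = 3)  # ef bb bf (the BOM)
```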
In your case, it sounds like R is not detecting the encodings correctly, but using encoding = "utf-8" works for some files and encoding = "utf-8-sig" works for the other files. The natural course of action seems to me to be to read in the first line and see if it has that pattern at the start:
BOM_pattern <- "^\ufeff"  # the BOM decodes to the single character U+FEFF

encodings <- vapply(
  files_to_read,
  \(file) {
    line <- readLines(file, n = 1L, encoding = "UTF-8")
    ifelse(grepl(BOM_pattern, line), "utf-8-sig", "utf-8")
  },
  character(1)
)
This will return a (named) character vector of c("utf-8", "utf-8-sig") as appropriate. You can then supply the encoding to read.csv:
data_lst <- Map(
  \(file, encoding) read.csv(file, encoding = encoding),
  files_to_read,
  encodings
)
This should read in each data frame with the correct encoding and store them in the list data_lst.
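If the files all share the same column layout, the list can then be stacked into a single data frame. A short sketch with dummy data standing in for the frames read from disk:

```r
# Hypothetical follow-up: data_lst holds one data frame per file;
# do.call(rbind, ...) stacks them (assuming identical columns).
data_lst <- list(
  a = data.frame(x = 1:2, y = c("a", "b")),
  b = data.frame(x = 3:4, y = c("c", "d"))
)
combined <- do.call(rbind, data_lst)
nrow(combined)  # 4
```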
Edit
There may be extra considerations owing to the right-to-left reading order. It seems that whether R treats the caret (^) as a start-of-string anchor depends on whether all the letters in the string are Hebrew letters or not. For example:
pattern <- "^ז"
grepl(pattern, "זה משפט בעברית") # TRUE
grepl(pattern, "AAA זה משפט בעברית") # FALSE
This may be obvious to regular users of Hebrew but is news to me and could cause additional complications. If the pattern is not always recognised when you think it should be, you can just remove the caret from the pattern:
BOM_pattern <- "\ufeff"
The only exception to this would be if you expect to see this string of characters in one of your column names.
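An alternative that sidesteps regular expressions entirely (my own suggestion, not part of the approach above) is to test the prefix directly with startsWith(), which does a plain character-by-character comparison and so is unaffected by regex anchors or text direction:

```r
# Sketch: detect the BOM without a regular expression.
line_plain <- "sepal.length,sepal.width,petal.length,petal.width,species"
line_bom   <- paste0("\ufeff", line_plain)

startsWith(line_plain, "\ufeff")  # FALSE
startsWith(line_bom, "\ufeff")    # TRUE
```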