I'm reading a large (> 5 GB) CSV file into R. The file is written in UTF-16.
Most of the functions that can efficiently read large files (fread, read_delim) will not work with UTF-16.
read.csv lets you specify the encoding and will read UTF-16; the issue is that it is slow on files this large.
I'm making it work with read.csv() (see the code below), but I'm curious whether anyone knows of a more efficient way to read UTF-16 data into R.
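For reference, this is the basic one-shot call that does work but is slow on a file this size (a minimal sketch; sep = '\t' and header = FALSE are carried over from the chunked code below, and dfAll is just a placeholder name):

# one-shot read: fileEncoding handles the UTF-16 conversion, but the whole
# file is parsed in a single pass, which is very slow at > 5 GB
dfAll <- read.csv(csvPath, header = FALSE, fileEncoding = 'UTF-16', sep = '\t')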
### iterative process using read.csv https://stackoverflow.com/questions/9352887/strategies-for-reading-in-csv-files-in-pieces/30403877#30403877
library(dplyr)

# open a connection that decodes UTF-16 as it streams
con <- file(csvPath, "r", encoding = 'UTF-16')

# data frame to accumulate the thinned chunks
df2 <- data.frame()
rows <- 10000
x <- 1

while (rows == 10000) {
  # the connection already handles the encoding, so fileEncoding isn't needed here
  df <- read.csv(con, header = FALSE, sep = '\t', nrows = 10000)
  rows <- nrow(df)
  print(rows)

  # header holds the column names, read separately from the first row (not shown)
  colnames(df) <- names(header)

  # keep only the columns of interest
  dataThin <- df %>%
    dplyr::select("gbifID", "genus", "species", "infraspecificEpithet", "taxonRank",
                  "countryCode", "locality", "stateProvince", "decimalLatitude",
                  "decimalLongitude", "basisOfRecord", "institutionCode")

  df2 <- rbind(df2, dataThin)
  x <- x + 1
  print(x)
}

close(con)
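Not an answer to the UTF-16 question itself, but one tweak to the loop above: collecting the chunks in a list and binding them once at the end avoids re-copying df2 on every rbind. A rough sketch with placeholder names (chunks, i), otherwise the same connection and column selection:

con <- file(csvPath, "r", encoding = 'UTF-16')
chunks <- list()
rows <- 10000
i <- 1

while (rows == 10000) {
  df <- read.csv(con, header = FALSE, sep = '\t', nrows = 10000)
  rows <- nrow(df)
  colnames(df) <- names(header)
  # store each thinned chunk instead of rbind-ing it immediately
  chunks[[i]] <- dplyr::select(df, "gbifID", "genus", "species", "infraspecificEpithet",
                               "taxonRank", "countryCode", "locality", "stateProvince",
                               "decimalLatitude", "decimalLongitude", "basisOfRecord",
                               "institutionCode")
  i <- i + 1
}

close(con)

# bind all chunks in one go
df2 <- dplyr::bind_rows(chunks)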