I'm reading a large (> 5 GB) CSV file into R. The file is written in UTF-16.
Most of the functions that can efficiently read large files (fread, read_delim) will not work with UTF-16.
read.csv lets you specify the encoding and will read UTF-16; the issue is that it is slow on files this large.
I'm making it work with read.csv() (see the code below), but I'm curious whether anyone knows of a more efficient way to read UTF-16 data into R.
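For reference, this is the basic one-shot call that does work but is slow on a file this size (a minimal sketch; sep = '\t' and header = FALSE are carried over from the chunked code below, and dfAll is just a placeholder name):

# one-shot read: fileEncoding handles the UTF-16 conversion, but the whole
# file is parsed in a single pass, which is very slow at > 5 GB
dfAll <- read.csv(csvPath, header = FALSE, fileEncoding = 'UTF-16', sep = '\t')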
### iterative process using read.csv https://stackoverflow.com/questions/9352887/strategies-for-reading-in-csv-files-in-pieces/30403877#30403877
library(dplyr)

# open a connection that decodes UTF-16 as it streams
con <- file(csvPath, "r", encoding = 'UTF-16')

# data frame to accumulate the thinned chunks
df2 <- data.frame()
rows <- 10000
x <- 1

while (rows == 10000) {
  # the connection already handles the encoding, so fileEncoding isn't needed here
  df <- read.csv(con, header = FALSE, sep = '\t', nrows = 10000)
  rows <- nrow(df)
  print(rows)

  # header holds the column names, read separately from the first row (not shown)
  colnames(df) <- names(header)

  # keep only the columns of interest
  dataThin <- df %>%
    dplyr::select("gbifID", "genus", "species", "infraspecificEpithet", "taxonRank",
                  "countryCode", "locality", "stateProvince", "decimalLatitude",
                  "decimalLongitude", "basisOfRecord", "institutionCode")

  df2 <- rbind(df2, dataThin)
  x <- x + 1
  print(x)
}

close(con)
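Not an answer to the UTF-16 question itself, but one tweak to the loop above: collecting the chunks in a list and binding them once at the end avoids re-copying df2 on every rbind. A rough sketch with placeholder names (chunks, i), otherwise the same connection and column selection:

con <- file(csvPath, "r", encoding = 'UTF-16')
chunks <- list()
rows <- 10000
i <- 1

while (rows == 10000) {
  df <- read.csv(con, header = FALSE, sep = '\t', nrows = 10000)
  rows <- nrow(df)
  colnames(df) <- names(header)
  # store each thinned chunk instead of rbind-ing it immediately
  chunks[[i]] <- dplyr::select(df, "gbifID", "genus", "species", "infraspecificEpithet",
                               "taxonRank", "countryCode", "locality", "stateProvince",
                               "decimalLatitude", "decimalLongitude", "basisOfRecord",
                               "institutionCode")
  i <- i + 1
}

close(con)

# bind all chunks in one go
df2 <- dplyr::bind_rows(chunks)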