
Just like the title says... how can I embed code in my script to convert all the files in a folder from utf-16 to utf-8, to make them easier for R to read?

1 Answer


R can read utf-16 files

R is able to read utf-16 files, so it may not be necessary to convert them at all. However, to answer the question, here are:

  1. An R function to copy all utf-16 encoded files in a directory to a new folder with utf-8 encoding. This reads each file into RAM at once, which may be problematic with large files.
  2. A list of native OS approaches to changing the encoding for a directory of files (both Linux/Mac and Windows):
    1. Linux/Mac: a bash script using iconv, which converts each file as a stream, i.e. avoids storing the whole file in memory at once.
    2. Windows: a pure PowerShell approach using Get-Content, which reads the entire file at once.
    3. Windows: a PowerShell approach with some embedded C# which streams the file line by line.
  3. A summary of how to read a utf-16 csv file into R using base R, data.table and tidyverse. If you just want to read the files into R, there's no need to copy them, and this is probably the right approach.

1. R function to change file encoding

You can write an R function to read in a file in utf-16 and then write it out in utf-8:

convert_file_to_utf8 <- function(in_file, out_file, encoding = "utf-16") {
    # Read the whole file through a connection with the source encoding
    in_file_conn <- file(in_file, encoding = encoding)
    txt <- readLines(in_file_conn)
    close(in_file_conn)

    # Create out directory
    if (!dir.exists(dirname(out_file))) dir.create(dirname(out_file), recursive = TRUE)

    # Write file with new encoding
    out_file_conn <- file(out_file, encoding = "utf-8")
    writeLines(txt, out_file_conn)
    close(out_file_conn)
}

If you want to do this to an entire directory then you can write another function to call this function:

create_utf8_dir <- function(in_dir = "./utf16dir/", out_dir = "./utf8dir/") {
    files <- dir(in_dir, full.names = TRUE)
    for (in_file in files) {
        # Swap the input directory prefix for the output directory in each path
        out_file <- sub(in_dir, out_dir, in_file, fixed = TRUE)
        convert_file_to_utf8(in_file, out_file)
    }
}

Running create_utf8_dir() will copy the utf-16 encoded contents of the directory "./utf16dir/" to a directory called "./utf8dir/" (which it will create if it does not exist).
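
For example, to convert a different pair of directories (these paths are hypothetical):

# Copy the utf-16 files in ./raw_utf16/ into ./clean_utf8/ as utf-8
create_utf8_dir(in_dir = "./raw_utf16/", out_dir = "./clean_utf8/")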

2. Native OS approaches to changing a file encoding

However, if the files are large, an approach which reads in the entirety of each file at once will use a lot of RAM.
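
If you would rather stay in R, you can avoid this by reading and writing in fixed-size chunks. Here is a minimal sketch of a chunked variant of the function above (the function name and the default chunk_size of 10,000 lines are my own arbitrary choices):

convert_file_to_utf8_chunked <- function(in_file, out_file, chunk_size = 10000) {
    # Open connections explicitly so successive readLines() calls advance through the file
    in_conn <- file(in_file, open = "r", encoding = "utf-16")
    out_conn <- file(out_file, open = "w", encoding = "utf-8")
    on.exit({
        close(in_conn)
        close(out_conn)
    })
    repeat {
        # Read at most chunk_size lines; a zero-length result means end of file
        lines <- readLines(in_conn, n = chunk_size)
        if (length(lines) == 0) break
        writeLines(lines, out_conn)
    }
}

The OS tools below achieve the same streaming behaviour without leaving the shell.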

2.1 bash

If you are using Linux/Mac, I would use iconv, which can change a file's encoding while streaming it, i.e. it never keeps the entire file contents in RAM. For one file you can do:

iconv -f UTF-16 -t UTF-8 mtcars.csv > mtcars_utf8.csv

To convert all files in the directory ./utf16dir into ./utf8dir:

IN_ENCODING=UTF16
OUT_ENCODING=UTF8
OUT_DIR=utf8dir
mkdir -p "./$OUT_DIR"  # create the output directory if it does not exist
for f in ./utf16dir/*; do
    basename="$(basename "${f%.*}")"
    extension="${f##*.}"
    outfile="./$OUT_DIR/${basename}${OUT_ENCODING}.${extension}"
    echo "$outfile"
    iconv -f "$IN_ENCODING" -t "$OUT_ENCODING" "$f" > "$outfile"
done

2.2 PowerShell

2.2.1 Pure PowerShell

If you are using Windows, you can use the following pattern:

(Get-Content -Path mtcars.csv) | Set-Content -Encoding ASCII -Path mtcarsutf8.csv

To abstract this to all files in a folder:

$in_dir = "./utf16dir/"
$out_dir = "./utf8dir/"

If (!(Test-Path -PathType Container $out_dir)) {
    New-Item -ItemType Directory -Path $out_dir
}
Get-ChildItem $in_dir | 
Foreach-Object {
    $outfile = $out_dir + $_.BaseName + "_utf8" + $_.Extension
    Write-Output $outfile
    (Get-Content -Path $_.FullName) | Set-Content -Encoding ASCII -Path $outfile
}

Again this will copy the utf-16 encoded contents of the directory "./utf16dir/" to a directory called "./utf8dir/" (which it will create if it does not exist), appending "_utf8" to the file names.

There are drawbacks to this approach:

  1. I set the encoding to ASCII, which is a subset of utf-8. That's OK here as I know all the characters are ASCII characters. If that's not the case, you can change ASCII to UTF-8. However, Windows uses utf-8-bom, and it is not entirely straightforward to remove the Byte Order Mark (BOM) - see here if you have non-ASCII characters, or the .NET sketch after this list.
  2. This reads the entire file into RAM at once, like the R approach.
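
One way around the first drawback, if you need utf-8 output without a BOM, is to drop down to .NET, which lets you construct a BOM-free utf-8 encoder. A minimal sketch for a single file (the file names are placeholders, and this still reads the whole file at once):

$utf8NoBom = New-Object System.Text.UTF8Encoding $false  # $false means no BOM
$lines = Get-Content -Path mtcars.csv
# .NET does not share PowerShell's working directory, so build an absolute path
[System.IO.File]::WriteAllLines((Join-Path $PWD "mtcars_utf8.csv"), $lines, $utf8NoBom)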

2.2.2 PowerShell with embedded C#

You can overcome both of these limitations by using C# within PowerShell to read a utf-16 encoded file line by line, and then write out a utf-8 file:

$code = @"
using System;
using System.IO;
namespace ProcessLargeFile
{
    public class Program
    {
        static void ConvertLine(string line, StreamWriter sw)
        {
            sw.WriteLine(line);
        }
        public static void ConvertFile(string path, string inDir, string outDir) {
            // Read as utf-16; a byte order mark, if present, still takes precedence
            StreamReader sr = new StreamReader(File.Open(path, FileMode.Open), System.Text.Encoding.Unicode);
            string outPath = path.Replace(inDir, outDir);
            Console.WriteLine(outPath);
            // FileMode.Create overwrites any existing output rather than appending to it
            StreamWriter sw = new StreamWriter(File.Open(outPath, FileMode.Create));
            try {
                while (!sr.EndOfStream){
                    string line = sr.ReadLine();
                    ConvertLine(line, sw);
                }
            } finally {
                sr.Close();
                sw.Close();
            }
        }
        static void ConvertDir(string inDir, string outDir) {
            string[] filePaths = Directory.GetFiles(inDir);
            Directory.CreateDirectory(outDir);
            foreach(string file in filePaths)
            {
                ConvertFile(file, inDir, outDir);
            }
        }
        public static void Main(string[] args){
            string inDir = args[0];
            string outDir = args[1];
            ConvertDir(inDir, outDir);
        }
    }
}
"@
Add-Type -TypeDefinition $code -Language CSharp
[ProcessLargeFile.Program]::Main(@("utf16dir/", "utf8dir/"))

Again this copies the content of "utf16dir/" to "utf8dir/". You can change the input and output directories by changing the arguments in the final line. This approach streams the files and writes out pure utf-8 (with no BOM).
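
For example, to run the conversion over a different pair of directories (again with hypothetical paths), change only that final line:

[ProcessLargeFile.Program]::Main(@("C:/data/utf16/", "C:/data/utf8/"))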

3. base R, data.table and tidyverse methods to read a utf-16 file

In your question you state you wish to change the encoding in order to make it easier for R to read the files. R is able to read utf-16 files, provided you set the encoding argument when you create the file connection with file().

I'll set out here how to read a utf-16 csv file with base R and popular alternatives. Assume for example that you are trying to read the following file:

in_file <- "./utf16dir/mtcars.csv"

base R

in_file_conn <- file(in_file, encoding = "utf-16")
read.csv(text = readLines(in_file_conn))

data.table

in_file_conn <- file(in_file, encoding = "utf-16")
data.table::fread(
    text = readLines(in_file_conn)
)

readr

readr::read_csv(
    in_file,
    locale = readr::locale(encoding = "utf-16")
)

Depending on your ultimate goal, rather than copying all the files in the directory, you may simply wish to read in the utf-16 encoded files.
