Just like the title says... how can I embed code into my script to change all files in the folder from utf-16 to utf-8 to make it easier for R to read?
1 Answer
R can read utf-16 files

R is able to read utf-16 files, so it may not be necessary to convert them. However, to answer the question, here is:
- An R function to copy all utf-16 encoded files in a directory to a new folder with utf-8 encoding. This reads each file into RAM at once, which may be problematic with large files.
- A list of native OS approaches to changing the encoding for a directory of files (both Linux/Mac and Windows):
  - Linux/Mac: a bash script using iconv, which converts file streams, i.e. avoids storing the whole file in memory at once.
  - Windows: a pure PowerShell approach using Get-Content, which reads the entire file at once.
  - Windows: a PowerShell approach with some embedded C# which streams the file line by line.
- A summary of how to read a utf-16 csv file into R using base R, data.table and the tidyverse. If you just want to read the files into R, there's no need to copy them, and this is probably the right approach.
1. R function to change file encoding
You can write an R function to read in a file in utf-16 and then write it out in utf-8:
convert_file_to_utf8 <- function(in_file, out_file, encoding = "utf-16") {
  # Read the file via a connection with the source encoding
  in_file_conn <- file(in_file, encoding = encoding)
  txt <- readLines(in_file_conn)
  close(in_file_conn)
  # Create out directory
  if (!dir.exists(dirname(out_file))) dir.create(dirname(out_file))
  # Write file with new encoding
  out_file_conn <- file(out_file, encoding = "utf-8")
  writeLines(txt, out_file_conn)
  close(out_file_conn)
}
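For example, to convert a single file (the mtcars.csv path here is just a hypothetical example; substitute your own):

# Hypothetical input file; the output directory is created if needed
convert_file_to_utf8("./utf16dir/mtcars.csv", "./utf8dir/mtcars.csv")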
If you want to do this for an entire directory, you can write another function that calls this one:
create_utf8_dir <- function(in_dir = "./utf16dir/", out_dir = "./utf8dir/") {
  files <- dir(in_dir, full.names = TRUE)
  for (in_file in files) {
    # Build the output path by swapping the directory prefix
    out_file <- sub(in_dir, out_dir, in_file, fixed = TRUE)
    convert_file_to_utf8(in_file, out_file)
  }
}
Running create_utf8_dir() will copy the utf-16 encoded contents of the directory "./utf16dir/" to a directory called "./utf8dir/" (which it will create if it does not exist).
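To sanity-check the result, the copied files should now read without any special encoding handling (again using the hypothetical mtcars.csv example):

# The copy is plain utf-8, so no encoded connection is needed
head(readLines("./utf8dir/mtcars.csv"))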
2. Native OS approaches to changing a file encoding
However, if the files are large, an approach which reads in the entirety of each file at once will use a lot of RAM.
2.1 bash
If you are using Linux/Mac I would use iconv, which can change a file encoding while streaming the file, i.e. never keeping the entire file contents in RAM. For one file you can do:
iconv -f UTF-16 -t UTF-8 mtcars.csv > mtcars_utf8.csv
To convert all files in the directory ./utf16dir into ./utf8dir:
IN_ENCODING=UTF16
OUT_ENCODING=UTF8
OUT_DIR=utf8dir

# Create the output directory if it does not already exist
mkdir -p "./$OUT_DIR"

for f in ./utf16dir/*; do
  basename="$(basename "${f%.*}")"
  extension="${f##*.}"
  outfile="./$OUT_DIR/$basename$OUT_ENCODING.$extension"
  echo "$outfile"
  iconv -f "$IN_ENCODING" -t "$OUT_ENCODING" "$f" > "$outfile"
done
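If you would rather drive iconv from within R than from a standalone shell script, here is a minimal sketch using system2() (it assumes iconv is installed and on your PATH):

# Stream-convert every file in ./utf16dir by calling iconv from R
dir.create("./utf8dir", showWarnings = FALSE)
for (f in dir("./utf16dir", full.names = TRUE)) {
  out <- file.path("./utf8dir", basename(f))
  system2("iconv", c("-f", "UTF-16", "-t", "UTF-8", shQuote(f)), stdout = out)
}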
2.2 PowerShell
2.2.1 Pure PowerShell
If you are using Windows, you can use the following pattern:
(Get-Content -path mtcars.csv) | Set-Content -Encoding ASCII -Path mtcarsutf8.csv
To abstract this to all files in a folder:
$in_dir = "./utf16dir/"
$out_dir = "./utf8dir/"

# Create the output directory if it does not already exist
If (!(Test-Path -PathType Container $out_dir)) {
  New-Item -ItemType Directory -Path $out_dir
}

Get-ChildItem $in_dir |
  ForEach-Object {
    $outfile = $out_dir + $_.BaseName + "_utf8" + $_.Extension
    Write-Output $outfile
    (Get-Content -Path $_.FullName) | Set-Content -Encoding ASCII -Path $outfile
  }
Again this will copy the utf-16 encoded contents of the directory "./utf16dir/" to a directory called "./utf8dir/" (which it will create if it does not exist), appending "_utf8" to the file names.
There are drawbacks to this approach:
- I set the encoding to ASCII, which is a subset of utf-8. That's OK here as I know all the characters are ASCII characters. If that's not the case, you can change ASCII to UTF-8. However, Windows uses utf-8-bom, and it is not entirely straightforward to remove the Byte Order Mark (BOM) - see here if you have non-ASCII characters. (R itself can cope with the BOM on read; see the snippet after this list.)
- This reads the entire file into RAM at once, like the R approach.
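If you do end up with utf-8-bom files and your aim is just to get them into R, base R can discard the BOM at read time by naming the "UTF-8-BOM" encoding. For example (the file name is the hypothetical one produced by the PowerShell loop above):

# "UTF-8-BOM" tells R to strip the Byte Order Mark while reading
df <- read.csv("./utf8dir/mtcars_utf8.csv", fileEncoding = "UTF-8-BOM")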
2.2.2 PowerShell with embedded C#
You can overcome both of these limitations by using C# within PowerShell to read a utf-16 encoded file line by line, and then write out a utf-8 file:
$code = @"
using System;
using System.IO;
namespace ProcessLargeFile
{
public class Program
{
static void ConvertLine(string line, StreamWriter sw)
{
sw.WriteLine(line);
}
public static void ConvertFile(string path, string inDir, string outDir) {
StreamReader sr = new StreamReader(File.Open(path, FileMode.Open));
string outPath = path.Replace(inDir, outDir);
Console.WriteLine(outPath);
StreamWriter sw = new StreamWriter(File.Open(outPath, System.IO.FileMode.Append));
try {
while (!sr.EndOfStream){
string line = sr.ReadLine();
ConvertLine(line, sw);
}
} finally {
sr.Close();
sw.Close();
}
}
static void ConvertDir(string inDir, string outDir) {
string[] filePaths = Directory.GetFiles(inDir);
Directory.CreateDirectory(outDir);
foreach(string file in filePaths)
{
ConvertFile(file, inDir, outDir);
}
}
public static void Main(string[] args){
string inDir = args[0];
string outDir = args[1];
ConvertDir(inDir, outDir);
}
}
}
"@
Add-Type -TypeDefinition $code -Language CSharp
[ProcessLargeFile.Program]::Main(@("utf16dir/", "utf8dir/"))
Again this copies the content of "utf16dir/" to "utf8dir/". You can change the input and output directories by changing the arguments in the final line. This approach streams the files and writes out pure utf-8 (with no BOM).
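If you would rather stay in R and still avoid holding a whole file in RAM, the same streaming idea can be sketched in base R by reading and writing a chunk of lines at a time (the 10,000-line chunk size is an arbitrary choice):

# Stream a utf-16 file out as utf-8, one chunk of lines at a time
stream_convert_to_utf8 <- function(in_file, out_file, chunk = 10000L) {
  in_conn <- file(in_file, encoding = "utf-16", open = "r")
  out_conn <- file(out_file, encoding = "utf-8", open = "w")
  on.exit({ close(in_conn); close(out_conn) })
  repeat {
    lines <- readLines(in_conn, n = chunk)
    if (length(lines) == 0) break
    writeLines(lines, out_conn)
  }
}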
3. base R, data.table and tidyverse methods to read a utf-16 file
In your question you state you wish to change the encoding in order to make it easier for R to read the files. R is able to read utf-16 files, provided you set the encoding argument when you create the file connection with file().
I'll set out here how to read a utf-16 csv file with base R and popular alternatives. Assume for example that you are trying to read the following file:
in_file <- "./utf16dir/mtcars.csv"
base R
in_file_conn <- file(in_file, encoding = "utf-16")
read.csv(text = readLines(in_file_conn))
data.table
in_file_conn <- file(in_file, encoding = "utf-16")
data.table::fread(
  text = readLines(in_file_conn)
)
readr
readr::read_csv(
  in_file,
  locale = readr::locale(encoding = "utf-16")
)
Depending on your ultimate goal, rather than copying all the files in the directory, you may simply wish to read in the utf-16 encoded files.
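For instance, to read every utf-16 csv in the directory into a named list of data frames, without making any copies at all, here is a sketch built on the base R pattern above:

# Read each utf-16 csv through an encoded connection; no converted copies needed
read_utf16_csv <- function(path) {
  conn <- file(path, encoding = "utf-16")
  on.exit(close(conn))
  read.csv(text = readLines(conn))
}
files <- dir("./utf16dir", full.names = TRUE)
dfs <- setNames(lapply(files, read_utf16_csv), basename(files))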
