
I have an obscure TSV that I'm trying to read. Apparently it starts with an identifier and has NUL values embedded (it seems there is one NUL after each genuine character). These are the first 100 bytes of the file (shortened with a hex editor): test_file.txt (I had to rename it to txt in order to upload it, but it is a tsv file).
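For reference, the dump can be reproduced in R with readBin; the exact bytes shown below are inferred from the fread error further down ('ÿ' = ff, 'þ' = fe, then 'y', 'e', 'a', 'r' each followed by a NUL):

# dump the leading bytes: two identifier bytes (ff fe), then one NUL (00)
# after each genuine character
readBin('test_file.txt', raw(), n = 10)
#> [1] ff fe 79 00 65 00 61 00 72 00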

Unfortunately, I am not able to read it with the base functions, nor with readr or data.table.

Here is the reprex:

file <- 'test_file.txt'

# read.table is not able to read the file since there are embedded NULs
tmp <- read.table(file, header = T, nrows = 2)
#> Warning in read.table(file, header = T, nrows = 2): line 1 appears to
#> contain embedded nulls
#> Warning in read.table(file, header = T, nrows = 2): line 2 appears to
#> contain embedded nulls
#> Warning in read.table(file, header = T, nrows = 2): line 3 appears to
#> contain embedded nulls
#> Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
#> dec, : embedded nul(s) found in input

# unfortunately the skipNul argument also doesn't work
tmp <- read.table(file, header = T, nrows = 2, skipNul = T)
#> Error in read.table(file, header = T, nrows = 2, skipNul = T): more columns than column names

# read_tsv from readr is also not able to read the file (probably since it stops each line after a NUL)
tmp <- readr::read_tsv(file, n_max = 2)
#> Warning: Duplicated column names deduplicated: '' => '_1' [3], '' =>
#> '_2' [4], '' => '_3' [5], '' => '_4' [6], '' => '_5' [7], '' => '_6' [8],
#> '' => '_7' [9], '' => '_8' [10], '' => '_9' [11], '' => '_10' [12], '' =>
#> '_11' [13]
#> Parsed with column specification:
#> cols(
#>   y = col_character(),
#>   col_character(),
#>   `_1` = col_character(),
#>   `_2` = col_character(),
#>   `_3` = col_character(),
#>   `_4` = col_character(),
#>   `_5` = col_character(),
#>   `_6` = col_character(),
#>   `_7` = col_character(),
#>   `_8` = col_character(),
#>   `_9` = col_character(),
#>   `_10` = col_character(),
#>   `_11` = col_character()
#> )
#> Error in read_tokens_(data, tokenizer, col_specs, col_names, locale_, : Column 2 must be named

# fread from data.table is also not able to read the file (although its error shows the problem most clearly)
tmp <- data.table::fread(file, nrows = 2)
#> Error in data.table::fread(file, nrows = 2): embedded nul in string: 'ÿþy\0e\0a\0r\0'

# readLines reads the two file-identifier bytes, which render as 'ÿþ', plus the first actual character 'y'
readLines(file, n = 1)
#> Warning in readLines(file, n = 1): line 1 appears to contain an embedded
#> nul
#> [1] "ÿþy"

# the problem is the hidden NUL characters, as the following command shows
readLines(file, n = 1, skipNul = T)
#> [1] "ÿþyear\tmonth\tday\tDateTime\tAreaTypeCode\tAreaName\tMapCode\tPowerSystemResourceName\tProductionTypeName\tActualGenerationOutput\tActualConsumption\tInstalledGenCapacity\tSubmissionTS"

Is there a work-around that allows me to read this file? Preferably not with a base function, since those are incredibly slow and I have to read multiple (>20) files of over 300 MB each.

  • Can you upload your file as a [gist](https://gist.github.com/)? – MichaelChirico Jan 09 '18 at 16:06
  • You mean other than the link to the txt and the code included in the issue? – takje Jan 09 '18 at 16:12
  • Yes, I'm not sure uploading to GH didn't alter the raw contents of the file. I'm more confident copy-pasting to Gist wouldn't have this effect (as in particular, I'm getting a different error than you are) – MichaelChirico Jan 09 '18 at 16:13
  • @MichaelChirico As what should I copy it? The hex dump? Because as soon as I copy parts of the file from a text editor, it seems to change the encoding which solves the problem. – takje Jan 09 '18 at 16:28
  • @MichaelChirico As another test, I've opened the uploaded file and it seems to be the same as the original file (when opened with the hex editor). – takje Jan 09 '18 at 16:29
  • linking the [GH issue for `data.table`](https://github.com/Rdatatable/data.table/issues/2560); I agree it's a bug – MichaelChirico Jan 09 '18 at 16:34
  • Consider using the approach mentioned here as a holdover: https://stackoverflow.com/questions/34214859/removing-nul-characters-within-r – MichaelChirico Jan 09 '18 at 16:34

1 Answer


The current work-around is described in Removing "NUL" characters (within R).

This answer relies heavily on that one. I've added a few more comments, adapted the example to also handle the two header bytes, and added the use of fread (data.table) and read_tsv (readr) to make the final step to a data frame.

file <- 'test_file.txt'

# read the file as raw bytes and drop the first two identifier bytes (ÿþ)
file_size <- file.info(file)$size
data_raw <- readBin(file, raw(), file_size)[-(1:2)]

# replace the NUL bytes with an uncommon byte value (0x01 here) so that it
# can be filtered out later. Please check out this list for more uncommon
# characters: http://www.fileformat.info/info/charset/UTF-8/list.htm
data_raw[data_raw == as.raw(0)] <- as.raw(1)

# convert to a string and remove the replacement character ("\001" in our case)
data_string <- gsub("\001", "", rawToChar(data_raw), fixed = TRUE)

# convert the resulting string to a data frame with a function of your choice
data_tmp1 <- data.table::fread(data_string, header = T) # quickest
data_tmp2 <- readr::read_tsv(data_string) # slower and does not handle the UTF-8 characters well
data_tmp3 <- read.table(text = data_string, header = T, sep = "\t") # crashed R for my files (probably due to size)

This approach has been tested on relatively large files (350 MB). The first step (readBin) is the slowest, but it takes only ~30 s for a file of this size. Speed will probably also depend on your hard drive.
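If you want to verify the timing on your own hardware, system.time gives a quick measure (a minimal sketch; it simply re-runs the readBin step from above):

# time the raw read; 'elapsed' is the figure that dominated for the 350 MB files
system.time(data_raw <- readBin(file, raw(), file.info(file)$size))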

For really large files there could be a memory issue as well: in contrast to fread's usual file-based parsing, the whole file is read into memory here before any processing happens.
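If memory does become the bottleneck, the NUL-stripping can also be done piecewise. The following is an untested sketch (strip_nuls_chunked is a hypothetical helper of my own, not part of any package): read the raw bytes in chunks, drop the NULs, write the clean bytes to a temporary file, and let fread parse that file from disk. Dropping every 0x00 byte independently means chunk boundaries don't matter for this kind of data.

# hypothetical helper: stream the file in chunks, drop all NUL bytes and
# write the cleaned bytes to a temporary file that fread can parse
strip_nuls_chunked <- function(path, chunk_size = 50e6) {
  out <- tempfile(fileext = ".tsv")
  con_in <- file(path, open = "rb")
  con_out <- file(out, open = "wb")
  on.exit({ close(con_in); close(con_out) })
  readBin(con_in, raw(), 2) # skip the two identifier bytes
  repeat {
    chunk <- readBin(con_in, raw(), chunk_size)
    if (length(chunk) == 0) break
    writeBin(chunk[chunk != as.raw(0)], con_out) # drop the NULs
  }
  out
}

data_tmp <- data.table::fread(strip_nuls_chunked(file), header = T)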

takje