I have an obscure TSV file that I'm trying to read. It apparently starts with an identifier and has NUL values embedded (it seems that there is one NUL after each genuine character). These are the first 100 bytes of the file (truncated with a hex editor): test_file.txt (I had to rename it to .txt in order to upload it, but it is a TSV file).
Unfortunately, I am not able to read it with the base functions, nor with readr or data.table.
Here is the reprex:
file <- 'test_file.txt'
# read.table is not able to read the file since there are embedded NULs
tmp <- read.table(file, header = T, nrows = 2)
#> Warning in read.table(file, header = T, nrows = 2): line 1 appears to
#> contain embedded nulls
#> Warning in read.table(file, header = T, nrows = 2): line 2 appears to
#> contain embedded nulls
#> Warning in read.table(file, header = T, nrows = 2): line 3 appears to
#> contain embedded nulls
#> Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
#> dec, : embedded nul(s) found in input
# unfortunately the skipNul argument also doesn't work
tmp <- read.table(file, header = T, nrows = 2, skipNul = T)
#> Error in read.table(file, header = T, nrows = 2, skipNul = T): more columns than column names
# read_tsv from readr is also not able to read the file (probably since it stops each line after a NUL)
tmp <- readr::read_tsv(file, n_max = 2)
#> Warning: Duplicated column names deduplicated: '' => '_1' [3], '' =>
#> '_2' [4], '' => '_3' [5], '' => '_4' [6], '' => '_5' [7], '' => '_6' [8],
#> '' => '_7' [9], '' => '_8' [10], '' => '_9' [11], '' => '_10' [12], '' =>
#> '_11' [13]
#> Parsed with column specification:
#> cols(
#> y = col_character(),
#> col_character(),
#> `_1` = col_character(),
#> `_2` = col_character(),
#> `_3` = col_character(),
#> `_4` = col_character(),
#> `_5` = col_character(),
#> `_6` = col_character(),
#> `_7` = col_character(),
#> `_8` = col_character(),
#> `_9` = col_character(),
#> `_10` = col_character(),
#> `_11` = col_character()
#> )
#> Error in read_tokens_(data, tokenizer, col_specs, col_names, locale_, : Column 2 must be named
# fread from data.table is also not able to read the file (although it is the first function that shows the problem more clearly)
tmp <- data.table::fread(file, nrows = 2)
#> Error in data.table::fread(file, nrows = 2): embedded nul in string: 'ÿþy\0e\0a\0r\0'
# readLines reads the file identifier bytes (displayed as 'ÿþ', i.e. 0xFF 0xFE when shown as Latin-1) and the first actual character 'y', then stops at the first NUL
readLines(file, n = 1)
#> Warning in readLines(file, n = 1): line 1 appears to contain an embedded
#> nul
#> [1] "ÿþy"
# the problem is in the hidden NUL characters as the following command shows
readLines(file, n = 1, skipNul = T)
#> [1] "ÿþyear\tmonth\tday\tDateTime\tAreaTypeCode\tAreaName\tMapCode\tPowerSystemResourceName\tProductionTypeName\tActualGenerationOutput\tActualConsumption\tInstalledGenCapacity\tSubmissionTS"
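For reference, here is a minimal sketch of how the leading bytes can be inspected without any text decoding. It does not use the uploaded file: `demo_file` is a hypothetical stand-in written from the ten bytes that the fread error message shows ('ÿþy\0e\0a\0r\0'). The pattern (0xFF 0xFE followed by a NUL after every character) is what a UTF-16 little-endian file with a byte order mark looks like, although that is an inference from these few bytes and not something I have confirmed for the whole file.

```r
# Hypothetical stand-in for test_file.txt, built from the bytes in the
# fread error message: 'ÿþy\0e\0a\0r\0'
demo_file <- tempfile(fileext = ".txt")
writeBin(
  as.raw(c(0xff, 0xfe, 0x79, 0x00, 0x65, 0x00, 0x61, 0x00, 0x72, 0x00)),
  demo_file
)

# Read the first bytes as raw values, so no encoding is applied
first_bytes <- readBin(demo_file, what = "raw", n = 10)
print(first_bytes)

# 0xFF 0xFE at the start is the UTF-16 little-endian byte order mark
has_utf16le_bom <- identical(first_bytes[1:2], as.raw(c(0xff, 0xfe)))

# After the BOM, every second byte is NUL (the high byte of each ASCII character)
every_second_byte_nul <- all(first_bytes[seq(4, 10, by = 2)] == as.raw(0))

print(has_utf16le_bom)
print(every_second_byte_nul)
```

If the same check holds on the real file, the NULs are probably part of the encoding and not corruption, which would explain why skipping them only partly helps.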
Is there a workaround that allows me to read this file? Preferably one that does not rely on base functions, since they are incredibly slow and I have to read multiple files (>20) of over 300 MB.