1

I have a huge (39 GB) text file that I eventually need to read into R as a pipe-delimited file. However, there are lots of NUL characters \0 that do not read in R. I'm trying to replace them in PowerShell beforehand.

PowerShell code:

Get-Content file.txt | foreach { $_ -replace '\\0' } | Out-File -Encoding UTF8 file_NEW.txt

I thought this worked but when I try to read the new file in R, \0 characters appear in the string and I get this error:

Error in vroom_(file, delim = delim %||% col_types$delim, col_names = col_names, : 
  embedded nul in string: '||MORALES BELINDA F TRUST||\0||\0|0||PT||||33.824049|-118.192053||3655||N|WESTON|PL|||LONG BEACH|CA|908073855|C033||3655||N|WESTON|PL|||LONG BEACH|CA|908073855|C033|20111117|988||||||||20111027|20111110|TR|2|1527575||KINECTA FCU|KINECTA FCU|||MANHATTAN BEACH|CA|90266|047978|LAWYERS TITLE|03003|232000.00|20111027||CNV|TR|Y|10|20211201|||D|BGJT|V||115|21||0|0||0|||||Y|Y||\r\n06037|5054001029|5054-001-029|1|\0||BONILLA ...

Why are there still NULs in the file? ANY help appreciated! Especially because these functions take so long to run. Please, I'm just trying to read this huge file.

Just in case there is an error in the R code, note it is taken directly from this post using the vroom and arrow packages to read then create parquet files.

r2evans
  • 141,215
  • 6
  • 77
  • 149
Kate Ham
  • 143
  • 6

3 Answers3

3

The reason is that doubling the backslash, \\, means that the backspace is escaped. Instead of looking at NUL (0x00), you are looking literally \0 - two characters.

The correct syntax would be like so,

-replace '\0'

That being said, processing a large file can be done smarter a way. A fast way would be to process, say, 10 000 lines a time. See earlier an answer about how to process a file in blocks.

vonPryz
  • 22,996
  • 7
  • 54
  • 65
1

I don't know powershell enough to fix that, but you can use sed or tr to replace the nuls in the files. The tr and sed utilities are available by default on most (all?) unix-like OSes including macos. For windows, they are included in Rtools35 and Rtools40.

If you do not find it with Sys.which("tr"), then you may need to include the full path to the respective utility. Assuming Rtools is installed on the root c:/, then something like

  • Rtools35: c:/Rtools/bin/tr.exe
  • Rtools40: c:/Rtools40/usr/bin/tr.exe

They are also included in Git-for-Windows as /usr/bin/tr.exe and /usr/bin/sed.exe within git-bash. (On the file-system, they are likely under c:/Program Files/Git/usr/bin/.)

(Same locations for sed.)

I should note that I'm doing this through R's system2 as a convenience only. If you're comfortable enough on the bash command line, then this is just as easy to perform there instead.

data generation

I don't know where the nuls are in your file, so I'll assume that they are record (line) terminators. That is, in most files you'll see each line ending with \n or \r\n, but for this example I'll replace the \n with \0 (nul).

charToRaw("a|b\nhello|world")
#  [1] 61 7c 62 0a 68 65 6c 6c 6f 7c 77 6f 72 6c 64
ch <- charToRaw("a|b\nhello|world")
ch[ch == charToRaw("\n")] <- as.raw(0)
ch
#  [1] 61 7c 62 00 68 65 6c 6c 6f 7c 77 6f 72 6c 64
writeBin(ch, "raw.txt")
readLines("raw.txt")
# Warning in readLines("raw.txt") :
#   line 1 appears to contain an embedded nul
# Warning in readLines("raw.txt") :
#   incomplete final line found on 'raw.txt'
# [1] "a|b"

The nul is a problem (as intended), so we don't see anything after the embedded nul.

tr

tr doesn't like doing things in place, so this takes as input the original file and generates a new file. If file-size and disk space is a concern, then perhaps sed would be preferred.

system2("tr", c("\\0", "\\n"), stdin = "raw.txt", stdout = "raw2.txt")
readLines("raw2.txt")
# Warning in readLines("raw2.txt") :
#   incomplete final line found on 'raw2.txt'
# [1] "a|b"         "hello|world"

(That warning is safe to ignore here.)

sed

sed can optionally work in-place with the -i argument. (Without it, it can operate the same as tr: generate a new file based on the original.)

system2("sed", c("-i", "s/\\x0/\\n/g", "raw.txt"))
readLines("raw.txt")
# Warning in readLines("raw.txt") :
#   incomplete final line found on 'raw.txt'
# [1] "a|b"         "hello|world"

(That warning is safe to ignore here.)

other than record-terminator

If the nul is not the record terminator (\n-like) character, than you have some options:

  1. Replace the \0 character with something meaning, such as Z (stupid, but you get the point). This should use the above commands as-is, replacing the \\n with your character of choice. (tr will require a single-character, sed can replace it with multiple characters if you like.)

  2. Delete the \0 completely, in which case you can use tr -d '\0' and sed -i -e 's/\x0//g' (translated into R's system2 calls above).

r2evans
  • 141,215
  • 6
  • 77
  • 149
1
  • vonPryz' helpful answer shows the immediate problem with your -replace operation: in a single-quoted PowerShell string ('...'), \ chars. do not need escaping as \\ in order to be passed verbatim to the .NET regex engine that the -replace operator uses behind the scenes.

    • Thus, '\0', when passed to the .NET regex engine from PowerShell, is sufficient to represent a NUL character (a Unicode character whose code point is 0); alternatively, you could use "`0", a double-quoted, expandable PowerShell string, in which ` serves as the escape character.
  • r2evans' helpful answer shows an alternative solution via the Windows ports of standard Unix utilities that come with the optional Rtools download, where piping the input file to
    tr -d '\0' may offer the fastest solution, if both the input and the output file use the same character encoding.

In the realm of PowerShell, using Get-Content with its default line-by-line processing with such a large input file would take too long in practice.

While direct use of .NET APIs may offer the ultimately fastest solution, using Get-Content's
-ReadCount parameter offers a simpler, more PowerShell-idiomatic solution:

Get-Content -ReadCount 1e6 file.txt | foreach { $_ -replace '\0' } | 
  Out-File -Encoding UTF8 file_NEW.txt

-ReadCount 1e6 reads 1 million lines (1e6 uses exponential notation, i.e. 10 to the power of 6) at once and passes them as an array to the ForEach-Object cmdlet (one of whose built-in aliases is foreach); since the -replace operator is capable of operating on an array of values as its LHS, the NUL substitution can be performed on all elements of the array at once.

Depending on how many bytes make up the average line in your input file, you can adjust this number upward, if you have more memory available, or downward, if you have less. The higher the number you can use, the faster the command will complete.

mklement0
  • 382,024
  • 64
  • 607
  • 775