
I have to replace all occurrences of \ with \\ in a huge JSON Lines file. I wanted to use PowerShell, but there might be other options too. The source file is about 4,000,000 lines and about 6 GB.

The PowerShell script I was using took too much time; I let it run for 2 hours and it wasn't done yet. A runtime of about half an hour would be acceptable.

$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file2.jsonl"

(Get-Content -Path $Importfile) -replace "[\\]", "\\" | Set-Content -Path $Exportfile

3 Answers


If the replacement is simply a conversion of a single backslash to a double backslash, the file can be processed line by line.

Using a StringBuilder collects the data in a memory buffer, which is flushed to disk every now and then. Like so:

$src = "c:\path\MyBigFile.json"
$dst = "c:\path\MyOtherFile.json"
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText($src)
$i = 0
$MaxRows = 10000

while($null -ne ($line = $reader.ReadLine())) {
    # Escape every backslash on the current line
    $line = $line.Replace('\', '\\')

    [void]$sb.AppendLine($line)
    ++$i

    # Write the builder contents to the file every now and then
    if($i -ge $MaxRows) {
        Add-Content $dst $sb.ToString() -NoNewline
        [void]$sb.Clear()
        $i = 0
    }
}
# Flush the builder after the while loop if there's data
if($sb.Length -gt 0) {
    Add-Content $dst $sb.ToString() -NoNewline
}
$reader.Close()
vonPryz
  • It is extremely fast, only it puts an empty line at the end of the StringBuilder, resulting in an empty line in the file every 10,000th (MaxRows) line. I tried solving it but didn't succeed, other than removing them once the CSV file is ready. Do you have a way to remove the empty lines efficiently? – Martijn Olthuis May 07 '20 at 10:15
  • @MartijnOlthuis There's now a `-NoNewline` parameter that should take care of the extra newline issue. – vonPryz May 07 '20 at 10:27
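
For reference, the extra newline issue can also be sidestepped entirely by writing through a .NET StreamWriter instead of Add-Content, since the writer does its own internal buffering. A minimal sketch of that variation (my own addition, not part of the answer above; same example paths assumed):

$src = "c:\path\MyBigFile.json"
$dst = "c:\path\MyOtherFile.json"
$reader = [IO.File]::OpenText($src)
# StreamWriter keeps its own write buffer, so no StringBuilder or manual flushing is needed
$writer = New-Object IO.StreamWriter $dst
while($null -ne ($line = $reader.ReadLine())) {
    # Escape every backslash and write the line with a single trailing newline
    $writer.WriteLine($line.Replace('\', '\\'))
}
$writer.Close()
$reader.Close()

The StreamWriter defaults to UTF-8 output; a different encoding can be passed to its constructor if needed.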

Use the -ReadCount parameter of the Get-Content cmdlet (and set it to 0).

-ReadCount

Specifies how many lines of content are sent through the pipeline at a time. The default value is 1. A value of 0 (zero) sends all of the content at one time.

This parameter does not change the content displayed, but it does affect the time it takes to display the content. As the value of ReadCount increases, the time it takes to return the first line increases, but the total time for the operation decreases. This can make a perceptible difference in large items.

Example (runs about 17× faster for a file of about 20 MB):

$file = 'D:\bat\files\FileTreeLista.txt'
(Measure-Command { 
    $xType = (Get-Content -Path $file             ) -replace "[\\]", "\\"
}).TotalSeconds, $xType.Count -join ', '
(Measure-Command {
    $yType = (Get-Content -Path $file -ReadCount 0) -replace "[\\]", "\\"
}).TotalSeconds, $yType.Count -join ', '
Get-Item $file | Select-Object FullName, Length
13,3288848, 338070
 0,7557814, 338070

FullName                         Length
--------                         ------
D:\bat\files\FileTreeLista.txt 20723656
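
Applied to the file from the question, the same approach would look roughly like this (a sketch only; note that -ReadCount 0 sends the whole file through the pipeline at once, so all ~6 GB of content must fit in memory):

$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file2.jsonl"
# -ReadCount 0 returns the content as one array instead of emitting lines one at a time
(Get-Content -Path $Importfile -ReadCount 0) -replace "[\\]", "\\" |
    Set-Content -Path $Exportfile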
JosefZ

Based on your earlier question How can I optimize this Powershell script, converting JSON to CSV?, you should try to use the PowerShell pipeline for this, especially as it concerns large input and output files.
The point is that you shouldn't judge performance by looking at single parts of the solution, because that usually gives the wrong impression; the performance of a complete (PowerShell) pipeline solution is supposed to be better than the sum of its parts. Besides, it saves a lot of memory and the result is lean PowerShell syntax...
In your specific case, if correctly set up, the CPU will be replacing the slashes, rebuilding the JSON strings and converting them to objects while the hard disk is busy reading and writing the data...

To incorporate the replacement of the slashes into the PowerShell pipeline together with the ConvertFrom-JsonLines cmdlet:

Get-Content .\file.jsonl | ForEach-Object { $_.replace('\', '\\') } |
ConvertFrom-JsonLines | ForEach-Object { $_.events.items } |
Export-Csv -Path $Exportfile -NoTypeInformation -Encoding UTF8
iRon