
I have a large text file, World.net (a Pajek file, but treat it as plain text), with content:

*Vertices 999999
    1 ""                                       0.2931    0.2107    0.5000 empty
    2 ""                                       0.2975    0.2214    0.5000
    3 ""                                       0.3083    0.2258    0.5000
    4 ""                                       0.3127    0.2406    0.5000
    5 ""                                       0.3083    0.2514    0.5000
    6 ""                                       0.3147    0.2578    0.5000
...
    999999 ""                                       0.3103    0.2622    0.5000
*Edges :2 "World contours"
    1     2 1 
    2     3 1 
    3     4 1 
    4     5 1 
    5     6 1 
    6     7 1 
...
    983725     8 1 

I would like to split it into different .txt files, at the lines that start with

*[Something]

The [Something] should go into the file name, e.g. World_Vertices.txt and World_Edges.txt.

Each output file should contain the lines (1, 2, 3, ...) that follow its category marker (Vertices, Edges) in the original file, without the marker line itself (the one starting with *).

I have code that (kind of) works:

$filename = "World"
echo "$pwd\$filename.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
while (($line = $file.ReadLine()) -ne $null) {
    If ($line -match "^\*\w+") {
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
    }
    Else {
        $line | Out-File -Append $newfile
    }
}

But this code is very slow: it takes 20 minutes on a 10 MB file, and I need to be able to process a 4 GB file.

Hardware notes: the machine is decent: an i7 with a hybrid disk and 16 GB RAM, and I can install whichever .NET Framework version is needed to do the job.

EDIT 1: Final code. Fixing a few bugs in the accepted answer, here is the final code I used (it may be helpful for anyone who wants to edit large Pajek files):

$filename = "World.net"
$base = [System.IO.Path]::GetFileNameWithoutExtension($filename)
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename"
$writer = $null
$n = 0
while (($line = $file.ReadLine()) -ne $null) {
    If ($line.StartsWith("*")) {
        $n = 1
        # Use only the section name (e.g. "Vertices") in the output file name
        $newfile = -join($base, "_", $line.Substring(1).Split(' ')[0], ".txt")
        echo $newfile
        if ($null -ne $writer) {
            $writer.Dispose()
        }
        $writer = New-Object System.IO.StreamWriter "$pwd\$newfile"
    }
    Else {
        # Write the line separator before every line except the first,
        # so the output files end without a trailing empty line
        If ($n -eq 0) {
            $writer.WriteLine()
        }
        $writer.Write($line)
        $n = 0
    }
}
if ($null -ne $writer) {
    $writer.Dispose()
}
$file.Dispose()
Borislav Aymaliev
Roughly how many files will it be? `-join("$filename ","$($line.Split('\*')[1]).txt")` is costly on array creation and deletion, and so is the subexpression; `'World {0}.txt' -f $line.Trim('*')` might be faster, but it's only worth it if there are a lot of files. `echo $newfile` outputs to the pipeline, which I guess you don't intend; try `Write-Host $newfile` instead. Otherwise, marsze's answer looks like a big improvement, not creating a new pipeline and opening/closing a file for every single line. – TessellatingHeckler Sep 19 '17 at 07:42

2 Answers


In general, calling .NET classes directly from PowerShell is the best approach when performance matters, so using a StreamReader is already a good start.

I changed your code to use a StreamWriter for writing to the output files:

$filename = "World"
echo "$pwd\$filename.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
$writer = $null
while (($line = $file.ReadLine()) -ne $null) {
    If ($line -match "^\*\w+") {
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
        if ($null -ne $writer) {
            $writer.Dispose()
        }
        $writer = New-Object System.IO.StreamWriter "$pwd\$newfile"
    }
    Else {
        $writer.WriteLine($line)
    }
}
# Flush and close the last writer and the reader
if ($null -ne $writer) {
    $writer.Dispose()
}
$file.Dispose()

Try it.

There are other ways to further improve performance. For instance, you can skip the expensive regex check and use this instead:

if ($line.StartsWith("*"))
marsze

Writing in general incurs a lot of overhead, so keep each section's data in memory until the section is complete and then write the whole section at once:

$filename = "World"
echo "$pwd\$filename.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
$newfile = $null
$section = @()
while (($line = $file.ReadLine()) -ne $null) {
    If ($line -match "^\*\w+") {
        If ($newfile) {$section | Out-File $newfile}
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
        $section = @()
    }
    Else {
        $section += $line
    }
}
# The last section has no marker line after it, so write it explicitly
If ($newfile) {$section | Out-File $newfile}
$file.Dispose()
iRon
  • Thank you, but after testing it, I believe this script enters an infinite loop (not completely sure). – Borislav Aymaliev Sep 19 '17 at 10:14
  • Added `If ($newfile) {$section | Out-File $newfile}` at the end of the example, because after you finish reading the file you still need to write the last `$section`. – iRon Sep 19 '17 at 10:30
  • My example was just a programming direction based on what you provided, which I presumed was working but just slow. My example doesn't touch the `$file` object, so it shouldn't affect the `while` condition. Anyway, I don't have the actual data and no clue how big a `$section` could actually grow... – iRon Sep 19 '17 at 11:01