101

I need to split a large (500 MB) text file (a log4net exception file) into manageable chunks - something like 100 files of 5 MB each would be fine.

I would think this should be a walk in the park for PowerShell. How can I do it?

Peter Mortensen
Ralph Shillington

16 Answers

87

A word of warning about some of the existing answers - they run very slowly for very big files. For a 1.6 GB log file I gave up after a couple of hours, realising it would not finish before I returned to work the next day.

Two issues: the call to Add-Content opens, seeks and then closes the current destination file for every line in the source file. Reading a little of the source file each time and looking for the new lines also slows things down, but my guess is that Add-Content is the main culprit.

The following variant produces slightly less pleasant output: it will split files in the middle of lines, but it splits my 1.6 GB log in less than a minute:

$from = "C:\temp\large_log.txt"
$rootName = "C:\temp\large_log_chunk"
$ext = "txt"
$upperBound = 100MB


$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = $idx = 0
try {
    do {
        "Reading $upperBound"
        $count = $fromFile.Read($buff, 0, $buff.Length)
        if ($count -gt 0) {
            $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
            $toFile = [io.file]::OpenWrite($to)
            try {
                "Writing $count to $to"
                $tofile.Write($buff, 0, $count)
            } finally {
                $tofile.Close()
            }
        }
        $idx ++
    } while ($count -gt 0)
}
finally {
    $fromFile.Close()
}
Hugo Buff
Typhlosaurus
  • 6
    this approach worked well for me on a 6GB file that I needed to get split out in an emergency situation to more efficiently analyze in smaller chunks. thanks for posting! – xinunix Jun 15 '12 at 04:01
  • 10
    It took me a couple of run-throughs to figure out how this script really works. I made a Gist of it, in case anyone's interested: https://gist.github.com/awayken/5861923 – awayken Jun 25 '13 at 20:14
  • 2
    Is there any reason you didn't use `StreamReader`? So that you can split with new lines? – stej Nov 10 '14 at 07:49
  • 1
    @stej based on this answer I added streamreader version in my answer as I needed it. – Vincent De Smet Feb 10 '15 at 13:12
  • 1
    Thanks @stej, @VincentDeSmet: that's nice - no particular reason why I didn't use `StreamReader`. – Typhlosaurus May 04 '15 at 12:02
  • 3
    If you add these lines to the beginning of the script to define the variables and modify them to suit the file you are trying to split, you'll be all set! $from = "C:\temp\large_log.txt" $rootName = "C:\temp\large_log_chunk" $ext = "txt" – Yves Rochon Dec 04 '15 at 13:59
  • NOTE what the poster said - it splits midline! Used this (teach me not to read all of it), then caught that. (That the files were all the exact same size should've been a hint) – mbourgon Jul 06 '20 at 18:52
  • Does this code successfully create multiple files? – kiran_ray Apr 02 '21 at 09:18
  • This worked perfectly for me. I changed the limit to 50MB, and it split my 900 MB file to multiple 50 MB files in a matter of seconds. Thank you! – Apolymoxic Sep 04 '21 at 02:57
  • My old way was well into 30 mins when I saw this. Less than a second! – darkstar3d Nov 17 '21 at 23:40
86

Simple one-liner to split based on number of lines (100 in this case):

$i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt}
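
If you need the chunk numbers zero-padded so the output files sort correctly, here is a hedged variation of the same one-liner (the input file name is a placeholder):

$i=0; Get-Content .\big.log -ReadCount 100 | %{$i++; $_ | Out-File ("out_{0:d4}.txt" -f $i)}
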
Ivan
59

This is a somewhat easy task for PowerShell, complicated by the fact that the standard Get-Content cmdlet doesn't handle very large files too well. What I would suggest is to use the .NET StreamReader class to read the file line by line in your PowerShell script and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the filename. Something like this:

$upperBound = 50MB # calculated by PowerShell
$ext = "log"
$rootName = "log_"

$reader = new-object System.IO.StreamReader("C:\Exceptions.log")
$count = 1
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line
    if((Get-ChildItem -path $fileName).Length -ge $upperBound)
    {
        ++$count
        $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    }
}

$reader.Close()
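
A hedged sketch of one way to cut down the per-line overhead in the loop above - buffer lines and call Add-Content once per batch instead of once per line (the 5,000-line batch size and the other details here are assumptions, not part of the original answer):

$upperBound = 50MB
$ext = "log"
$rootName = "log_"
$batchSize = 5000   # assumed number of lines per Add-Content call

$reader = new-object System.IO.StreamReader("C:\Exceptions.log")
$batch = new-object System.Collections.Generic.List[string]
$count = 1
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
while(($line = $reader.ReadLine()) -ne $null)
{
    $batch.Add($line)
    if($batch.Count -ge $batchSize)
    {
        # one write and one size check per batch, not per line
        Add-Content -path $fileName -value $batch
        $batch.Clear()
        if((Get-Item $fileName).Length -ge $upperBound)
        {
            ++$count
            $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
        }
    }
}
# flush whatever is left in the buffer
if($batch.Count -gt 0) { Add-Content -path $fileName -value $batch }
$reader.Close()
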
thomasb
Lee
  • 1
    This is exactly what I was looking for, and thanks for confirming my hunch that get-content is not great with large files. – Ralph Shillington Jun 16 '09 at 19:53
  • 4
    Helpful tip: You can express numbers like this ... $upperBound = 5MB – Lee Jun 16 '09 at 20:02
  • 3
    For those too lazy to read the next answer, you can set the $reader object via $reader = new-object System.IO.StreamReader($inputFile) – lmsurprenant Jul 14 '11 at 12:27
  • 2
    I'd suggest using a stringbuilder to concatenate individual lines before calling add-content to write content otherwise this approach is very slow. – Richard Dorman Mar 27 '14 at 13:41
  • @CVertex you do realize that your script reads the entire file into memory first? So that will never work for a truly huge file (multiple GBs). – thekip Nov 16 '15 at 16:59
  • @thekip yes, I know, and I don't care. read my answer http://stackoverflow.com/a/27363742/209 – CVertex Nov 16 '15 at 22:10
  • Wouldn't it be better to keep a $lineCounter to count how many lines were written so you don't have to read the output file every iteration? – IMTheNachoMan Nov 18 '17 at 02:42
  • I have to assign value to $rootName, $ext, $upperBound, before I run this code right? – Jing He Dec 16 '17 at 16:48
  • For those of you (like me) who get caught out by the fact that .net objects do not maintain the same current working directory as powershell, you can use this to set the .net working directory to be the same as powershell (this will allow relative paths to work): [Environment]::CurrentDirectory = (Get-Location -PSProvider FileSystem).ProviderPath – Tom Ferguson Sep 05 '18 at 10:02
  • Way too slow to be even usable for a large file – MoonStom Apr 26 '22 at 15:38
50

Same as all the answers here, but using StreamReader/StreamWriter to split on new lines (line by line, instead of trying to read the whole file into memory at once). This is the fastest way I know of to split big files.

Note: I do very little error checking, so I can't guarantee it'll work smoothly for your case. It did for mine (a 1.7 GB TXT file of 4 million lines, split at 100,000 lines per file, in 95 seconds).

#split test
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$filename = "C:\Users\Vincent\Desktop\test.txt"
$rootName = "C:\Users\Vincent\Desktop\result"
$ext = ".txt"

$linesperFile = 100000#100k
$filecount = 1
$reader = $null
try{
    $reader = [io.file]::OpenText($filename)
    try{
        "Creating file number $filecount"
        $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
        $filecount++
        $linecount = 0

        while($reader.EndOfStream -ne $true) {
            "Reading $linesperFile"
            while( ($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)){
                $writer.WriteLine($reader.ReadLine());
                $linecount++
            }

            if($reader.EndOfStream -ne $true) {
                "Closing file"
                $writer.Dispose();

                "Creating file number $filecount"
                $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
                $filecount++
                $linecount = 0
            }
        }
    } finally {
        $writer.Dispose();
    }
} finally {
    $reader.Dispose();
}
$sw.Stop()

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

Output splitting a 1.7 GB file:

...
Creating file number 45
Reading 100000
Closing file
Creating file number 46
Reading 100000
Closing file
Creating file number 47
Reading 100000
Closing file
Creating file number 48
Reading 100000
Split complete in  95.6308289 seconds
Vincent De Smet
  • 4,859
  • 2
  • 34
  • 41
  • 5
    For someone who would want to use the solution above and also have repeating headers, the one step you would need to do is add the code - $writer.WriteLine($header) after the comment - "Reading $linesperFile". $header would be the variable that you would need to declare with all the desired columns in the initial part of the code. Thanks @Vincent for the blazing fast solution – VKarthik Nov 17 '16 at 09:30
  • Using Measure-Object is probably better than stopwatch, but this is good. – Christopher Oct 12 '18 at 13:37
  • It took me 37 minutes to split a 10gb file. The solution this was derived from ran for 30 minutes before I cancelled it and hadn't yet succeeded in getting the file into memory, perhaps because I didn't have 10gb of memory available. – n8. May 12 '20 at 17:11
  • 1
    For @VKarthik's header solution, you can also automatically initialize the header from the first row of the file by putting `$header = $reader.ReadLine();` just after `$reader = [io.file]::OpenText($filename)` – Mark Sowul Jun 17 '21 at 01:18
  • 1
    By far the best solution!! It is fast and maintains the original encoding. The other solutions above read and re-write the contents, and they all break the language encoding. Awesome. Thanks so much! – BiGGA Apr 06 '22 at 08:49
17

I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3) and it does the trick.

##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Count
# (Or -c) The maximum number of lines in each file.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv 3000 -rc 1
#
#.LINK 
# Out-TempFile
##############################################################################
function Split-File {

    [CmdletBinding(DefaultParameterSetName='Path')]
    param(

        [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$Path,

        [Alias("PSPath")]
        [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$LiteralPath,

        [Alias('c')]
        [Parameter(Position=2,Mandatory=$true)]
        [Int32]$Count,

        [Alias('d')]
        [Parameter(Position=3)]
        [String]$Destination='.',

        [Alias('rc')]
        [Parameter()]
        [Int32]$RepeatCount

    )

    process {

        # yeah! the cmdlet supports wildcards
        if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
        elseif ($Path) { $ResolveArgs = @{Path=$Path} }

        Resolve-Path @ResolveArgs | %{

            $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
            $InputExt  = [IO.Path]::GetExtension($_)

            if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

            # get the input file in manageable chunks

            $Part = 1
            Get-Content $_ -ReadCount:$Count | %{

                # make an output filename with a suffix
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                # In the first iteration the header will be
                # copied to the output file as usual
                # on subsequent iterations we have to do it
                if ($RepeatCount -and $Part -gt 1) {
                    Set-Content $OutputFile $Header
                }

                # write this chunk to the output file
                Write-Host "Writing $OutputFile"
                Add-Content $OutputFile $_

                $Part += 1

            }

        }

    }

}
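
For reference, a hedged usage sketch of the cmdlet above (the file and folder names are made up, and the destination folder must already exist):

# Split bigfile.csv into 3000-line chunks in C:\chunks, repeating the
# one-line header at the top of every part file
Split-File -Path .\bigfile.csv -Count 3000 -Destination C:\chunks -RepeatCount 1
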
Josh
  • works nicely. Might want to turn count into a long when you want to have more lines per file. Also, this script runs out of memory if you write huge files. – Wouter Sep 23 '13 at 11:49
  • Very handy for splitting a simple single-column text-file of server names into multiples for batch processing. – Signal15 Aug 27 '14 at 16:19
  • @Josh I tried your approach, where I got this result: `Creating file number 1 Reading 500 Closing file Creating file number 2 Reading 500 Closing file Creating file number 3 .... Creating file number 13 Reading 500 Split complete in 3.419523 seconds` but I could not locate where the files got created? – John John Sep 27 '21 at 22:10
14

I found this question while trying to split multiple contacts in a single vCard VCF file into separate files. Here's what I did based on Lee's code. I had to look up how to create a new StreamReader object and changed null to $null.

$reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
$count = 1
$filename = "C:\Contacts\{0}.vcf" -f ($count) 

while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line

    if($line -eq "END:VCARD")
    {
        ++$count
        $filename = "C:\Contacts\{0}.vcf" -f ($count)
    }
}

$reader.Close()
user202448
9

Many of these answers were too slow for my source files. My source files were SQL files between 10 MB and 800 MB that needed to be split into files of roughly equal line counts.

I found some of the previous answers which use Add-Content to be quite slow. Waiting many hours for a split to finish wasn't uncommon.

I didn't try Typhlosaurus's answer, but it looks like it only splits by file size, not by line count.

The following has suited my purposes.

$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
Write-Host "Reading source file..."
$lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
$totalLines = $lines.Length

Write-Host "Total Lines :" $totalLines

$skip = 0
$count = 100000; # Number of lines per file

# File counter, with sort friendly name
$fileNumber = 1
$fileNumberString = $filenumber.ToString("000")

while ($skip -le $totalLines) {
    $upper = $skip + $count - 1
    if ($upper -gt ($lines.Length - 1)) {
        $upper = $lines.Length - 1
    }

    # Write the lines
    [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)])

    # Increment counters
    $skip += $count
    $fileNumber++
    $fileNumberString = $filenumber.ToString("000")
}

$sw.Stop()

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

For a 54 MB file, I get the output...

Reading source file...
Total Lines : 910030
Split complete in  1.7056578 seconds

I hope others looking for a simple, line-based splitting script with requirements like mine will find this useful.

CVertex
  • But this will consume a lot of memory. I'm trying to re-write using streamreader/writer – Vincent De Smet Feb 10 '15 at 11:31
  • see my answer below for a memory friendly, new line based split – Vincent De Smet May 04 '15 at 16:09
  • If it happens in a few seconds then I fail to see why memory is a concern. I waited 10 minutes for the "answer" solution to ultimately accomplish nothing while I implemented this solution and it was finished in a little over 5 seconds. – n8. Mar 24 '17 at 21:04
  • It is quite fast indeed, I had to split a 740Mb file, it took **19s** to run. The accepted solution ran for **73 (!) minutes** for the same file. This is definitely my choice. – Damian Vogel Nov 09 '18 at 17:24
3

There's also this quick (and somewhat dirty) one-liner:

$linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$I++; $linecount=0 } }

You can tweak the number of lines per batch by changing the hard-coded 3000 value.
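
A hedged variation with the batch size pulled out into a variable (file names are placeholders):

$batch=3000; $linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq $batch) {$i++; $linecount=0 } }
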

Peter Mortensen
zroiy
3

Do this:

FILE 1

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII

FILE 2

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII

FILE 3

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII

etc…
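
To avoid writing one command per chunk, here is a hedged sketch that loops the same Select -Skip/-First idea until the input runs out (the paths and the 5000-line chunk size mirror the commands above; the rest is an assumption):

$in = "C:\TEMP\DATA\split\splitme.txt"
$chunk = 5000
$lines = Get-Content $in                        # reads the whole file once
$parts = [math]::Ceiling($lines.Count / $chunk)
for ($i = 0; $i -lt $parts; $i++) {
    $lines |
        Select-Object -Skip ($i * $chunk) -First $chunk |
        Out-File ("C:\temp\file{0}.txt" -f ($i + 1)) -Encoding ASCII
}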

Shantanu Gupta
ecciethetechie
  • thanks i ended up using this... but don't forget to add -width for outfile or it might truncate your output at 80 chars... also this operates one line at a time ... is faster to use gc -readcount 1000 | select -first 5 ... this does 1000 lines at a time ... finally gc will read the whole file and select will ignore most of it ... a little faster to include the -totalcount param with gc to stop after certain number of lines ... can do -tail for end of file too – TCC Jun 04 '14 at 21:23
3

Sounds like a job for the UNIX command split:

split MyBigFile.csv

It just split my 55 GB CSV file into 21k chunks in less than 10 minutes.

It's not native to PowerShell, but it comes with, for instance, the Git for Windows package: https://git-scm.com/download/win

NicolasG
2

I've made a little modification to split files based on the size of each part.

##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Size
# (Or -s) The maximum size of each file. Size must be expressed in MB.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv -s 20 -rc 1
#
#.LINK 
# Out-TempFile
##############################################################################
function Split-File {

    [CmdletBinding(DefaultParameterSetName='Path')]
    param(

        [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$Path,

        [Alias("PSPath")]
        [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$LiteralPath,

        [Alias('s')]
        [Parameter(Position=2,Mandatory=$true)]
        [Int32]$Size,

        [Alias('d')]
        [Parameter(Position=3)]
        [String]$Destination='.',

        [Alias('rc')]
        [Parameter()]
        [Int32]$RepeatCount

    )

    process {

        # yeah! the cmdlet supports wildcards
        if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
        elseif ($Path) { $ResolveArgs = @{Path=$Path} }

        Resolve-Path @ResolveArgs | %{

            $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
            $InputExt  = [IO.Path]::GetExtension($_)

            if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

            # read the input file one line at a time and accumulate it in a buffer

            $Part = 1
            $buffer = ""
            Get-Content $_ -ReadCount:1 | %{

                # make an output filename with a suffix
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                # In the first iteration the header will be
                # copied to the output file as usual
                # on subsequent iterations we have to do it
                if ($RepeatCount -and $Part -gt 1) {
                    Set-Content $OutputFile $Header
                }

                # dump the buffer to the current output file once it exceeds the requested size
                if ($buffer.Length -gt ($Size * 1MB)) {
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $buffer
                    $Part += 1
                    $buffer = ""
                }

                # keep accumulating the current line in the buffer
                $buffer += $_ + "`r"
            }

            # flush whatever is still in the buffer to a final part file
            if ($buffer.Length -gt 0) {
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
                if ($RepeatCount -and $Part -gt 1) {
                    Set-Content $OutputFile $Header
                }
                Write-Host "Writing $OutputFile"
                Add-Content $OutputFile $buffer
            }
        }
    }
}
1

As line lengths can vary in logs, I thought it best to take a lines-per-file approach. The following code snippet processed a 4 million line log file in under 19 seconds (18.83.. seconds), splitting it into 500,000-line chunks:

$sourceFile = "c:\myfolder\mylargeTextyFile.csv"
$partNumber = 1
$batchSize = 500000
$pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"

[System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001)  # utf8 this one

$fs=New-Object System.IO.FileStream ($sourceFile,"OpenOrCreate", "Read", "ReadWrite",8,"None") 
$streamIn=New-Object System.IO.StreamReader($fs, $enc)
$streamout = new-object System.IO.StreamWriter $pathAndFilename

$line = $streamIn.readline()
$counter = 0
while ($line -ne $null)
{
    $streamout.writeline($line)
    $counter +=1
    if ($counter -eq $batchsize)
    {
        $partNumber+=1
        $counter =0
        $streamOut.close()
        $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
        $streamout = new-object System.IO.StreamWriter $pathAndFilename

    }
    $line = $streamIn.readline()
}
$streamin.close()
$streamout.close()

This can easily be turned into a function or script file with parameters to make it more versatile. It uses a StreamReader and a StreamWriter to achieve its speed and tiny memory footprint.
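
As a rough illustration of that last point, the same logic wrapped in a parameterised function might look something like this (the function name, parameter names and example paths are assumptions, not part of the original answer):

function Split-LogByLines {
    param(
        [Parameter(Mandatory=$true)][string]$SourceFile,
        [Parameter(Mandatory=$true)][string]$OutputPattern,   # e.g. "c:\myfolder\part {0}.csv"
        [int]$BatchSize = 500000
    )
    [System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001)   # UTF-8, as above
    $streamIn = new-object System.IO.StreamReader($SourceFile, $enc)
    $partNumber = 1
    $counter = 0
    $streamOut = new-object System.IO.StreamWriter ($OutputPattern -f $partNumber)

    $line = $streamIn.ReadLine()
    while ($line -ne $null)
    {
        $streamOut.WriteLine($line)
        $counter += 1
        if ($counter -eq $BatchSize)
        {
            # start a new part file every $BatchSize lines
            $partNumber += 1
            $counter = 0
            $streamOut.Close()
            $streamOut = new-object System.IO.StreamWriter ($OutputPattern -f $partNumber)
        }
        $line = $streamIn.ReadLine()
    }
    $streamIn.Close()
    $streamOut.Close()
}

# Example call (paths are made up):
Split-LogByLines -SourceFile "c:\myfolder\mylargeTextyFile.csv" -OutputPattern "c:\myfolder\part {0}.csv"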

GMasucci
0

My requirement was a bit different. I often work with comma-delimited and tab-delimited ASCII files where a single line is a single record of data. And they're really big, so I need to split them into manageable parts (whilst preserving the header row).

So, I reverted back to my classic VBScript method and bashed together a small .vbs script that can be run on any Windows computer (it gets executed automatically by the WScript.exe script host engine on Windows).

The benefit of this method is that it uses Text Streams, so the underlying data isn't loaded into memory (or, at least, not all at once). The result is that it's exceptionally fast and it doesn't really need much memory to run. The test file I just split using this script on my i7 was about 1 GB in size, had about 12 million lines of text and was split into 25 part files (each with about 500k lines) – the processing took about 2 minutes and it never used more than 3 MB of memory at any point.

The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF) as the Text Stream object uses the "ReadLine" function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.

Option Explicit

Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"   ' path of the file to split
Private Const REPEAT_HEADER_ROW = True                 ' repeat the first line in every part file
Private Const LINES_PER_PART = 500000                  ' maximum data lines per part file

Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart

sStart = Now()

sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1)
iLineCounter = 0
iOutputFile = 1

Set oFileSystem = CreateObject("Scripting.FileSystemObject")
Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)

If REPEAT_HEADER_ROW Then
    iLineCounter = 1
    sHeaderLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sHeaderLine)
End If

Do While Not oInputFile.AtEndOfStream
    sLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sLine)
    iLineCounter = iLineCounter + 1
    If iLineCounter Mod LINES_PER_PART = 0 Then
        iOutputFile = iOutputFile + 1
        Call oOutputFile.Close()
        Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
        If REPEAT_HEADER_ROW Then
            Call oOutputFile.WriteLine(sHeaderLine)
        End If
    End If
Loop

Call oInputFile.Close()
Call oOutputFile.Close()
Set oFileSystem = Nothing

Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())
Covenant
0

In case this may help: the following works perfectly for me.

The script checks a folder, parses all the CSV files and counts the number of lines per file. If a file contains more than 55,000 lines, the script splits it into sub-files of 50,000 lines and names them "_1", "_2", and so on. At the end of the script, the original file is renamed so it won't be loaded again.

foreach ($MyFile in $MyFolder)
{

    # Read parent CSV
    
    $InputFilename         = $MyFile
    $InputFile             = Get-Content $MyFile    
    $OutputFilenamePattern = "$MyFile"+"_"
    
    Write-Host ".........." 
    Write-Host ". File to process"  
    Write-Host ".........."         
    WRITE-HOST "$MyVar_file_Path"
    Write-Host "$InputFilename"
    Write-Host "$OutputFilenamePattern"
    Write-Host ".........." 
    
    $LineLimit = 50000

    # Initialize
    $line  = 0
    $i     = 0
    $file  = 0
    $start = 0

    $nb_lines = (Get-Content $MyFile).Length
    Write-Host ".........."         
    Write-Host "$nb_lines lines in the file"    
    Write-Host ".........." 

    if ($nb_lines -gt 55000) 
    {     
        # Loop all text lines
        while ($line -le $InputFile.Length) 
        {
            # Generate child CSVs
            if ($i -eq $LineLimit -Or $line -eq $InputFile.Length) 
            {
                $file++
                $Filename = "$OutputFilenamePattern$file.csv"
                # $InputFile[0] | Out-File $Filename -Force # Writes Header at the beginning of the line.
                If ($file -ne 1) {$InputFile[0] | Out-File $Filename -Force}
                $InputFile[$start..($line - 1)] | Out-File $Filename -Force -Append # Original line 19 with the addition of -Append so it doesn't overwrite the headers you just wrote.
                # $InputFile[$start..($line-1)] | Out-File $Filename -Force

                $start = $line;
                $i = 0
                Write-Host "$Filename"
            }

            # Increment counters
            $i++;
            $line++
        }

        $Source_name      = $MyVar_file_Path2 + "\" + $InputFilename
        $Destination_name = $MyVar_file_Path2 + "\" + "Splitted_" + $InputFilename

        Write-Host ".........." 
        Write-Host ". File to rename"   
        Write-Host ".........."         
        Write-Host "$Source_name"
        Write-Host "$Destination_name" 
        Write-Host ".........."             
    
        Rename-Item $Source_name -NewName $Destination_name     
    }       

    Write-Host "."
    Write-Host "."      

}
-1

Here is my solution to split a file called patch6.txt (about 32,000 lines) into separate files of 1000 lines each. It's not quick, but it does the job.

$infile = "D:\Malcolm\Test\patch6.txt"
$path = "D:\Malcolm\Test\"
$lineCount = 1
$fileCount = 1

foreach ($computername in get-content $infile)
{
    write $computername | out-file -Append "$path$fileCount.txt"
    $lineCount++

    if ($lineCount -eq 1000)
    {
        $fileCount++
        $lineCount = 1
    }
}
Bigdadda06
-1

I modified the answer from @Vincent De Smet with the comments from @VKarthik and @Mark Sowul (https://stackoverflow.com/a/28432606/22060286) in order to read, find and store a longer header (I'm not allowed to write comments, unfortunately).

This makes sense, for example, for splitting huge HTML files or non-standard CSV files where the header is longer than one line.

This is the complete script:

#split test
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$filename = "veryhuge_html_log.html"
$rootName = $filename + "_split_"
$ext = "html"
$headerend = "<body "

$linesperFile = 100000   # 100k
$filecount = 1
$reader = $null

try{
    $reader = [io.file]::OpenText($filename)
    
    while ($reader.EndOfStream -ne $true) {
        # keep each header line's line break so the header block is reproduced intact
        $header += $reader.ReadLine() + [Environment]::NewLine
        if ($header.Contains($headerend)) {
            "found the header end '$headerend'"
            break
        }
    }
        
    try{
        "Creating file number $filecount"
        $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
        $filecount++
        $linecount = 0

        while($reader.EndOfStream -ne $true) {
            "Reading $linesperFile"
            $writer.Write($header)
            "Wrote header"
            while( ($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)){
                $writer.WriteLine($reader.ReadLine());
                $linecount++
            }

            if($reader.EndOfStream -ne $true) {
                "Closing file"
                $writer.Dispose();

                "Creating file number $filecount"
                $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
                $filecount++
                $linecount = 0
            }
        }
    } finally {
        $writer.Dispose();
    }
} finally {
    $reader.Dispose();
}
$sw.Stop()

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

This example produces invalid HTML, of course, because the footer is missing, but that doesn't bother any browser.

Rozz Bob