
I have huge files, ~3 GB each. These files have an information section at the top as well as at the bottom, and the number of information lines differs from file to file, e.g.:

infostart1
infostart2
START-OF-DATA
line1
line2
...
...
...
linen
END-OF-DATA
infoend1
infoend2

I am trying to create a .dat file that contains only the lines between START-OF-DATA and END-OF-DATA. I find the boundary line numbers with:

# 1-based line numbers of the first START-OF-DATA and the last END-OF-DATA match
$DataStartLineNumber = (Select-String $File -Pattern 'START-OF-DATA' | Select-Object -ExpandProperty 'LineNumber')[0]
$DataEndLineNumber = (Select-String $File -Pattern 'END-OF-DATA' | Select-Object -ExpandProperty 'LineNumber')[-1]

I have tried:

Get-Content -Path $File | Select-Object -Index ($DataStartLineNumber..($DataEndLineNumber-2)) | Add-Content $Destination

but Get-Content fails due to memory usage.

I have also tried:

Get-Content -Path $File -ReadCount 10000 | Select-Object -Index ($DataStartLineNumber..$DataEndLineNumber) | Add-Content $Destination

However, this does not work as expected: with -ReadCount, Get-Content sends arrays of 10000 lines down the pipeline, so Select-Object -Index selects whole chunks rather than individual lines.

I don't want to read line by line since it takes too long. Is there any way to read chunks of data from the file and filter out everything that comes before 'START-OF-DATA' and after 'END-OF-DATA'? Or to copy the file as-is and then efficiently delete everything before 'START-OF-DATA' and after 'END-OF-DATA'?

yasemin
  • http://stackoverflow.com/questions/4192072/how-to-process-a-file-in-powershell-line-by-line-as-a-stream and http://stackoverflow.com/questions/32336756/alternative-to-get-content – Matt Feb 22 '17 at 17:23
  • Get-Content sucks with large files. Stream reader would be the way to go here. Run a couple of flags/bools so you know when to start and stop processing lines in your file. – Matt Feb 22 '17 at 17:24
  • Thank you Matt, I will look into it, I hope I can find an efficient way. – yasemin Feb 22 '17 at 17:26
  • Thank you Matt, this helped a lot, and it is really fast. – yasemin Feb 22 '17 at 20:14
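
Matt's suggestion above (a StreamReader plus a boolean flag toggled at the boundary markers) might look like the following sketch; $File and $Destination are assumed to already hold the source and target paths:

# Sketch of Matt's flag-based approach: read line by line, start copying
# after START-OF-DATA and stop at END-OF-DATA. Use full paths, since .NET
# resolves relative paths against its own working directory.
$reader = New-Object System.IO.StreamReader $File
$writer = New-Object System.IO.StreamWriter $Destination
$inData = $false

while (-not $reader.EndOfStream) {
    $line = $reader.ReadLine()
    if ($line -match 'END-OF-DATA') { break }                  # stop at the end marker
    if ($inData) { $writer.WriteLine($line) }                  # copy lines inside the block
    elseif ($line -match 'START-OF-DATA') { $inData = $true }  # begin after the start marker
}

$reader.Close()
$writer.Close()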

2 Answers


As Matt mentions in the comments, you can read the file line by line yourself, with a StreamReader.

I'd suggest "skipping ahead" to the start with one loop, then collecting the relevant lines with another:

$Reader = New-Object System.IO.StreamReader 'C:\Path\to\file.txt'
$StartBoundary = 'START-OF-DATA'
$EndBoundary = 'END-OF-DATA'

# Skip ahead to the starting boundary
while(-not($Reader.EndOfStream) -and ($line = $Reader.ReadLine()) -notmatch $StartBoundary){ <#nothing to be done here#> }

# Output all lines until we hit the end boundary
$lines = while(-not($Reader.EndOfStream) -and ($line = $Reader.ReadLine()) -notmatch $EndBoundary){ $line }

# $lines now contains the data between the boundaries
$Reader.Close()
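
If the data section itself runs to gigabytes, collecting it in $lines may reintroduce the memory problem. The collection loop can stream each line straight to the destination file instead (a sketch, assuming $Destination already holds the target path):

# Replace the collection loop above with this to write lines straight
# to disk instead of accumulating them in $lines
& {
    while(-not($Reader.EndOfStream) -and ($line = $Reader.ReadLine()) -notmatch $EndBoundary){ $line }
} | Add-Content $Destination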
Mathias R. Jessen

I don't know if this will solve your memory problem, but try this:

$template=@"
{Content*:START-OF-DATA
line1
END-OF-DATA}
{Content*:START-OF-DATA
line2
Line3
END-OF-DATA}
"@

Get-ChildItem "C:\temp\test" -File | ForEach-Object {

    # Parse the file content against the template above
    $Data = Get-Content $_.FullName | ConvertFrom-String -TemplateContent $template

    if ($null -ne $Data)
    {
        [pscustomobject]@{FullName=$_.FullName; Content=$Data}
    }

} | Format-Table -Wrap
Esperento57