
I have huge files, ~3 GB each. These files have an information section at the top as well as at the bottom, and the number of information lines differs from file to file, e.g.:

infostart1
infostart2
START-OF-DATA
line1
line2
...
...
...
linen
END-OF-DATA
infoend1
infoend2

I am trying to create a .dat file that contains only the lines between START-OF-DATA and END-OF-DATA. I find the boundary line numbers with:

# 1-based line numbers of the first START-OF-DATA and the last END-OF-DATA match
$DataStartLineNumber = (Select-String $File -Pattern 'START-OF-DATA' | Select-Object -ExpandProperty 'LineNumber')[0]
$DataEndLineNumber = (Select-String $File -Pattern 'END-OF-DATA' | Select-Object -ExpandProperty 'LineNumber')[-1]

I have tried:

Get-Content -Path $File | Select-Object -Index ($DataStartLineNumber..($DataEndLineNumber-2)) | Add-Content $Destination

but Get-Content fails due to memory usage.

I have also tried:

Get-Content -Path $File -ReadCount 10000 | Select-Object -Index ($DataStartLineNumber..$DataEndLineNumber) | Add-Content $Destination

However, this does not work as expected: with -ReadCount, Get-Content sends arrays of 10000 lines down the pipeline, so Select-Object -Index selects whole chunks rather than individual lines.

I don't want to read line by line since it takes too long. Is there any way to read chunks of data from the file and filter out everything that comes before 'START-OF-DATA' and after 'END-OF-DATA'? Or to copy the file as-is and then efficiently delete everything before 'START-OF-DATA' and after 'END-OF-DATA'?

yasemin
  • http://stackoverflow.com/questions/4192072/how-to-process-a-file-in-powershell-line-by-line-as-a-stream and http://stackoverflow.com/questions/32336756/alternative-to-get-content – Matt Feb 22 '17 at 17:23
  • Get-Content sucks with large files. Stream reader would be the way to go here. Run a couple of flags/bools so you know when to start and stop processing lines in your file. – Matt Feb 22 '17 at 17:24
  • Thank you Matt, I will look into it, I hope I can find an efficient way. – yasemin Feb 22 '17 at 17:26
  • Thank you Matt, this helped a lot, and it is really fast. – yasemin Feb 22 '17 at 20:14
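
Matt's suggestion above (a StreamReader plus a boolean flag toggled at the boundary markers) might look like the following sketch; $File and $Destination are assumed to already hold the source and target paths:

# Sketch of Matt's flag-based approach: read line by line, start copying
# after START-OF-DATA and stop at END-OF-DATA. Use full paths, since .NET
# resolves relative paths against its own working directory.
$reader = New-Object System.IO.StreamReader $File
$writer = New-Object System.IO.StreamWriter $Destination
$inData = $false

while (-not $reader.EndOfStream) {
    $line = $reader.ReadLine()
    if ($line -match 'END-OF-DATA') { break }                  # stop at the end marker
    if ($inData) { $writer.WriteLine($line) }                  # copy lines inside the block
    elseif ($line -match 'START-OF-DATA') { $inData = $true }  # begin after the start marker
}

$reader.Close()
$writer.Close()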

2 Answers


As Matt mentions in the comments, you can read the file line by line yourself, with a StreamReader.

I'd suggest "skipping ahead" to the start with one loop, then collecting the relevant lines with another:

$Reader = New-Object System.IO.StreamReader 'C:\Path\to\file.txt'
$StartBoundary = 'START-OF-DATA'
$EndBoundary = 'END-OF-DATA'

# Skip ahead to the starting boundary
while(-not($Reader.EndOfStream) -and ($line = $Reader.ReadLine()) -notmatch $StartBoundary){ <#nothing to be done here#> }

# Output all lines until we hit the end boundary
$lines = while(-not($Reader.EndOfStream) -and ($line = $Reader.ReadLine()) -notmatch $EndBoundary){ $line }

# $lines now contains the data between the boundaries
$Reader.Close()
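
If the data section itself runs to gigabytes, collecting it in $lines may reintroduce the memory problem. The collection loop can stream each line straight to the destination file instead (a sketch, assuming $Destination already holds the target path):

# Replace the collection loop above with this to write lines straight
# to disk instead of accumulating them in $lines
& {
    while(-not($Reader.EndOfStream) -and ($line = $Reader.ReadLine()) -notmatch $EndBoundary){ $line }
} | Add-Content $Destination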
Mathias R. Jessen

I don't know if this will solve your memory problem, but try this:

$template=@"
{Content*:START-OF-DATA
line1
END-OF-DATA}
{Content*:START-OF-DATA
line2
Line3
END-OF-DATA}
"@

Get-ChildItem "C:\temp\test" -File | ForEach-Object {

    # Parse the file content against the template above
    $Data = Get-Content $_.FullName | ConvertFrom-String -TemplateContent $template

    if ($null -ne $Data)
    {
        [pscustomobject]@{FullName=$_.FullName; Content=$Data}
    }

} | Format-Table -Wrap
Esperento57