I'm reading random HTML input files of arbitrary size, and I limit the read to 1000 lines or the closing </html> tag, whichever comes first. That works fine.

The problem is with small files, and with files where the </html> tag is missing. I'd like to know when the loop has reached the end of the content.

Question: is there some sort of EOF property for that?

$fileContents = (Get-Content -LiteralPath $filePath -totalcount 999)
ForEach ($line in $fileContents){
        $LineNo = $LineNo +1;
        if ((($line.ToLower().StartsWith("</html>"))) -or ($LineNo -gt 999) -or ??? END_OF_$fileContents ???)
            {
                # Do the rest of the processing in here...
            }
        }

A couple of days later, here is my final code to handle this:

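$fileContents = (Get-Content -LiteralPath $filePath -TotalCount 999)
$LineNo = 0   # reset the counter so a previous file's count does not carry over
ForEach ($line in $fileContents){
        $LineNo = $LineNo + 1
        # Stop at the closing tag, past line 500, or at the last line that was read
        if ((($line.ToLower().StartsWith("</html>"))) -or ($LineNo -gt 500) -or ($LineNo -ge $fileContents.Count))
            {
                # Do the rest of the processing in here...
                break   # leave the loop so the processing runs only once
            }
        }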

The idea of doing it this way is to "bail out" of processing massive HTML files while still being able to handle small ones, even if they aren't properly formed (common in email files).

rangi
  • $fileContents.Count will return the line count, which you can then use in your ForEach loop; or switch to a straight For loop, since you know the number of lines retrieved. – RetiredGeek Mar 09 '21 at 00:42
  • Yeah, thanks @RetiredGeek I think you're right, that's the only way to do it – rangi Mar 09 '21 at 00:48
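
The straight For loop suggested in the comment above could look roughly like the sketch below. The function name Get-HtmlStopLine and its parameters are hypothetical and do not appear in the original posts, and the 500-line cap is an assumption borrowed from the final code in the question. It returns the 1-based line at which processing should stop: the closing tag if one is found, otherwise the cap or the last line read.

```powershell
# Hypothetical helper (not from the original posts): find the line where scanning should stop.
function Get-HtmlStopLine {
    param(
        [string[]] $Lines,       # lines already read, e.g. with Get-Content -TotalCount 999
        [int] $MaxLines = 500    # assumed bail-out threshold
    )

    # Never look past the lines actually read, or past the cap.
    $limit = [Math]::Min($Lines.Count, $MaxLines)

    for ($i = 0; $i -lt $limit; $i++) {
        # Case-insensitive StartsWith replaces the ToLower() call.
        if ($Lines[$i].StartsWith('</html>', [System.StringComparison]::OrdinalIgnoreCase)) {
            return ($i + 1)      # 1-based line number of the closing tag
        }
    }

    # No closing tag: stop at the cap or at the last line read, whichever is smaller.
    return $limit
}

# Usage with in-memory sample data instead of a real file.
$sample = '<html>', '<body>hi</body>', '</HTML>', 'trailing junk'
Get-HtmlStopLine -Lines $sample   # closing tag is on line 3
```

Because the loop bound is computed up front, a short file with no closing tag simply ends at its last line, so no separate EOF test is needed.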

2 Answers

No need for extra EOF file checking. We will use Get-Content to handle both the EOF check, as well as the line count (999 lines) check.

We then can use Regular Expressions to parse the file to find the close HTML Tag, and the line number(s).

$fileContents = (Get-Content -LiteralPath $filePath -TotalCount 999)

$result = $fileContents | Select-String '<\/html>'

if ($null -ne $result -or $fileContents.Count -eq 999)
{
    # Hit the closing HTML tag, or the read reached the 999-line cap

    Write-Host "Hit close tag at line: $($result.LineNumber) or the 999-line cap"

    # Do the rest of the processing in here...
}
HAL9256
  • Thanks @HAL9256 normally that works fine, however if the file read is <999 lines and/or the end html tag is missing the subsequent process doesn't run. – rangi Mar 09 '21 at 02:25
  • @rangi, ah I see, I modified the answer to handle that case. In this case, we don't need a for loop, and can use RegEx to find the close tags. – HAL9256 Mar 09 '21 at 16:24
  • This is correct and we should acknowledge @RetiredGeek spotted it first..! My completed conditional line included above (SO changed response order somehow) – rangi Mar 10 '21 at 21:28

Then get the file's line count first, use it directly or assign it to a variable, and process as needed.

Refactor of HAL9256's helpful answer:

$filePath = 'D:\temp\Students.html'

Get-Content -LiteralPath $filePath -totalcount (Get-Content -Path $filePath).Count | 
ForEach-Object {
    if ($PSItem.ToLower().StartsWith("</html>"))
    {
        <#
        Hit Close HTML Tag
        Do the rest of the processing in here...
        #>
    }
}
postanote