I have one txt file with the following information:

...Page 1
Student 1 data
FF-form feed character (may or may not appear) [INCLUDE in parsed file]
...Page 2
Student 1 data
Student 1 data
********** END OF TRANSCRIPT **********
FF-form feed character (definitely appears in this position) [do not include in parsed file]
...Page 1
Student 2 data
Student 2 data
FF-form feed character (may or may not appear) [INCLUDE in parsed file]
...Page 2
Student 2 data
********** END OF TRANSCRIPT **********
FF-form feed character (definitely appears in this position) [do not include in parsed file]
...Page 1
Student 3 data
Student 3 data
Student 3 data
FF-form feed character (may or may not appear) [INCLUDE in parsed file]
********** END OF TRANSCRIPT **********
FF-form feed character (definitely appears in this position) [do not include in parsed file]

I’m trying to split the data into three separate files and delete only the form feed that appears after the “END OF TRANSCRIPT” line.

I end up with three files:

DATE_EDI_TRANSCRIPT_1.txt that contains “Student 1 Data”
DATE_EDI_TRANSCRIPT_2.txt that contains “Student 2 Data”
DATE_EDI_TRANSCRIPT_3.txt that contains “Student 3 Data”

However, the form feed in the extracted files is at the beginning of each file. I want to remove it from the beginning and the end of the file.

I get this:

[screenshot of the current output, image not included]

I want to get this:

[screenshot of the desired output, image not included]

My code is:

```

    $data =  Get-Content "C:\EDICleanUp\1_ToBeProcessed\edi.txt" #Reading file
    $Transcript = "_EDI_TRANSCRIPT_"
    $Tdate = get-date -Format yyyy-MM-dd
    $ProcessedFilePath = "C:\EDICleanUp\2_Processed"
    $Complete = "C:\EDICleanUp\3_Original"
    $ToBeProcessed = "C:\EDICleanUp\1_ToBeProcessed\edi.txt"

    $fileIndex = 1; #To create file name

    for ($itr = 0; $itr -lt $data.Length; $itr++) {

        # A separator line closes the current transcript: switch to the next output file.
        if ($data[$itr] -eq "**********  END OF TRANSCRIPT  **********") {
            $fileIndex += 1
            continue
        }

        # Create the output file on first use.
        if ((Test-Path "$ProcessedFilePath\$Tdate$Transcript$fileIndex.txt") -eq $false) {
            New-Item "$ProcessedFilePath\$Tdate$Transcript$fileIndex.txt" -ItemType "File"
        }

        # Append text to the file
        Add-Content "$ProcessedFilePath\$Tdate$Transcript$fileIndex.txt" $data[$itr]
    }
    ##Move original file to completed directory
    Move-item $ToBeProcessed $Complete

```

I "think" the issue is with:

    if($data[$itr] -eq "**********  END OF TRANSCRIPT  **********"){ 
    $fileIndex+=1;

I can't figure out the proper code to look for the hard return/form feed.

I tried variations of:

'**********\s\s[END OF TRANSCRIPT]***********+\f'

with no luck.

Any input would be greatly appreciated.

  • Can you post the sample data as-is? No comments or markup, just the raw text as-is. – Mathias R. Jessen Jan 20 '23 at 18:35
  • @MathiasR.Jessen, Is there a way to upload a file to my post? Trying to post the data 'as-is' in a comment, truncates data. Student 1 data Student 1 data Student 1 data ********** END OF TRANSCRIPT ********** Student 2 data Student 2 data Student 2 data ********** END OF TRANSCRIPT ********** Student 3 data Student 3 data Student 3 data ********** END OF TRANSCRIPT ********** Student 4 data Student 4 data Student 4 data ********** END OF TRANSCRIPT ********** Student 5 data Student 5 data Student 5 data ********** END OF TRANSCRIPT ********** – Wayne W. Van Ellis Jan 21 '23 at 23:58
  • You can [edit your original post](https://stackoverflow.com/posts/75188031/edit) – Mathias R. Jessen Jan 23 '23 at 21:59

1 Answer

Instead of line-by-line processing, I suggest reading the entire file at once with Get-Content -Raw and using the regex-based -split operator to split your file into the blocks of interest.

A simplified example:

    # Read the file into a single, multi-line string.
    $data = Get-Content -Raw C:\EDICleanUp\1_ToBeProcessed\edi.txt
    $Tdate = Get-Date -Format yyyy-MM-dd
    $nr = @{ Value = 0 } # output file sequence number

    $data `
      -split ([regex]::Escape('********** END OF TRANSCRIPT **********') + '\r?\n\f') `
      -ne '' |
      Set-Content -NoNewline -LiteralPath {
        '{0}_EDI_TRANSCRIPT_{1}.txt' -f $Tdate, ++$nr.Value
      }

  • [regex]::Escape('********** END OF TRANSCRIPT **********') + '\r?\n\f' is the regex to split the file contents by:

    • [regex]::Escape('********** END OF TRANSCRIPT **********') escapes the literal part of the search string for use as such in a regex (in effect, this \-escapes the * characters, which are regex metacharacters).

    • \r?\n matches either a Windows-format CRLF newline (\r\n) or a Unix-format LF newline (\n)

    • \f matches a FF char.

    • Note: It's not clear from your sample data whether the FF char. is preceded by newline, followed by a newline, or there's no newline at all - adjust the above as needed. Also, the assumption is that each END OF TRANSCRIPT instance is surrounded by the same, fixed number of * chars. and whitespace.

    • If there is no newline before the FF char., or if you know that the newline is always a CRLF sequence or always just a LF character, you can get away with an expandable string literal ("...") containing escape sequences (see next section), combined with literal splitting via -split's SimpleMatch option; e.g.:

       $data -split "********** END OF TRANSCRIPT **********`f", 0, 'SimpleMatch'
      
  • -ne '' filters out empty blocks of lines from the result (if the file ends in a separator, -split considers the empty string after it another element).

  • A delay-bind script block is used with Set-Content to dynamically determine the output file name for each resulting block of lines.

    • Note how $nr, the output-file sequence number is defined as a hashtable (@{ ... }), not directly as an integer; this is required, because delay-bind script blocks run in a child scope of the caller; see this answer for an explanation.

    • The -f operator is used to synthesize the output file name.
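
To check whether the separator regex matches before writing any files, the split can be tried on an in-memory string first. This is only a sketch: the sample text and the `$sample`, `$separator`, `$blocks`, `$keepLine` and `$kept` names are made up for illustration, and it assumes CRLF newlines with the FF directly after the newline that follows the END OF TRANSCRIPT line. The second variant is not part of the code above; it uses a lookbehind assertion so that only the FF is consumed and the END OF TRANSCRIPT line stays in each block, in case that line should be kept.

```powershell
# Hypothetical sample: two transcripts, each terminated by the separator line,
# a CRLF newline and a FF character. The first transcript also contains an
# in-record FF (between its two pages) that must be preserved.
$sample = "Student 1 page 1`r`n`fStudent 1 page 2`r`n" +
          "********** END OF TRANSCRIPT **********`r`n`f" +
          "Student 2 data`r`n" +
          "********** END OF TRANSCRIPT **********`r`n`f"

$literal = [regex]::Escape('********** END OF TRANSCRIPT **********')

# Variant 1: separator line + newline + FF are all consumed by the split.
$separator = $literal + '\r?\n\f'
$blocks = @($sample -split $separator -ne '')

$blocks.Count                # 2
$blocks[0].Contains("`f")    # True:  the in-record FF is still there
$blocks[1].Contains("`f")    # False: the separator FF is gone

# Variant 2 (assumption: the END OF TRANSCRIPT line should stay in each block):
# a lookbehind matches only the FF that follows the separator line + newline.
$keepLine = '(?<=' + $literal + '\r?\n)\f'
$kept = @($sample -split $keepLine -ne '')

$kept.Count                                   # 2
$kept[1].Contains('END OF TRANSCRIPT')        # True
```

If `$blocks.Count` comes back as 1 for an excerpt of the real file, the separator regex did not match, and the newline/FF part of the pattern is the first thing to adjust.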


As for what you tried:

I tried variations of: '**********\s\s[END OF TRANSCRIPT]***********+\f'

It is only the .NET regex engine (used behind the scenes by PowerShell's regex operators) that understands constructs such as \s and \f.

Thus, to use something like the above you'd have to use the regex-based -match operator rather than the -eq operator.

However, PowerShell does have escape sequences for certain ASCII-range control characters (but not for abstractions such as \s to represent various whitespace chars.), which require the use of expandable (double-quoted) strings ("...") and backtick (`) escape sequences, such as "`f" for FF; in the modern PowerShell (Core) 7+ edition it is additionally possible to represent any Unicode character with an escape sequence - see the conceptual about_Special_Characters help topic.
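
The difference between the two kinds of escapes can be seen in a few lines. A small sketch, using a made-up `$line` value that ends in a FF character (the variable names are only for illustration):

```powershell
# Hypothetical line: the separator text followed directly by a FF character.
$line = "********** END OF TRANSCRIPT **********`f"

# -eq compares literally: in this string, \f is just a backslash and an 'f'.
$literalEq = $line -eq '********** END OF TRANSCRIPT **********\f'

# "`f" inside an expandable string is a real FF character, so -eq succeeds.
$escapeEq = $line -eq "********** END OF TRANSCRIPT **********`f"

# -match hands the pattern to the .NET regex engine, which understands \f;
# the literal part must still be escaped because * is a regex metacharacter.
$regexMatch = $line -match ([regex]::Escape('********** END OF TRANSCRIPT **********') + '\f$')

$literalEq    # False
$escapeEq     # True
$regexMatch   # True
```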

mklement0
  • Thanks for the response. If I run it as written, it creates one file. Where do I incorporate this into my code in order to get a new file for each record that includes the data ending with "END OF TRANSCRIPT"? – Wayne W. Van Ellis Jan 21 '23 at 00:38
  • @WayneW.VanEllis, note the caveat in the answer re not knowing the exact format of the separators. If you only get one file, the implication is that the separator regex didn't match. Therefore: is there a newline before or after the FF, or none at all? Or is the whitespace and/or number of `*` chars. variable? – mklement0 Jan 21 '23 at 00:43
  • Makes sense now, I'll keep working on it. – Wayne W. Van Ellis Jan 21 '23 at 23:20
  • @WayneW.VanEllis, you can also just clarify what the actual file format is, based on the questions in my previous comment. Then it would be easy to update my answer to show the solution. – mklement0 Jan 22 '23 at 14:00
  • My apologies for my ignorance, but I "think" this is a simple plain-text .txt file. It's originally received as a .dat file, then it's run through an "edi process" that results in what I assume is a plain-text file. The resulting plain-text file is what I'm trying to manipulate.
    I was able to test further, and I now get different files (and no errors), but there's incorrect data inserted in each file.
    ``` File 1: \*\*\*\*\*\*\*\*\*\*\ File 2: \ File 3: END\ File 4: OF\ File 5: TRANSCRIPT\ File 6: \ File 7: \*\*\*\*\*\*\*\*\*\*\ ```
    – Wayne W. Van Ellis Jan 22 '23 at 18:35
  • @WayneW.VanEllis, download the `Debug-String` function from [this Gist](https://gist.github.com/mklement0/7f2f1e13ac9c2afaf0a0906d08b392d1), and run it as `Get-Content -Raw $yourFile | Debug-String -CaretNotation`, then update your question with an excerpt from the result covering everything from the start of the file through at least 2 sections. – mklement0 Jan 22 '23 at 22:06
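
For a quick look at where the CR, LF and FF characters sit without downloading anything, a few string replacements may be enough. This is a rough sketch, not a substitute for `Debug-String`: the `Show-ControlChars` function name is hypothetical, and it only handles the three control characters relevant here.

```powershell
# Hypothetical helper: make CR, LF and FF visible in caret notation
# (^M, ^J, ^L). Each LF is kept after its ^J so the output stays line-oriented.
function Show-ControlChars {
  param([Parameter(Mandatory)] [string] $Text)
  $Text.Replace("`r", '^M').Replace("`n", "^J`n").Replace("`f", '^L')
}

# Made-up excerpt; for a real file, pass (Get-Content -Raw $yourFile) instead.
$visible = Show-ControlChars "Student 1 data`r`n********** END OF TRANSCRIPT **********`r`n`fStudent 2 data`r`n"
$visible
```

In the output, a `^L` at the start of a line means the FF follows the newline, whereas a `^L` right after the asterisks means it precedes the newline; that tells you which form of the separator regex to use.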