PowerShell 7.x: How to Select a Text Substring of Unknown Length Using Only Boundary Substrings

Question

I am trying to extract a substring from a text file, delimited by a known beginning string and a known end string. I am new to PowerShell, so my methods are simple/crude. Basically, my approach has been:

Roughly get what I want from the start of the string
Worry about trimming off what I don't want later

My minimum reproducible example is as follows:

# selectStringTest.ps    
         
$inputFile = Get-Content -Path "C:\test\test3\Copy of 31832_226140__0001-00006.txt"

#  selected text string needs to span from $refName up to $boundaryName 
[string]$refName = "001 BARTLETT"
[string]$boundaryName = "001 BEECH"

# a rough estimate of the text file lines required
[int]$lines = 200
   
if (Select-String  -InputObject $inputFile -pattern $refName) {
    Write-Host "Selected shortened string found!"
    # this selects the start of required string but with extra text
    [string]$newFileStart = $inputFile | Select-String $refName -CaseSensitive -SimpleMatch -Context 0, $lines   
}
else {
    Write-Host "Selected string NOT FOUND."
}
# tidy up the start of the string by removing rubbish
$newFileStart = $newFileStart.TrimStart('> ')

# this is the kind of thing I want but it doesn't work
$newFileStart = $newFileStart - $newFileStart.StartsWith($boundaryName)

$newFileStart | Out-File tempOutputFile

As it stands, the output begins correctly, but I cannot remove the text from $boundaryName onward.

The original text file is OCR generated (Optical Character Recognition), so it is unevenly formatted. There are newlines in odd places, so I have limited options when it comes to delimiting.

I am not sure whether my if (Select-String -InputObject $inputFile -pattern $refName) is valid, although it appears to work correctly. The general design seems crude, in that I am guessing how many lines I will need. Finally, I have tried various methods of trimming the string from $boundaryName onward without success. For this:

string.split() not practical
replacing spaces with newlines in an array & looping through to elements of $boundaryName is possible but I don't know how to terminate the array at this point before returning it to string.

Any suggestions would be appreciated.

Abbreviated content of a single file, Copy of 31832_226140__0001-00006.txt, which holds two listings of 200 entries each:

Beginning of text file

________________

BARTLETT-BEDGGOOD
PENCARROW COMPOSITE ROLL
PAGE 6
PAGE 7
PENCARROW COMPOSITE ROLL
BEECH-BEST
www.
.......................
001 BARTLETT. Lois Elizabeth

Middle of text file

............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
001 BEECH, Margaret ..........

End of text file

..............312 Munita Rood Eastbourne, Civil Eng 200 BEST, Dons Amy .........
..........50 Man Street, Wamuomata, Marned
SO NON

Each text file is approximately 30KB and spans about 250 lines in Notepad++. I will add a 'sample', that is, an edited start text and end text. – Dave, Feb 22 '22 at 21:28
Yeah, not asking for the complete file, just a representation of how the file looks and what you would like to have as a result. – Santiago Squarzon, Feb 22 '22 at 21:29
Just to confirm I understood correctly, you're looking to extract all the text between `001 BARTLETT` and `001 BEECH`? And if so, do you want to include or exclude those key words? – Santiago Squarzon, Feb 22 '22 at 22:10
Should the boundary strings always begin at the beginning of a line? Should the lines containing the boundary strings be included in the output? – lit, Feb 22 '22 at 22:13
I am parsing text into two listings of 200 entries each (001 to 200). Therefore the boundary string is *NOT* included: the boundary of the first listing becomes the start of the second listing. My code works for a single listing. However, most of the files are double listings, hence the need to separate them. Hope that makes sense. – Dave, Feb 22 '22 at 22:25

lit · Accepted Answer · 2022-02-23T00:03:18.030

To use a regex across newlines, the file needs to be read as a single string. Get-Content -Raw will do that. This assumes that you do not want the lines containing $refName and $boundaryName included in the output.

$c = Get-Content -Path '.\beech.txt' -Raw
$refName = "001 BARTLETT"
$boundaryName = "001 BEECH"

if ($c -match "(?smi).*$refName.*?`r`n(.*)$boundaryName.*?`r`n.*") {
    $result = $Matches[1]
}
$result

More information at https://stackoverflow.com/a/12573413/447901
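
Going by the comment thread, the asker wants the `001 BARTLETT` line kept and everything from `001 BEECH` onward dropped. The following is a minimal sketch of that variant, not the answerer's code. It assumes both boundary strings are literal text (hence `[regex]::Escape`), that each starts a line apart from leading spaces or tabs, and it uses an inline sample string (hypothetical, abbreviated from the question) in place of the `Get-Content -Raw` result.

```powershell
# Stand-in for: $c = Get-Content -Path '.\beech.txt' -Raw
$c = @"
BEECH-BEST
001 BARTLETT. Lois Elizabeth
............15 St Ronans Av, Lower Mutt. Coachbuild
001 BEECH, Margaret ..........
..........50 Man Street, Wamuomata, Marned
"@

$refName      = '001 BARTLETT'
$boundaryName = '001 BEECH'

# (?m) lets ^ match at every line start; (?s) lets . match newlines.
# The lazy .*? stops at the first line that begins with the boundary text.
$pattern = '(?ms)^[ \t]*(' + [regex]::Escape($refName) + '.*?)^[ \t]*' + [regex]::Escape($boundaryName)

$result = if ($c -match $pattern) {
    $Matches[1].TrimEnd()   # drop the line break left in front of the boundary line
} else {
    $null                   # no span found
}
$result
```

Because the pattern never mentions `` `r`n `` explicitly, it should behave the same on files with CRLF or LF line endings.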

edited Feb 23 '22 at 00:03

answered Feb 22 '22 at 22:22

lit


Great! Thanks. This works. The only negative is that it includes the boundary listing. – Dave Feb 22 '22 at 22:59
Thanks for the SO reference post. Your regex was confusing me. The post will help understanding. – Dave Feb 22 '22 at 23:54
Your change fixed the output end but screwed up the output start. Don't worry. You've been a great help. I have to dig into the details more. So I'll be able to correct it at some stage. Thanks. – Dave Feb 23 '22 at 00:16
@Dave, are you wanting the text on the same line after the starting boundary to be in the output? – lit Feb 23 '22 at 00:20
Yeah. My code produced `001 BARTLETT. Lois Elizabeth .......` at the beginning of the output text file. – Dave Feb 23 '22 at 00:23
Should the first line of output be `001 BARTLETT. Lois Elizabeth .......`? That is easy enough to get in. – lit Feb 23 '22 at 00:27

Darin · Answer 2 · 2022-02-23T03:04:48.727

How close does this come to what you want?

function Process-File {
    param (
        [Parameter(Mandatory = $true, Position = 0)]
        [string]$HeadText,
        [Parameter(Mandatory = $true, Position = 1)]
        [string]$TailText,
        [Parameter(ValueFromPipeline)]
        $File
    )
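    Process {
        # $Inside is true while we are between the head line and the tail line.
        $Inside = $false
        # switch tests every clause against each line (there is no break), so clause order matters.
        switch -Regex -File $File.FullName {
            #'^\s*$' { continue }
            # Tail line: emit the text after $TailText and stop collecting.
            "(?i)^\s*$TailText(?<Tail>.*)`$"    { $Matches.Tail; $Inside = $false }
            # Any non-empty line: emit it only while inside the span.
            '^(?<Line>.+)$'                     { if($Inside) { $Matches.Line } }
            # Head line: emit the text after $HeadText and start collecting.
            "(?i)^\s*$HeadText(?<Head>.*)`$"    { $Matches.Head; $Inside = $true }
            default { continue }
        }
    }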
}
$File = 'Copy of 31832_226140__0001-00006.txt'
#$Path = $PSScriptRoot
$Path = 'C:\test\test3'

$Result = Get-ChildItem -Path "$Path\$File" | Process-File '001 BARTLETT' '001 BEECH'
$Result | Out-File -FilePath "$Path\SpanText.txt"

This is the output:

. Lois Elizabeth
............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
, Margaret ..........
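
Since `Process-File` places `$HeadText` and `$TailText` directly into the regex, any regex metacharacters in those strings are treated as pattern syntax. The sketch below illustrates the difference `[regex]::Escape` would make. The header text and the sample lines are hypothetical, chosen only to show the effect.

```powershell
# Hypothetical header text containing a regex metacharacter (the trailing dot).
$headText = '001 BARTLETT.'

# Unescaped, the dot matches any character; escaped, it matches only a literal dot.
$rawPattern     = '(?i)^\s*' + $headText + '(?<Head>.*)$'
$escapedPattern = '(?i)^\s*' + [regex]::Escape($headText) + '(?<Head>.*)$'

# Hypothetical OCR misread: no dot after the surname.
$ocrLine = '001 BARTLETTS Lois Elizabeth'

$rawHit     = $ocrLine -match $rawPattern       # the "." matches the "S"
$escapedHit = $ocrLine -match $escapedPattern   # a literal "." is required

# A correctly read line still matches the escaped pattern.
$goodLine = '001 BARTLETT. Lois Elizabeth'
$goodHit  = $goodLine -match $escapedPattern
$headRest = $Matches.Head                       # text after the header on that line
```

Whether escaping is wanted depends on how the boundaries are supplied: if they are meant to be regex fragments such as `[0-9][0-9][0-9] BARTLETT`, they must stay unescaped.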

edited Feb 23 '22 at 03:04

answered Feb 22 '22 at 22:33

Darin


Some notes on this script. You can get rid of the tail line completely (, Margaret ..........) by removing the "$Matches.Tail;". The "." in front of "Lois Elizabeth" can be removed easily, probably need to insert something like ([.]\s)?, but not sure without experimenting. I believe blank lines are skipped, but lines with only spaces are kept, but that can be changed easily to any way you want. Remove lines with spaces, or keep all lines. Just let me know and I should be able to make the changes. – Darin Feb 22 '22 at 23:18
Great! Thanks. It basically works to requirement. I like your code layout. I take it your approach is replicating UNIX head/tail functionality. So I will take this into account when making changes. – Dave Feb 22 '22 at 23:52
One thing that puzzles me, in your code, is how to swap out hard-coded `001 BEECH` & `001 BARTLETT` for regex escaped variable like `$pattern1` & `$pattern2`?? – Dave Feb 23 '22 at 02:01
Dave, you have to be careful doing that. I don't think I've ever tried that, but my first approach would be to replace '(?i)^\s*001 BEECH(?<Tail>.*)$' with '(?i)^\s*'+$Pattern1+'(?<Tail>.*)$' and see if that works. The alternate approach would be to replace single quotes 'RegEx' with double quotes "RegEx", but you really have to make sure you know how the double quotes are going to react to each character in the string. It is possible that "(?i)^\s*$pattern1(?<Tail>.*)$" will work. I will have to experiment with that. – Darin Feb 23 '22 at 02:12
Dave, it worked! Process-File accepts two parameters now, $HeadText and $TailText. Each are placed in the RegEx to give the function new flexibility. – Darin Feb 23 '22 at 03:09
How do the HeadText and TailText need to change? If it is the "001" changing to "002", and then later "003", we could rewrite the RegEx to just look for 3 numeric digits in that place. In RegEx, to match any 3 digit number, 000 to 999, use [0-9][0-9][0-9]. The new RegEx would be: "(?i)^\s*[0-9][0-9][0-9] BARTLETT(?<Head>.*)$" – Darin Feb 23 '22 at 03:24
Your suggestion `"(?i)^\s*$pattern1(?<Tail>.*)$"` works. Thank you. This code is only a small part of the solution. I am reading all files in a folder, extracting header strings, using the headers as boundaries, processing the text files to have one listing per line, then renaming each file using its header. I have it working for a single text file (originally a single page image file) with 200 entries. However, most of the text files for the electoral roll are double pages, so I needed to deal with this before processing. – Dave Feb 23 '22 at 03:41
Dave, I re-wrote the code near the same time of your last post, so not sure if you noticed that. Also, the Get-ChildItem command can be written with wildcards to get all files in a folder, and even subfolders, with a given name pattern such as *.txt. Each file object will go down the pipeline one at a time and be processed by Process-File one at a time. If you are not supplying outside information, that is, all info is derived somehow from the files, then the Process-File function could be redesigned to do the full job of creating new files with new names and desired content. – Darin Feb 23 '22 at 14:30
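
The wildcard idea in the last comment can be sketched as follows. This is an illustration under stated assumptions, not the code from either answer: it uses a single multi-line regex instead of `Process-File`, and the temporary folder, the file names `a.txt` and `b.txt`, and their contents are all hypothetical stand-ins.

```powershell
# Hypothetical demo folder with two small stand-in files.
$demo = Join-Path ([System.IO.Path]::GetTempPath()) ('rolls_' + [guid]::NewGuid().ToString('N'))
New-Item -ItemType Directory -Path $demo | Out-Null
Set-Content -Path (Join-Path $demo 'a.txt') -Value "PAGE 6`n001 BARTLETT. Lois`n200 BEDGGOOD. Percy`n001 BEECH, Margaret`n200 BEST, Doris"
Set-Content -Path (Join-Path $demo 'b.txt') -Value 'no listings here'

# Capture from the head line up to (not including) the line that starts with the boundary.
$pattern = '(?ms)^[ \t]*(' + [regex]::Escape('001 BARTLETT') + '.*?)^[ \t]*' + [regex]::Escape('001 BEECH')

# Each matching *.txt file goes down the pipeline one at a time.
$extracted = Get-ChildItem -Path $demo -Filter '*.txt' | ForEach-Object {
    $text = Get-Content -LiteralPath $_.FullName -Raw
    if ($text -match $pattern) {
        [pscustomobject]@{ File = $_.Name; Text = $Matches[1].TrimEnd() }
    }
}

$extracted | Format-List

# Clean up the demo folder.
Remove-Item -LiteralPath $demo -Recurse -Force
```

Files without both boundaries simply produce no output object, so they can be handled separately, for example as single-listing pages.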
