I have several thousand text files containing form information (one text file for each form), including the unique id of each form.

I have been trying to extract just the form ID using regex (which I am not too familiar with) by matching the text found before and after it and capturing only the number in between. Usually the text looks like this: "... 12 ID 12345678 INDEPENDENT BOARD..."

The 8-digit number (12345678 in this example) is the form ID that I need to extract.

The code I used can be seen below:

$id= ([regex]::Match($text_file, "12 ID (.+) INDEPENDENT").Groups[1].Value)

This works pretty well, but I soon noticed that there were some files for which this script did not work. After investigation, I found that there was another variation to the text containing the form ID used by some of the text files. This variation looks like this: "... 12 ID 12345678 (a.12(3)(b),45)..."

So my first challenge is to figure out how to change the script so that it will match the first or the second pattern. My second challenge is to escape all the special characters in "(a.12(3)(b),45)".

My understanding is that the pipe | is used as an "or" in regex and that two backslashes are used to escape special characters. However, the code below gives me errors:

$id= ([regex]::Match($text_one_line, "34 PR (.+) INDEPENDENT"|"34 PR (.+) //(a//.12//(3//)//(b//)//,45//)").Groups[1].Value)

Where have I gone wrong here, and how can I fix my code?

Thank you!

Saewon Park
  • Your use of quotes around the pipe (`"|"`) is wrong: the first part forms the complete string `"34 PR (.+) INDEPENDENT"`, the pipe then sits outside the string, which is not valid PowerShell, and the string that follows doesn't go anywhere. In regex, a single backslash escapes a special character; two backslashes (`\\`) are a backslash escaping itself, i.e. a literal backslash. You've used forward slashes instead. In some languages the backslash is also the escape character of the language itself, but that doesn't apply to PowerShell, which uses the backtick as its escape character. – TessellatingHeckler Oct 26 '17 at 04:53
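
To make the comment's two points concrete, here is a minimal sketch of how the original "before and after" approach could be written with the alternation inside a single pattern string and with single-backslash escapes. The sample strings are taken from the question; the variable names (`$textA`, `$textB`, `$pattern`, `$pattern2`) are made up for illustration, and the "12 ID" prefix is assumed to be the one that actually appears in the files.

```powershell
# Sample strings from the question; both variations carry the same form ID.
$textA = '... 12 ID 12345678 INDEPENDENT BOARD...'
$textB = '... 12 ID 12345678 (a.12(3)(b),45)...'

# The alternation lives inside ONE pattern string. A single backslash escapes
# each regex metacharacter; single quotes keep PowerShell from expanding anything.
$pattern = '12 ID (.+?) (?:INDEPENDENT|\(a\.12\(3\)\(b\),45\))'

# Alternative: let .NET escape the literal text instead of escaping it by hand.
$literal  = [regex]::Escape('(a.12(3)(b),45)')
$pattern2 = '12 ID (.+?) (?:INDEPENDENT|' + $literal + ')'

$idA  = [regex]::Match($textA, $pattern).Groups[1].Value
$idB  = [regex]::Match($textB, $pattern).Groups[1].Value
$idB2 = [regex]::Match($textB, $pattern2).Groups[1].Value
```

The lazy quantifier `.+?` is used here instead of `.+` so that the capture stops at the first trailing marker rather than the last one on the line; that is a design choice of this sketch, not something the question requires.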

1 Answer

When you approach a regex pattern, always look for the fixed vs. the variable parts. In your case the literal text "ID" in front of the number appears in both variations, so it is a useful reference point, whereas the text after the number changes.

The following pattern applies this suggestion: `(?:ID\s+)(\d{8})`
It matches the literal text ID followed by one or more whitespace characters, then captures exactly eight digits in group 1.

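$str = "... 12 ID 12345678 INDEPENDENT BOARD..."
# Find every occurrence of "ID", whitespace, and eight digits.
$ret = [Regex]::Matches($str, "(?:ID\s+)(\d{8})")
for($i = 0; $i -lt $ret.Count; $i++) {
    # Group 1 of the current match holds the eight-digit form ID.
    $ret[$i].Groups[1].Value
}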
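
As a rough sketch of how this could be reused for both text variations from the question, the pattern can be wrapped in a small helper. `Get-FormId` is a hypothetical name chosen for this example, and the sketch assumes each text contains at most one form ID; the non-capturing group around `ID\s+` is dropped because it does not change what is matched.

```powershell
# Hypothetical helper: returns the first eight-digit number that follows "ID",
# or $null when the text contains no such number.
function Get-FormId {
    param([string]$Text)

    $m = [regex]::Match($Text, 'ID\s+(\d{8})')
    if ($m.Success) { $m.Groups[1].Value } else { $null }
}

# Both variations from the question, plus a string without any form ID.
$idA  = Get-FormId '... 12 ID 12345678 INDEPENDENT BOARD...'
$idB  = Get-FormId '... 12 ID 12345678 (a.12(3)(b),45)...'
$none = Get-FormId 'nothing to match in this text'
```

Because the pattern anchors on what comes before the number only, the trailing text never needs to be escaped or listed. For real files, the text passed to the helper could come from something like `Get-Content -Raw`, with the path being whatever applies to your setup.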

Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. It contains a treasure trove of useful information.

wp78de