I'm working in Notepad++.

In the file I'm working with, every line should start with a string matching [0-9][0-9]-[0-9][0-9][0-9][0-9], immediately followed by a pipe. (One caveat: up to three capital letters can follow the four digits, e.g. 00-1324A| or 12-3456STR|.)

There are instances in the file where that pattern is in the middle of a line, and needs to be moved to the next line.

Example:

00-1234REV|The quick brown fox jumped over the lazy dog|Test
11-6544|FooBar|text99-8656ST|This needs to be on the next line|some text
45-8737|Peter pipe picked a peck of pickled peppers|TEST2

As I noted within the example, 99-8656ST needs to be moved to the next line, resulting in this:

00-1234REV|The quick brown fox jumped over the lazy dog|Test
11-6544|FooBar|text
99-8656ST|This needs to be on the next line|some text
45-8737|Peter pipe picked a peck of pickled peppers|TEST2

I currently have this regex: (?<=[^\d\r\n])\d{2}-\d{4}(?!\d) but that is matching on parts of social security numbers in the middle of a line:

123-45-6789

My regex will match on 45-6789.
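
The false positive can be reproduced outside the editor. This is only a sketch: it uses Python's `re` module rather than Notepad++'s own regex engine, on the assumption that the two agree for these lookarounds, and the variable names are made up for illustration.

```python
import re

# The regex from the question, unchanged.
ORIGINAL = re.compile(r"(?<=[^\d\r\n])\d{2}-\d{4}(?!\d)")

# The mid-line ID from the example is found, as intended.
line = "11-6544|FooBar|text99-8656ST|This needs to be on the next line|some text"
id_hits = ORIGINAL.findall(line)

# The tail of an SSN is found too, because "-" is a non-digit on the left.
ssn_hits = ORIGINAL.findall("123-45-6789")

print(id_hits)   # ['99-8656']
print(ssn_hits)  # ['45-6789']
```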

Wiktor Stribiżew
marky
  • Use numeric boundaries then `(?<=[^\d\r\n])(?<!\r\n)\d{2}-\d{4}(?!\d)`, see https://regex101.com/r/G3JocW/2 – Wiktor Stribiżew Sep 29 '21 at 13:41
  • You can use this regex: `(?<!^|\d)[0-9][0-9]-[0-9][0-9][0-9][0-9](?!\d)` remembering to activate the multiline mode on the Notepad++ find&replace tool. – logi-kal Sep 29 '21 at 13:46
  • @WiktorStribiżew, that's skipping the phone numbers, but still matches on SSNs – marky Sep 29 '21 at 13:47
  • @horcrux, Notepad++ is complaining that's an invalid regular expression – marky Sep 29 '21 at 13:47
  • Please update the question to see where the numeric boundaries fail. Actually, in my regex, `(?<!\r\n)` became redundant and can be removed. `(?<=[^\d\r\n])\d{2}-\d{4}(?!\d)` will work. – Wiktor Stribiżew Sep 29 '21 at 13:48
  • I mean, `(?<=[^\d\r\n])\d{2}-\d{4}(?!\d)` does what you need. If not, show what is wrong in the question. – Wiktor Stribiżew Sep 29 '21 at 13:49
  • @WiktorStribiżew, I updated the question to include your regex, which still matches on SSNs, as indicated in the updated question. Also, I added a critical piece of information about the pattern to search for: it always is followed by a pipe. – marky Sep 29 '21 at 13:53
  • Include hyphens into boundaries, `(?<=[^\d\r\n-])\d{2}-\d{4}(?!-?\d)` – Wiktor Stribiżew Sep 29 '21 at 13:56
  • Ok, so, it seems you need `(?<=[^\r\n\d-])\d{2}-\d{4}(?=[A-Z]{0,3}\|)`. Or `(?<=[^\d\r\n])(?<!\d-)\d{2}-\d{4}(?=[A-Z]{0,3}\|)`. The left-hand boundary condition is not that clear from your question. – Wiktor Stribiżew Sep 29 '21 at 13:59

1 Answer

Since purely numeric boundaries do not work here, you can add a check for a digit + hyphen on the left. The right-hand boundary is clear: it is zero to three uppercase letters followed by a pipe.

That means, you can use

(?<=[^\d\r\n])(?<!\d-)\d{2}-\d{4}(?=[A-Z]{0,3}\|)

See the regex demo. Details:

  • (?<=[^\d\r\n]) - immediately on the left, there must be a char other than a digit, CR, LF
  • (?<!\d-) - immediately on the left, there should be no digit + -
  • \d{2}-\d{4} - two digits, -, four digits
  • (?=[A-Z]{0,3}\|) - immediately followed with 0 to 3 uppercase letters and then a literal | char.
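
As a rough check of the pattern above, here is a minimal sketch in Python's `re` module (not Notepad++ itself, so engine differences are possible). The helper name `split_misplaced_ids` and the SSN added to the last sample line are assumptions for illustration. The replacement simply puts a line break in front of each match.

```python
import re

# Pattern from the answer: left boundary, SSN guard, ID, right boundary.
PATTERN = re.compile(r"(?<=[^\d\r\n])(?<!\d-)\d{2}-\d{4}(?=[A-Z]{0,3}\|)")


def split_misplaced_ids(text: str, newline: str = "\n") -> str:
    """Insert a line break before every ID found in the middle of a line."""
    return PATTERN.sub(lambda m: newline + m.group(0), text)


sample = (
    "00-1234REV|The quick brown fox jumped over the lazy dog|Test\n"
    "11-6544|FooBar|text99-8656ST|This needs to be on the next line|some text\n"
    "45-8737|SSN 123-45-6789|TEST2\n"  # hypothetical line containing an SSN
)

fixed = split_misplaced_ids(sample)
print(fixed)
```

Only 99-8656ST is moved to a new line. The IDs at the start of each line and the SSN fragment 45-6789 are left alone. In Notepad++ the corresponding replacement string would presumably be a line break followed by the whole match (for example `\r\n$0`), although the answer does not spell that out.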

If a lone hyphen immediately on the left should also block a match (not only a digit + hyphen), then replace (?<=[^\d\r\n])(?<!\d-) with (?<=[^\r\n\d-]).
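
The difference between the two left-hand boundaries shows up only when a hyphen that is not part of a digit + hyphen pair sits directly before the ID. A small comparison, again sketched in Python's `re` with a made-up test line:

```python
import re

RIGHT = r"\d{2}-\d{4}(?=[A-Z]{0,3}\|)"

# Variant 1: rejects only "digit + hyphen" on the left.
LENIENT = re.compile(r"(?<=[^\d\r\n])(?<!\d-)" + RIGHT)
# Variant 2: rejects any digit or hyphen on the left.
STRICT = re.compile(r"(?<=[^\r\n\d-])" + RIGHT)

line = "11-6544|ref-99-8656|x"  # hypothetical: ID preceded by a lone hyphen

lenient_hits = LENIENT.findall(line)
strict_hits = STRICT.findall(line)

print(lenient_hits)  # ['99-8656']
print(strict_hits)   # []
```

Both variants skip the SSN case, so the choice depends only on whether text such as ref-99-8656| should be split.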

Wiktor Stribiżew