I have the following text.

^0001   HeadOne


@@
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

^0002   HeadTwo


@@
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.


^0003   HeadThree


@@
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

^0004   HeadFour


@@
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

Below is the regex I'm using in Find.

@@([\n\r\s]*)(.*)([\n\r\s]+)\^

However, this only catches the content under ^0001 and ^0003, since each of those has a single paragraph. My text also contains multi-paragraph content.

I'm using VS Code. Can someone please tell me how to capture such multi-paragraph strings with a regex in VS Code or Notepad++?

Thanks

Wiktor Stribiżew
user3872094
  • \s is usually whitespace, I'd try removing that or adding a * to indicate zero or more – Calvin Taylor Oct 03 '17 at 10:03
  • What are the expected matches and why? @Calvin, in VSCode, `\s` does not match newlines. How can you define the trailing boundary? Are `@@` always at the start of the line? – Wiktor Stribiżew Oct 03 '17 at 10:11
  • @WiktorStribiżew it's matching directly when using `@@` – user3872094 Oct 03 '17 at 10:16
  • @WiktorStribiżew, I'm trying to match the content that is between `@@` and `^` let it be any number of paragraphs... – user3872094 Oct 03 '17 at 10:24
  • I have just tried `@@.*(?:[\s\r]?(?!\s*\^).*)*` in VSCode, and it highlighted all the paragraphs after `@@`. Try it. My VSCode version is 1.16.1. To make sure the `@@` are only matched at the start of a line, prepend the pattern with `^`. – Wiktor Stribiżew Oct 03 '17 at 10:27
  • @WiktorStribiżew, Thank you soo much... you saved my day... :) Can you please post it as answer, I want to upvote and accept it. – user3872094 Oct 03 '17 at 10:41

2 Answers

One weird thing about VSCode regex is that \s does not match all line break chars. One needs to use [\s\r] to match all of them.

Keeping that in mind, you want to match all substrings that start with @@ and then stretch up to a ^ at the start of a line or end of string.

I suggest:

@@.*(?:[\s\r]+(?!\s*\^).*)*

See the regex demo

NOTE: To only match @@ at the start of a line, add ^ at the start of the pattern, ^@@.*(?:[\s\r]+(?!\s*\^).*)*.

NOTE 2: Starting with VSCode 1.29, you need to enable the search.usePCRE2 option to use lookaheads in your regex patterns.

Details

  • ^ - start of a line
  • @@ - a literal @@
  • .* - the rest of the line (0+ chars other than line break chars)
  • (?:[\s\r]+(?!\s*\^).*)* - 0 or more consecutive occurrences of:
    • [\s\r]+(?!\s*\^) - one or more line breaks not followed with 0+ whitespace and then ^ char
    • .* - the rest of the line

In Notepad++, use ^@@.*(?:\R(?!\h*\^).*)* where \R matches a line break, and \h* matches 0 or more horizontal whitespaces (remove if ^ is always the first char on a delimiting line).
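
To check the pattern outside the editor, here is a minimal Python sketch. It assumes Python's re module is close enough to the editor's engine for this pattern (in Python, \s already matches line breaks, so [\s\r] is redundant but harmless). The SAMPLE text and the names BLOCK and blocks are made up for illustration and only mimic the shape of the question's data.

```python
import re

# Pattern from the answer above; re.MULTILINE lets ^ match at every line start.
BLOCK = re.compile(r"^@@.*(?:[\s\r]+(?!\s*\^).*)*", re.MULTILINE)

# Hypothetical sample shaped like the question's text (no trailing newline).
SAMPLE = (
    "^0001   HeadOne\n"
    "\n"
    "@@\n"
    "First body, one paragraph.\n"
    "\n"
    "^0002   HeadTwo\n"
    "\n"
    "@@\n"
    "Second body, paragraph one.\n"
    "\n"
    "Second body, paragraph two."
)

# Each match starts at @@ and stops before the line break(s) leading to a ^ line.
blocks = [m.group(0) for m in BLOCK.finditer(SAMPLE)]

for block in blocks:
    print(repr(block))
```

Each match includes the leading @@ line, and the second one spans both paragraphs, which is the multi-paragraph case the original regex missed.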

Wiktor Stribiżew

I saved your input data to /tmp/test and got the following to work using Perl syntax:

grep -Pzo "@@(?:\s*\n)+((?:.*\s*\n)+)(?:\^.*)*\n+" /tmp/test

This should place the paragraphs not starting with ^ into $1. You may need to add \r back into the pattern to make it match perfectly.
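
For comparison, the same idea of putting only the paragraphs into group 1 can be sketched in Python. This is not the grep pattern above: it swaps in a lazy match that stops at a lookahead for the next line starting with ^ (or the end of the input). The names BODY, SAMPLE and bodies are hypothetical, and the sample text is invented to mirror the question's layout.

```python
import re

# Lazy capture after @@ up to the next line that starts with ^, or end of input.
# re.DOTALL lets . cross line breaks; re.MULTILINE lets ^ match at line starts.
BODY = re.compile(r"^@@\s*(.*?)(?=^\s*\^|\Z)", re.MULTILINE | re.DOTALL)

# Hypothetical sample shaped like the question's text.
SAMPLE = (
    "^0001   HeadOne\n"
    "\n"
    "@@\n"
    "First body, one paragraph.\n"
    "\n"
    "^0002   HeadTwo\n"
    "\n"
    "@@\n"
    "Second body, paragraph one.\n"
    "\n"
    "Second body, paragraph two.\n"
)

# group(1) holds only the paragraphs; strip() drops surrounding blank lines.
bodies = [m.group(1).strip() for m in BODY.finditer(SAMPLE)]

for body in bodies:
    print(repr(body))
```

Since \s in Python also covers \r, input with Windows line endings should not need a separate \r in the pattern; strip() removes any stray \r at the edges of a body.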

Calvin Taylor