
Consider the following text file:

NETHERLANDS (THE)
BOLIVIA (PLURINATIONAL STATE OF)
COCOS (KEELING) ISLANDS (THE)
ANTIGUA AND BARBUDA

TEST1, SOME TEXT
TEST2, SAINT HELENA AND ASCENSION AND TRISTAN DA CUNHA
TEST3, BONAIRE AND SINT EUSTATIUS AND SABA

I'm trying to capture all the text after the first , as separate matches, split wherever the text is separated by AND. The desired result is:

No Match (no ,)
No Match (no ,)
No Match (no ,)
No Match (no ,)

SOME TEXT
SAINT HELENA - ASCENSION - TRISTAN DA CUNHA
BONAIRE - SINT EUSTATIUS - SABA

Using this post as an example, I've created the following regex:

/(?<= AND |\, )(.*)(?= AND |$)/mU

This works fine, except for the one line that does not contain a , (ANTIGUA AND BARBUDA), where it still matches BARBUDA.
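
For anyone who wants to reproduce the problem outside regex101, here is a minimal Python sketch. It only approximates the PCRE pattern above: Python's re module requires fixed-width lookbehinds, so the lookbehind is split in two, and .*? stands in for .* under the U (ungreedy) flag. The name find_parts is made up for illustration.

```python
import re

# Approximation of /(?<= AND |\, )(.*)(?= AND |$)/mU for Python's `re`:
# two fixed-width lookbehinds instead of one, and an explicitly lazy `.*?`.
PATTERN = re.compile(r"(?:(?<= AND )|(?<=, )).*?(?= AND |$)", re.MULTILINE)

def find_parts(text):
    """Return every substring matched by the lookbehind-based pattern."""
    return [m.group(0) for m in PATTERN.finditer(text)]

print(find_parts("TEST3, BONAIRE AND SINT EUSTATIUS AND SABA"))
# ['BONAIRE', 'SINT EUSTATIUS', 'SABA']
print(find_parts("ANTIGUA AND BARBUDA"))
# ['BARBUDA']  (matched even though the line has no comma)
```

The second call shows the unwanted match: the lookbehind only checks what is immediately to the left, so it cannot know whether a , appeared earlier in the line.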


Question: How can I change this regex so that it only matches lines that contain at least one ,?
I've searched online for a solution and found answers to similar questions, but I was not able to apply those fixes without breaking the positive lookbehind.
0stone0

2 Answers


Fortunately this is PCRE, so you can use \G:

(?>,|\G(?!\A) +AND) +\K(?>(?! +AND).)+

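To speed up matching, it helps to anchor the first alternative at the line start and consume ^[^,]* before the , (this relies on the m flag, so that ^ matches at the start of each line):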

(?>^[^,]*,|\G(?!\A) +AND) +\K(?>(?! +AND).)+

Explanation

A match can start in one of two ways: 1) by matching a , or 2) with \G(?!\A). \G matches at the position where the previous match ended, and (?!\A) rules out the very start of the subject, so the second alternative can only continue a previous match. A , must therefore always be matched before anything else.

After matching the , we try to match everything that comes before an AND. This is done by this part:

 +\K(?>(?! +AND).)+
^ This is a space!

The \K meta-character discards everything matched so far from the reported match. In other words, it resets the start of the match. Since you don't need anything that comes before the item itself, we use \K to drop it from the output.

After a complete match, the next one should start with the second alternative, which is:

\G(?!\A) +AND

It looks for an AND preceded by one or more spaces, after which the same item pattern applies again.
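
To make the \G chaining more concrete, here is a rough Python sketch of the same steps. Python's re module supports neither \G nor \K, so this is an emulation and not the regex itself: Pattern.match(string, pos) only matches at pos, which plays the role of \G, and returning just the item plays the role of \K. The names HEAD, JOINER, ITEM and split_line are made up for illustration, and the function assumes it is given one line at a time.

```python
import re

HEAD = re.compile(r"[^,]*, +")         # ^[^,]*, followed by the spaces before \K
JOINER = re.compile(r" +AND +")        # \G(?!\A) +AND, plus the spaces before \K
ITEM = re.compile(r"(?:(?! +AND).)+")  # (?>(?! +AND).)+

def split_line(line):
    """Return the items after the first comma, or [] if there is no comma."""
    head = HEAD.match(line)
    if head is None:
        return []                      # no comma: the chain never starts
    parts, pos = [], head.end()
    while True:
        item = ITEM.match(line, pos)   # anchored at pos, like \G
        if item is None:
            break
        parts.append(item.group(0))    # only the item is kept, like \K
        joiner = JOINER.match(line, item.end())
        if joiner is None:
            break                      # no " AND " right here: chain ends
        pos = joiner.end()
    return parts

print(split_line("TEST2, SAINT HELENA AND ASCENSION AND TRISTAN DA CUNHA"))
# ['SAINT HELENA', 'ASCENSION', 'TRISTAN DA CUNHA']
print(split_line("ANTIGUA AND BARBUDA"))
# []
```

The line without a comma yields nothing because the loop can only be entered after HEAD has matched, which mirrors how \G(?!\A) can only continue a match that a , started.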

revo
    Thank you! Exactly what I was trying to achieve. Unfortunately I find it very difficult to see on regex101 exactly what, and why you did it this way. Could you maybe give me a little explanation? – 0stone0 Jun 11 '20 at 14:51
    You are welcome. I made some elaborations you may find helpful. – revo Jun 11 '20 at 15:03

Converting my comment to answer.

This regex may work for OP:

(?:^[^,]*, |\G(?!^) AND )\K.+?(?= AND |$)

RegEx Details:

  • (?:: Start non-capture group
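    • ^[^,]*, (with a trailing space): Line start, followed by 0 or more non-comma characters, followed by a comma and a space
    • |: OR
    • \G(?!^) AND (with a space on each side of AND): Continue from the end of the previous match, but not at a line start, and match " AND "
  • ): End non-capture group
  • \K: Reset the start of the reported match, discarding what was matched so far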
  • .+?: Match 1 or more of any character (non-greedy)
  • (?= AND |$): Positive lookahead to assert that we have " AND " or line end ahead of us.
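
If the regex is not a hard requirement, the same breakdown can be sketched with plain string operations. This is a different technique from the pattern above, not a translation of it, and it may behave differently on edge cases (for example a comma that is not followed by a space). The name parts_after_comma is made up for illustration.

```python
def parts_after_comma(line):
    """Split the text after the first ', ' on ' AND '; [] if there is no ', '."""
    head, sep, tail = line.partition(", ")
    if not sep or not tail:
        return []                   # no ", " in the line, or nothing after it
    return tail.split(" AND ")

lines = [
    "ANTIGUA AND BARBUDA",
    "TEST1, SOME TEXT",
    "TEST2, SAINT HELENA AND ASCENSION AND TRISTAN DA CUNHA",
]
for line in lines:
    parts = parts_after_comma(line)
    print(" - ".join(parts) if parts else "No Match")
# No Match
# SOME TEXT
# SAINT HELENA - ASCENSION - TRISTAN DA CUNHA
```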
anubhava