I'm using this regex:

\([^)]+\d{4}\)

to match scientific citations (they are between parentheses and end with a year):

Text text text (Hung et al., 2020; Sung et al., 2021) text text

Now I want to match everything that is not a scientific citation (in this case, Text text text and text text). I tried using a negative lookahead:

(?!\([^)]+\d{4}\))

But when I tried to replace the matches with nothing, nothing was replaced.

What could be the problem, and how can I fix it?

alexchenco
  • What *exactly* do you want to match? `Text text text` and `text text`? – InSync Mar 25 '23 at 07:02
  • You could simply split the string on the bits that match your regex. For example, in Ruby, if the variable `str` holds the string in your example, `str.split(/\([^)]+\d{4}\)/) #=> ["Text text text ", " text text"]`. To also remove unwanted leading and trailing spaces, `str.split(/\([^)]+\d{4}\)/).map(&:strip) #=> ["Text text text", "text text"]`. – Cary Swoveland Mar 25 '23 at 09:17

2 Answers

Depending on the regex flavor, you could use either a capture group:

\([^)]+\d{4}\)|(\S.*?)(?=\s*(?:\([^)]+\d{4}\)|$))

Explanation

  • \([^)]+\d{4}\) Match the scientific pattern
  • | or
  • (\S.*?) Capture group 1, start with a non whitespace char and match 0+ chars, as few as possible
  • (?=\s*(?:\([^)]+\d{4}\)|$)) Positive lookahead, assert optional whitespace chars followed by either the scientific pattern or the end of the string
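As a rough illustration of how the capture group might be consumed in code, here is a minimal Python sketch. The helper name `non_citation_parts` and the constant names are hypothetical, and Python's `re` module is assumed as the regex flavor; only matches where group 1 participated are kept.

```python
import re

# The citation pattern from the question.
CITATION = r"\([^)]+\d{4}\)"

# Citation (matched, not captured) OR the wanted text (captured in group 1).
PATTERN = re.compile(CITATION + r"|(\S.*?)(?=\s*(?:" + CITATION + r"|$))")


def non_citation_parts(text):
    """Return the pieces of text that are not citations (hypothetical helper)."""
    return [
        m.group(1)
        for m in PATTERN.finditer(text)
        if m.group(1) is not None  # group 1 is None when a citation matched
    ]


sample = "Text text text (Hung et al., 2020; Sung et al., 2021) text text"
print(non_citation_parts(sample))
```

The lookahead from the question, by contrast, is zero-width: it only matches empty strings, so replacing those matches with nothing leaves the text unchanged.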

Or, with PCRE, using the (*SKIP)(*FAIL) approach:

\([^)]+\d{4}\)(*SKIP)(*FAIL)|\S.*?(?=\s*(?:\([^)]+\d{4}\)|$))

The fourth bird

PCRE2:

\([^)]+\d{4}\)         # Match a scientific citation
|                      # or
(?<=^|\s)              # something preceded by the beginning of the line or a whitespace
(?:                    # that consists of
  .(?!\([^)]+\d{4}\))  #   characters not followed by a scientific citation,
)+                     #   one or more times.

This solution matches both wanted and unwanted results, so you'll need to filter them using a programming language. Also see this answer for an explanation of the technique.
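The filtering step could look something like the Python sketch below. It is an adaptation, not the PCRE2 pattern verbatim: Python's `re` module only accepts fixed-width lookbehinds, so `(?<=^|\s)` is rewritten here as `(?:^|(?<=\s))`, which is assumed to be equivalent for this purpose. The helper name `keep_non_citations` is hypothetical.

```python
import re

# The citation pattern from the question.
CITATION = r"\([^)]+\d{4}\)"
CITATION_RE = re.compile(CITATION)

# Adapted from the PCRE2 pattern: the lookbehind is split into
# "start of string" or "preceded by whitespace" for Python's re module.
PATTERN = re.compile(
    CITATION + r"|(?:^|(?<=\s))(?:.(?!" + CITATION + r"))+"
)


def keep_non_citations(text):
    """Match everything, then drop the matches that are citations."""
    matches = [m.group(0) for m in PATTERN.finditer(text)]
    return [s for s in matches if not CITATION_RE.fullmatch(s)]


sample = "Text text text (Hung et al., 2020; Sung et al., 2021) text text"
print(keep_non_citations(sample))
```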

InSync
  • Match and capture what you want, only match what you don't want, then in code extract captures (an established technique, not something I just invented). – Cary Swoveland Mar 25 '23 at 08:22
  • Correct, but that requires another pair of parentheses, which further, and unnecessarily, complicates the answer. I'll add a link to an excellent answer on that matter. – InSync Mar 25 '23 at 08:33