How to exclude two words from regex?

Question

I have this regex:

\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*([\w\s][^cui]+)

That should match

] AN 1 words 2 words 3 words

or

] AV 1 words 2 words 3 words

The words after 3 should exclude "da cui", so "da\scui", but it doesn't work. Try it here: https://regex101.com/r/kI7Tan/1

What am I doing wrong?

Sample string:

campo]  AN  1 campo   2 prato  con penna B sps a  1   3 da cui campo con penna C as a  1  cfr Nota  filologica

Expected output: no match, because of the "da cui". So basically I want to match all words without the string "da cui".

Sorry, what is the string you have trouble with, and what is the expected output? — Wiktor Stribiżew, May 06 '20 at 15:30
It's not clear what you want to match and what you don't want to match. You should provide a list of example strings and state which patterns in those strings should be matched or not. — Paolo, May 06 '20 at 15:34
If it should not match at all, you can exclude it using a negative lookahead https://regex101.com/r/UhF3Rc/1 — The fourth bird, May 06 '20 at 15:35
I want to match the pattern: 1 words 2 words 3 words. The words after the number 3 do not have to contain "da cui". — Anna, May 06 '20 at 15:36
Use `\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*((?:(?!cui).)*)`, see [regex demo](https://regex101.com/r/kI7Tan/3). — Wiktor Stribiżew, May 06 '20 at 15:37
@Anna I [posted an answer](https://stackoverflow.com/a/61640177/3832970). — Wiktor Stribiżew, May 06 '20 at 16:24
Your question is not clear. You state, "The words after 3 should exclude 'da cui'...", but your regex only references "cui". I have stated my understanding of the question in my answer. Others have interpreted it differently. It's too late for you to clarify, however, as doing so would change the question substantively and render at least one of the answers incorrect, even though it may be correct in terms of the author's understanding of the question at the time. Just view each answer in terms of its stated or implied understanding of the question. — Cary Swoveland, May 07 '20 at 00:24

score 1 · Answer 1 · answered May 06 '20 at 15:38

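The final capture group of the regex, ([\w\s][^cui]+), matches:

Exactly one word character, from the first character class. In practice this class does not match whitespace here, because the preceding \s* in the regex has already consumed it.
One or more characters other than c, u, or i. The class [^cui] excludes those three single characters; it does not exclude the sequence "cui" or "da cui".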
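
To see this behaviour concretely, here is a minimal sketch using Python's re module. The pattern is the one from the question; the two short test strings and the helper name last_group are made up for illustration.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r"\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*([\w\s][^cui]+)"
)

def last_group(text):
    """Return what the final capture group holds, or None if nothing matches."""
    m = ORIGINAL.search(text)
    return m.group(4) if m else None

# [^cui] rejects the single characters c, u and i, not the sequence "cui",
# so the match still succeeds and simply stops in front of the "c".
with_da_cui = last_group("] AN 1 uno 2 due 3 da cui tre")
without_da_cui = last_group("] AN 1 uno 2 due 3 tre parole")

print(repr(with_da_cui))
print(repr(without_da_cui))
```

The first string still matches, which is why the pattern appears not to work: the group just ends before the first c, u or i it meets.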

If you want to exclude matches contingent on the word(s) da cui, use a negative lookahead.

\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*(?!.*da cui)(.*)

See the demo (regex101).
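
As a rough check of the lookahead version, the sketch below runs it in Python against the sample string from the question and against a made-up string without "da cui". The variable names are illustrative only.

```python
import re

# Regex from this answer: the negative lookahead rejects the whole match
# when "da cui" occurs anywhere after the 3.
LOOKAHEAD = re.compile(
    r"\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*(?!.*da cui)(.*)"
)

# Sample string from the question (contains "da cui" after the 3).
sample = ("campo]  AN  1 campo   2 prato  con penna B sps a  1   3 da cui "
          "campo con penna C as a  1  cfr Nota  filologica")

# Hypothetical string without "da cui", for comparison.
clean = "] AV 1 uno 2 due 3 tre parole"

rejected = LOOKAHEAD.search(sample)
accepted = LOOKAHEAD.search(clean)

print(rejected)
print(accepted.group(4))
```

Under this interpretation the sample string produces no match at all, while the second string is matched and its trailing words land in the last group.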

Update

Capture group reintroduced to the regex.

collapsar

You evidently interpreted, "The words after 3 should exclude 'da cui'..." as meaning that there should be no match of the string if "da cui" follows "3" (and the space(s) that follow). That interpretation may or may not be what the OP had in mind, but I wanted readers to understand that your answer is based on that interpretation of the question. (Correct me if I am wrong.) The same could be said of the other answers as well, as all reflect different interpretations of the question. – Cary Swoveland May 07 '20 at 00:38
@CarySwoveland My interpretation (which is reflected in the negative lookahead with the match-all subterm included) is that `da cui` should not occur in any position after the 3, allowing for words in between. Granted, whether that was the OP's intent is unclear. Filtering for words should probably not be delegated to regexen anyway. – collapsar May 07 '20 at 00:57

score 0 · Answer 2 · answered May 06 '20 at 16:23

You may use either of the two:

\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*((?:(?!cui).)*)
\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*(.*?)(?=cui|$)

See the regex demo

The (?:(?!cui).)* is a tempered greedy token that matches any char, 0 or more occurrences, as many as possible, that does not start a cui char sequence. The (.*?)(?=cui|$) pattern captures 0+ chars other than line break chars, as few as possible, up to the cui char sequence or end of string.
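
The two variants can be compared side by side with a short Python sketch. The test strings are invented for illustration, and PREFIX is just a helper name for the part the two patterns share.

```python
import re

# Shared part of both patterns from this answer.
PREFIX = r"\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*"

# Variant 1: tempered greedy token, stops before "cui".
TEMPERED = re.compile(PREFIX + r"((?:(?!cui).)*)")
# Variant 2: lazy match up to "cui" or the end of the string.
LAZY = re.compile(PREFIX + r"(.*?)(?=cui|$)")

with_cui = "] AN 1 uno 2 due 3 da cui tre"
without_cui = "] AN 1 uno 2 due 3 tre parole"

tempered_cut = TEMPERED.search(with_cui).group(4)
lazy_cut = LAZY.search(with_cui).group(4)
tempered_full = TEMPERED.search(without_cui).group(4)
lazy_full = LAZY.search(without_cui).group(4)

print(repr(tempered_cut), repr(lazy_cut))
print(repr(tempered_full), repr(lazy_full))
```

Note that both variants still produce a match when "cui" is present; they only cut the last group short in front of it, and they key on "cui" rather than on "da cui".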

Cary Swoveland · Answer 3 · 2020-05-07T00:11:49.413

My interpretation of the question concerns the string that follows one or more spaces after 3 and runs to the end of the line: if that string contains da cui, an empty string is to be saved to capture group 4; otherwise that string itself is to be saved to capture group 4.

You could use the following regular expression.

\]\s*(AN|AV)\s+1\s+([\w\s]+)\s+2\s+([\w\s]+)\s+3\s+((?=.*\bda cui\b)|(?!.*\bda cui\b).*)

Demo

This replaces 3\s*([\w\s][^cui]+) in the OP's regex with 3\s+((?=.*\bda cui\b)|(?!.*\bda cui\b).*).

Python's regex engine performs the following steps after matching 3.

\s+                  match 1+ whitespace characters
(                    begin capture group 4
  (?=.*\bda cui\b)   positive lookahead: 'da cui' follows 0+ chars
  |                  or
  (?!.*\bda cui\b)   negative lookahead: 'da cui' does not follow 0+ chars
  .*                 match rest of line
)                    end capture group 4

If the positive lookahead succeeds, an empty string is saved to the capture group; otherwise the rest of the line is saved.
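
The behaviour described above can be tried out with the following Python sketch. It writes the negative lookahead with the standard (?!...) syntax, and the two test strings are made up for illustration.

```python
import re

# Capture group 4 is empty when "da cui" follows the 3,
# otherwise it holds the rest of the line.
CONDITIONAL = re.compile(
    r"\]\s*(AN|AV)\s+1\s+([\w\s]+)\s+2\s+([\w\s]+)\s+3\s+"
    r"((?=.*\bda cui\b)|(?!.*\bda cui\b).*)"
)

blocked = CONDITIONAL.search("] AN 1 uno 2 due 3 da cui tre")
kept = CONDITIONAL.search("] AN 1 uno 2 due 3 tre parole")

# Both strings match; only the content of group 4 differs.
print(repr(blocked.group(4)))
print(repr(kept.group(4)))
```

Unlike a plain negative lookahead, this version always matches the leading part of the line, so groups 1 to 3 remain available even when "da cui" is present.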
