
I am looking for certain entries with special words in a string. The string looks like this.

entry 1: hello
entry 2: world
entry 3: this
is a multiline
that makes it hard
entry 4: here we have a special entry
entry 5: here
we
have 
another special entry
in a multiline
entry 6: end

Because it is a multiline problem, I use Java's DOTALL so that the . also matches newline characters.

I am looking for entries that contain the word special.

First I tried to find a regex that captures a full entry: entry \d+: .*?(?=\s*(entry \d: )|\Z). That is like a simplified version of this

Then I thought I just had to replace the .*? with the pattern I am looking for. But entry \d+: .*?special.*?(?=\s*(entry \d: )|\Z) does not work, probably because the special breaks the lazy matching of the expression.

Does anyone know a better solution?

Filou

3 Answers

1

You can use a tempered greedy token:

(?s)entry \d+: (?:(?!entry \d+: ).)*special.*?(?=\s*entry \d+: |$)

See the regex demo. Details:

  • entry \d+: - entry + space + one or more digits, :, space
  • (?:(?!entry \d+: ).)* - any char, repeated zero or more times, that does not start the entry + space + one or more digits, :, space sequence
  • special - a fixed string
  • .*? - any zero or more chars as few as possible
  • (?=\s*entry \d+: |$) - a positive lookahead that matches a location in the string that is immediately followed by zero or more whitespace chars, entry, space, one or more digits, : and space, or by the end of the string.

NOTE: Do not use Pattern.MULTILINE with this regex, because $ would then also match at the end of every line. Alternatively, keep using \Z (end of the string, or the position right before the trailing newline, LF char).
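
For illustration, here is a minimal Java sketch of how the pattern above might be applied. The class name SpecialEntries and the helper findSpecialEntries are hypothetical names chosen for this example; only the regex itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class SpecialEntries {

    // (?s) turns on DOTALL inline. The tempered token (?:(?!entry \d+: ).)*
    // consumes one char at a time and stops before the next entry header,
    // so "special" has to occur inside the current entry.
    private static final Pattern SPECIAL_ENTRY = Pattern.compile(
            "(?s)entry \\d+: (?:(?!entry \\d+: ).)*special.*?(?=\\s*entry \\d+: |$)");

    // Returns every entry (header included) that contains the word "special".
    static List<String> findSpecialEntries(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SPECIAL_ENTRY.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input from the question.
        String text = String.join("\n",
                "entry 1: hello",
                "entry 2: world",
                "entry 3: this",
                "is a multiline",
                "that makes it hard",
                "entry 4: here we have a special entry",
                "entry 5: here",
                "we",
                "have",
                "another special entry",
                "in a multiline",
                "entry 6: end");

        for (String entry : findSpecialEntries(text)) {
            System.out.println("[" + entry + "]");
        }
    }
}
```

With the sample input from the question, this should return entries 4 and 5, each including its trailing lines but not the newline before the next entry.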

Wiktor Stribiżew
0

[Edit:] I unfortunately missed the multiline nature of entries, so this answer is valid for single line entries but will return only the first line for multiline entries. I think one could overcome this by setting a certain regex for delimiter, though.

I'd suggest you use a Scanner to deal with the multiline aspect. This will give you a stream of tokens (the lines). You can then use String.contains(...) or String.matches(...) to filter the tokens.

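import java.util.Scanner;
import java.util.stream.Collectors;

// Set the delimiter on the Scanner first, then stream its tokens.
var result = new Scanner(myMultiLineString).useDelimiter("\\n")
                                           .tokens()
                                           // alternatively use String.contains(...)
                                           // if you're looking for a constant
                                           // rather than a complex rule.
                                           .filter(s -> s.matches(regex))
                                           .collect(Collectors.toList());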
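
The edit note at the top of this answer mentions that a regex delimiter might handle multiline entries. A rough sketch of that idea follows. The delimiter pattern, the class name EntryScanner and the method entriesContaining are assumptions made for this example, not something the answer specifies, and Scanner.tokens() requires Java 9 or later.

```java
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;

// Hypothetical helper class illustrating a regex delimiter for Scanner.
class EntryScanner {

    // Splits the text into whole entries and keeps those containing `word`.
    static List<String> entriesContaining(String text, String word) {
        try (Scanner scanner = new Scanner(text)) {
            return scanner
                    // Assumed delimiter: optional whitespace followed by an
                    // entry header. The lookahead leaves the header in the token.
                    .useDelimiter("\\s*(?=entry \\d+: )")
                    .tokens()
                    .filter(entry -> entry.contains(word))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        String text = "entry 1: hello\nentry 2: a special\nthing\nentry 3: end";
        entriesContaining(text, "special")
                .forEach(entry -> System.out.println("[" + entry + "]"));
    }
}
```

Because each token is a complete entry, a plain String.contains(...) check is enough here, and punctuation such as colons inside the entry text does not matter.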
Amadán
0

If you use word and whitespace character classes instead of dots, it seems to work:

/entry \d+: [\w\s]*special[\w\s]*?(?=\s*(?:entry \d+:)|$)/gm

It seems that if you allow the colon : in your text, it breaks the expression.

Also, you have \Z in your expression, but it seems to me that the end-of-line anchor $ is better suited here.

absence
  • 1
    Also, the lookahead should match multi-digit entry numbers as well, so please edit `\d` to `\d+` inside the lookahead. – K450 Dec 08 '21 at 08:10
  • Interesting. Do you know _why_ the colon breaks the expression? – Filou Dec 08 '21 at 08:15
  • I think there is one catch with this solution. The regex only matches an entry up to the line in which the _special_ is included. In the example the match of entry 5 does not match the line before the last ("in a multiline"). – Filou Dec 08 '21 at 08:24
  • I think it finds the first entry and as `entry 123:` now matches `[\w\s:]*` proceeds through all the entries till it finds one with `special` in it – absence Dec 08 '21 at 08:26
  • 1
    Ok, now I get it. Unfortunately, this solution does not solve my problem, because there are colons (as well as other punctuation marks) allowed in the text of the entry. – Filou Dec 08 '21 at 08:46
  • Maybe then it's a good idea to first separate your entries in the list of strings and then search for the special words in these strings – absence Dec 08 '21 at 08:52