I need help extracting some words from this sentence:

    String keywords = "I like to find something vicous in somewhere bla bla bla.\r\n" +
            "https://address.suffix.com/level/somelongurlstuff";

And my matching code looks somewhat like this:

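    // Lower-case both the input and the pattern so the literal parts match
    keywords = keywords.toLowerCase();
    String regex = "(I like to find )(.*)( in )(.*)(\\.){1}(.*)";
    regex = regex.toLowerCase();
    // Keep group 4 (after " in ") followed by group 2 (after "find ")
    keywords = keywords.replaceAll(regex, "$4 $2");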

I want to extract the words between "find" and "in", and between "in" and the first dot. However, since the URL contains multiple dots, something weird happens: I get what I need PLUS the URL with its dots replaced by spaces. I want the URL to be gone, because it is supposed to be matched by the final (.*), and I only need one dot after my words, hence (\\.){1}. So what is going wrong there? Any ideas?

Adding (?s), or removing all newline characters before matching, gives something like: somewhere bla bla bla address suffix something vicious. So the problem persists: the URL, stripped of its dots, is still left in the result.

This is NOT just about matching multiline text.

jaco0646
Arturas M
  • Add `(?s)` in front of the pattern to enable the DOTALL mode and force `.` to match any character including a newline. And remove `{1}` that is redundant. – Wiktor Stribiżew Apr 22 '16 at 09:22
  • @WiktorStribiżew this doesn't solve the problem, and it doesn't have much in common with the other question that you marked this one as a duplicate of. Adding (?s), or removing all newline characters before matching, gives something like: "somewhere bla bla bla https://address suffix something vicious", so the problem with the URL, stripped of its dots, still being left there... – Arturas M Apr 22 '16 at 10:02
  • Well, your question sounds rather unclear (maybe formatting could help?). I guess you just need both DOTALL and lazy matching: [`(?s)(I like to find )(.*)( in )(.*?)(\.)(.*)`](https://regex101.com/r/zZ2hG7/1). Or [`(I like to find )(.*)( in )([^.]*)(\.)(.*)`](https://regex101.com/r/zZ2hG7/2). – Wiktor Stribiżew Apr 22 '16 at 10:12
  • Also, if you need it before the first `" in "`, use [`(I like to find )(.*?)( in )([^.]*)(\.)(.*)`](https://regex101.com/r/zZ2hG7/3). – Wiktor Stribiżew Apr 22 '16 at 10:21
  • @WiktorStribiżew The (?s)(I like to find )(.*)( in )(.*?)(\.)(.*) solved it, thanks. However, I don't understand why. What does adding the "(.*?)" do exactly? It is supposed to be a reluctant quantifier, but I don't understand what it changes, since the (.*) is supposed to go only to the next (\.) anyway – Arturas M Apr 22 '16 at 10:32
  • No, the reluctant (lazy) quantifier makes the engine match as few characters as possible between the lazily quantified subpattern and the next subpattern. – Wiktor Stribiżew Apr 22 '16 at 10:33
  • @WiktorStribiżew Aha, I see, I was thinking about that, but then why does it work without the reluctant (lazy) quantifier before the " in "? With (.*)( in ) it really just goes to the ( in ) and stops, but in the second case it somehow needs this quantifier – Arturas M Apr 22 '16 at 11:31
  • That is actually where *backtracking* comes into play. Let me edit the answer. – Wiktor Stribiżew Apr 22 '16 at 11:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/109933/discussion-between-arturas-m-and-wiktor-stribizew). – Arturas M Apr 22 '16 at 15:33

1 Answer

You need to fix two things: 1) add the DOTALL modifier, since your text spans multiple lines, and 2) use lazy dot matching or, more efficiently, a negated character class [^.] to match characters up to the first . after " in ":

(?s)(I like to find )(.*)( in )([^.]*)(\.)(.*)
                               ^^^^^^^

See the regex demo
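In Java, the pattern above could be applied roughly as follows. This is only a sketch: the class and method names (`ExtractKeywords`, `extract`) are made up for illustration, and it keeps the original casing instead of calling toLowerCase() first.

```java
import java.util.regex.Pattern;

class ExtractKeywords {
    // (?s) enables DOTALL so that '.' also matches \r and \n;
    // [^.]* stops group 4 at the first dot after " in ".
    private static final Pattern PATTERN = Pattern.compile(
            "(?s)(I like to find )(.*)( in )([^.]*)(\\.)(.*)");

    // Hypothetical helper: keeps group 4 followed by group 2, drops the rest.
    static String extract(String text) {
        return PATTERN.matcher(text).replaceAll("$4 $2");
    }

    public static void main(String[] args) {
        String keywords = "I like to find something vicous in somewhere bla bla bla.\r\n"
                + "https://address.suffix.com/level/somelongurlstuff";
        // The trailing (.*) swallows the line break and the whole URL.
        System.out.println(extract(keywords)); // somewhere bla bla bla something vicous
    }
}
```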

However, the best one would be this one:

(?s)(I like to find )(.*?)( in )([^.]*)(\.)(.*)

The reluctant (lazy) quantifier makes the engine match as few characters as possible between the lazily quantified subpattern and the next subpattern. If we use .* before ( in ), backtracking will occur: the regex engine first grabs the whole rest of the string after "I like to find " and then moves backwards, looking for the last " in ". Using .*? instead matches up to the first " in ".
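The difference only shows up when the text contains more than one " in ". Here is a small sketch with an invented sentence (not the one from the question) and a hypothetical helper `groups` that returns groups 2 and 4:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: returns "group2 | group4", or null if there is no match.
    static String groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(2) + " | " + m.group(4) : null;
    }

    public static void main(String[] args) {
        // Made-up input with two " in " occurrences.
        String text = "I like to find cats in boxes in winter. The end.";
        String greedy = "(?s)(I like to find )(.*)( in )([^.]*)(\\.)(.*)";
        String lazy = "(?s)(I like to find )(.*?)( in )([^.]*)(\\.)(.*)";

        // Greedy .* backtracks from the end of the text to the LAST " in ".
        System.out.println(groups(greedy, text)); // cats in boxes | winter
        // Lazy .*? expands from the left and stops at the FIRST " in ".
        System.out.println(groups(lazy, text));   // cats | boxes in winter
    }
}
```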

Instead of [^.]* you can use . with a reluctant quantifier (.*?) to match up to the first dot, but it costs more in performance, since the engine expands the subpattern one character at a time each time the subsequent subpatterns fail to match.
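On the string from the question both variants should capture the same text in group 4. A quick way to check that, with a made-up helper `group4`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsNegatedClass {
    // Hypothetical helper: returns group 4 of the first match, or null.
    static String group4(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        String text = "I like to find something vicous in somewhere bla bla bla.\r\n"
                + "https://address.suffix.com/level/somelongurlstuff";
        String negated = "(?s)(I like to find )(.*?)( in )([^.]*)(\\.)(.*)";
        String lazyDot = "(?s)(I like to find )(.*?)( in )(.*?)(\\.)(.*)";

        // Both stop at the first dot here, because the trailing (.*) always succeeds.
        System.out.println(group4(negated, text)); // somewhere bla bla bla
        System.out.println(group4(lazyDot, text)); // somewhere bla bla bla
    }
}
```

Keep in mind that the two are not interchangeable in general: [^.]* can never cross a dot, while .*? will cross one if that is the only way for the rest of the pattern to match.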

Check my answer for Perl regex matching optional phrase in longer sentence to understand how greedy and lazy (=reluctant) quantifiers work.

Community
Wiktor Stribiżew
  • Note that the first regex needs 268 steps to complete the match and the second one - just 85. Use lazy matching to get as few characters between two subpatterns as possible. – Wiktor Stribiżew Apr 22 '16 at 11:42