
I'm fairly new to regex, though I have studied it a bit. I've run into a problem that may turn out to be impossible to solve with regex, so I need some advice.

I have the following string:

some text key 12, 32, 311 ,465 and 345. some other text dog 612, 
12, 32, 9 and 10. some text key 1, 2.

I'm trying to figure out whether it is possible (using regex only) to extract only the numbers 12 32 311 465 345 1 2, as a set of individual matches.

When I approached this problem, I tried to look for a pattern that matches only the relevant results. So I came up with:

  • get numbers that are prefixed by "key" and NOT prefixed by "dog".

But I'm not sure whether that is even possible. I know that for the first number I can use (?<=key )+[\d]+ and get it as a result, but for the rest of the numbers (i.e. the second through the fifth), can I "use" the key prefix again?
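To make the limitation concrete, here is a minimal Java sketch of what a plain look-behind returns on the string above. It uses a simplified form of the attempt, `(?<=key )\d+`; the class and method names (`LookbehindFirstOnly`, `findAll`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindFirstOnly {
    // Simplified form of the attempt: a digit run directly preceded by "key ".
    private static final Pattern KEY_THEN_NUMBER = Pattern.compile("(?<=key )\\d+");

    // Collects every match of the pattern as a separate string.
    static List<String> findAll(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = KEY_THEN_NUMBER.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "some text key 12, 32, 311 ,465 and 345. some other text dog 612, "
                + "12, 32, 9 and 10. some text key 1, 2.";
        // Only the number immediately after each "key " is found.
        System.out.println(findAll(text)); // prints [12, 1]
    }
}
```

Only the first number of each list is returned, because the look-behind requires "key " to sit immediately before every match.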

Gil Peretz
  • Could you try to rewrite your question? It's not clear. Do you want to extract numbers or digits? What would you like to get from 11, 51, 1dog, dog1? – xenteros Jul 23 '15 at 06:38
  • Extracting numbers. The point is to extract only the numbers that follow the `key` string and NOT the ones that follow the `dog` string – Gil Peretz Jul 23 '15 at 06:46
  • Are you expecting a single string match (i.e. `"12, 32, 311 ,465 and 345"`), or are you looking for a set of individual matches (i.e. `{12,32,311,465,345}`)? – Simon MᶜKenzie Jul 23 '15 at 06:50
  • I'm expecting a set of individual matches. (Edited my question.) Thank you – Gil Peretz Jul 23 '15 at 06:54

4 Answers


In Java, you can make use of a constrained-width look-behind, which accepts the {n,m} limiting quantifier.

So, you can use

(?<=key(?:(?!dog)[^.]){0,100})[0-9]+

Or, if key and dog are whole words, use the \b word boundary:

String pattern = "(?<=\\bkey\\b(?:(?!\\bdog\\b)[^.]){0,100})[0-9]+";

The only problem may arise if the distance between key and a number is greater than m (100 in the patterns above); such a number will not be matched. You may increase m to 1000, and I think that would work for most cases.
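The effect of the upper bound m can be seen with a deliberately small value. The following is only a sketch: the helper `extract` and its `maxGap` parameter are hypothetical names, and the pattern is the one from this answer with the bound made configurable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindBound {
    // Builds the look-behind pattern from this answer with a configurable
    // upper bound (maxGap is a hypothetical parameter name).
    static List<String> extract(String text, int maxGap) {
        Pattern p = Pattern.compile(
                "(?<=key(?:(?!dog)[^.]){0," + maxGap + "})[0-9]+");
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "key 1, 2, 3, 4, 5, 6";
        // With a bound of 10, numbers more than 10 characters after "key" are missed.
        System.out.println(extract(text, 10));  // prints [1, 2, 3, 4]
        // With a bound of 100, every number in the list is within reach.
        System.out.println(extract(text, 100)); // prints [1, 2, 3, 4, 5, 6]
    }
}
```

The bound trades coverage for the fixed maximum width that Java requires inside a look-behind, so it should be chosen to exceed the longest list expected in the input.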

Sample IDEONE demo

String str = "some text key 12, 32, 311 ,465 and 345. some other text dog 612,\n12, 32, 9 and 10. some text key 1, 2.";
String str2 = "some text key 1, 2, 3 ,4 and 5. some other text dog 6, 7, 8, 9 and 10. some text, key 1, 2 dog 3, 4 key 5, 6";
Pattern ptrn = Pattern.compile("(?<=key(?:(?!dog)[^.]){0,100})[0-9]+");
Matcher m = ptrn.matcher(str);
while (m.find()) {
   System.out.println(m.group(0));
}
System.out.println("-----");
m = ptrn.matcher(str2);
while (m.find()) {
   System.out.println(m.group(0));
}
Wiktor Stribiżew

I wouldn't recommend using code that you can't understand and customize, but here is my one-pass solution, using the method described in this answer of mine. If you want to understand the construction method, please read the other answer.

(?:key(?>\s+and\s+|[\s,]+)|(?!^)\G(?>\s+and\s+|[\s,]+))(\d+)

Compared to the method described in the other post, I dropped the look-ahead, since in this case, we don't need to check against a suffix.

The separator here is (?>\s+and\s+|[\s,]+). It currently allows "and" with spaces on both sides, or any mix of spaces and commas. I use (?>pattern) to inhibit backtracking, so the order of the alternation is significant. Change it back to (?:pattern) if you want to modify it and you are unsure of what you are doing.

Sample code:

String input = "some text key 12, 32, 311 ,465 and 345. some other text dog 612,\n12, 32, 9 and 10. some text key 1, 2. key 1, 2 dog 3, 4 key 5, 6. key is dog 23, 45. key 4";
Pattern p = Pattern.compile("(?:key(?>\\s+and\\s+|[\\s,]+)|(?!^)\\G(?>\\s+and\\s+|[\\s,]+))(\\d+)");
Matcher m = p.matcher(input);
List<String> numbers = new ArrayList<>();

while (m.find()) {
    numbers.add(m.group(1));
}

System.out.println(numbers);

Demo on ideone
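To see how the \G branch chains one match onto the previous one, a small sketch can print where each match starts and ends. The class and method names here (`ChainedMatches`, `extract`) are invented for the illustration; the regex is the one from this answer, run on the question's input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChainedMatches {
    // Same regex as in this answer: either start at "key" or continue from
    // the end of the previous match (\G), then skip one separator.
    private static final Pattern P = Pattern.compile(
            "(?:key(?>\\s+and\\s+|[\\s,]+)|(?!^)\\G(?>\\s+and\\s+|[\\s,]+))(\\d+)");

    // Returns the captured numbers (group 1) in order of appearance.
    static List<String> extract(String input) {
        List<String> numbers = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String input = "some text key 12, 32, 311 ,465 and 345. some other text dog 612, "
                + "12, 32, 9 and 10. some text key 1, 2.";
        Matcher m = P.matcher(input);
        int previousEnd = -1;
        while (m.find()) {
            // A chained match begins exactly where the previous match ended.
            boolean chained = m.start() == previousEnd;
            System.out.println(m.group(1) + " start=" + m.start()
                    + " end=" + m.end() + " chained=" + chained);
            previousEnd = m.end();
        }
        System.out.println(extract(input)); // prints [12, 32, 311, 465, 345, 1, 2]
    }
}
```

The chain breaks at the first character that is not a separator followed by digits (the "." here), so numbers after "dog" are never reached.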

nhahtdh
  • This is really the correct regex +1. Was going to post similar but dropped due to loose nature of input data structure. – anubhava Jul 23 '15 at 08:13
1

You can use a positive look-behind, which ensures that your sequence is not preceded by any word except key:

(?<=key)\s(?:\d+[\s,]+)+(?:and )?\d+

Note that here you don't need a negative look-behind for dog, because this regex matches only if your sequence is preceded by key.

See demo https://regex101.com/r/gZ4hS4/3
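As a rough check of what this pattern returns in Java, the sketch below runs it on the question's input. The names `WholeSequenceMatch` and `findSequences` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeSequenceMatch {
    // The pattern from this answer, escaped for a Java string literal.
    private static final Pattern P = Pattern.compile(
            "(?<=key)\\s(?:\\d+[\\s,]+)+(?:and )?\\d+");

    // Returns each full match; every match is one whole number sequence.
    static List<String> findSequences(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "some text key 12, 32, 311 ,465 and 345. some other text dog 612, "
                + "12, 32, 9 and 10. some text key 1, 2.";
        // Each element is a whole sequence, including the whitespace matched by \s.
        for (String sequence : findSequences(text)) {
            System.out.println("[" + sequence + "]");
        }
    }
}
```

Each match covers a whole sequence (with the leading whitespace consumed by \s), so the individual numbers still have to be pulled out of each match afterwards.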

Mazdak

You can do it in 2 steps.

(?<=key\\s)\\d+(?:\\s*(?:,|and)\\s*\\d+)*

This grabs each whole list of numbers that follows key. See demo.

https://regex101.com/r/uK9cD8/6

Then split each match, or extract \\d+ from it. See demo.

https://regex101.com/r/uK9cD8/7
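The two steps could be wired together in Java roughly as follows. This is a sketch under the assumption that the input looks like the question's string; `TwoStepExtract` and `extract` are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStepExtract {
    // Step 1: a whole list of numbers that directly follows "key" and one whitespace.
    private static final Pattern SEQUENCE = Pattern.compile(
            "(?<=key\\s)\\d+(?:\\s*(?:,|and)\\s*\\d+)*");
    // Step 2: the individual numbers inside one matched list.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static List<String> extract(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher sequences = SEQUENCE.matcher(text);
        while (sequences.find()) {
            // Run the second pattern only over the text of the matched list.
            Matcher inner = NUMBER.matcher(sequences.group());
            while (inner.find()) {
                numbers.add(inner.group());
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        String text = "some text key 12, 32, 311 ,465 and 345. some other text dog 612, "
                + "12, 32, 9 and 10. some text key 1, 2.";
        System.out.println(extract(text)); // prints [12, 32, 311, 465, 345, 1, 2]
    }
}
```

Keeping the two patterns separate makes each one easy to read and adjust, at the cost of a nested loop.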

vks
  • I'm afraid two steps are not an option for me. Is it possible in a single step? – Gil Peretz Jul 23 '15 at 07:23
  • @GilPeretz Two steps is a sure-shot way of achieving what you want. You cannot do it in a single regex with 100% accuracy anyway. – vks Jul 23 '15 at 07:25
  • Initially, you were looking for a single regex to do it. I see the limitations of a constrained width lookbehind and nested lookarounds scared you off. Anyway, it's great you have solutions to choose from! – Wiktor Stribiżew Jul 23 '15 at 14:51