1

Hi guys, I'm trying to get a substring as well as the corresponding number from this string:

text = "Milk for human consumption may be taken only from cattle from 80 hours after the last treatment."

I want to select the word milk and the corresponding number 80 from this sentence. This is part of a larger file, and I want a generic solution that finds the word milk in a line and then the first number that occurs after it anywhere on that line.

(Milk+)\d

This is what I came up with, thinking that I could capture milk in a group and then check for digits, but I'm stumped on how to search for numbers anywhere on the line and not just immediately after the word milk. Also, is there any way to make the search case-insensitive?

Edit: I'm looking to get both the word and the number if possible, e.g. "milk" and "80", and I'm using Python.

darkCoffy
  • Try `Milk.*?(\d+)` – Wiktor Stribiżew Feb 28 '20 at 13:36
  • While this gives me the number it does not give me the text. I need both the word and the number extracted – darkCoffy Feb 28 '20 at 13:44
  • By adding ```(?i)``` at the front like so: ```(?i)(milk).*?\d+``` the search for the word "milk" will be case insensitive. But it still returns anything in between the word milk and the number and also it does not pay attention whether both are in the same line. – L483 Feb 28 '20 at 13:47
  • You mean you want to get `Milk 80` as output? You need to replace then, `.replace(/.*\b(Milk)\b.*?(\d+).*/, '$1 $2')` – Wiktor Stribiżew Feb 28 '20 at 13:52

3 Answers

1

This seems to work the way you want in Java (I overlooked that the asker wanted Python, or the question was edited later):

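import java.util.regex.Matcher;
import java.util.regex.Pattern;

String example =
    "Test 40\n" +
    "Test Test milk for human consumption may be taken only from cattle from hours after the last treatment." +
    "\nTest Milk for human consumption may be taken only from cattle from 80 hours after the last treatment." +
    "\nTest miLk for human consumption may be taken only from cattle from 80 hours after the last treatment.";

// (?i) makes the pattern case-insensitive; "." does not match line breaks by default,
// so "milk" and the number must be on the same line.
Matcher m = Pattern.compile("((?i)(milk).*?(\\d+).*\n?)+").matcher(example);
if (m.find()) {  // group() throws IllegalStateException if nothing matched
    System.out.print(m.group(2) + m.group(3));  // prints "Milk80"
}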

It checks, case-insensitively, whether the word "milk" appears anywhere before a number on the same line, and prints only those two values. It prints only the first occurrence found; finding all occurrences is also easy with a small modification of the code.

I hope extracting both values from the match this way fits your task.

L483
  • I'm doing it in Python, so I'm trying to convert this to a pure regex implementation. https://regex101.com/r/5IhGRO/1 Can you have a look at it here and see what I'm doing wrong? :( – darkCoffy Feb 28 '20 at 14:19
  • Remove one of the backslashes before "d+". The double backslash is only needed in some programming languages, because there the ```\``` is itself an escape character in string literals. – L483 Feb 28 '20 at 14:22
  • Unfortunately I don't know Python, but the implementation should be possible in a similar way: compile the regex string, apply it to your input, find the first matching occurrence, and output the capture groups for "milk" and the number in that occurrence. – L483 Feb 28 '20 at 14:25
  • I googled a bit. It looks like you can't specify in Python, in the regex itself, that it should be case-insensitive. Maybe [this post](https://stackoverflow.com/questions/500864/case-insensitive-regular-expression-without-re-compile) or [this site](https://docs.python.org/3/howto/regex.html) will be of some help to you. – L483 Feb 28 '20 at 14:37
  • Well, you can use the modifier, but it shows a deprecation warning. I found the solution using an ignore-case flag. Also modified the fact that I needed a new search every. Here's my final solution in Python: ``` re.findall(r'((milk).*?(\d+).?)+', text, re.IGNORECASE) ``` – darkCoffy Feb 28 '20 at 14:46
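For readers following along in Python, the approach from this answer and its comments can be sketched roughly as below. This is a minimal illustration and not the answerer's code: the names `MILK_NUMBER`, `first_milk_number` and `all_milk_numbers` are made up for the example, and the outer repeated group from the original pattern is dropped on the assumption that only the word and the number are needed. Passing `re.IGNORECASE` avoids the inline-modifier warning mentioned above, which Python raises when `(?i)` is not at the very start of the pattern.

```python
import re

# re.IGNORECASE plays the role of the (?i) modifier.
# "." does not match a newline by default, so each match stays on one line.
MILK_NUMBER = re.compile(r"(milk).*?(\d+)", re.IGNORECASE)


def first_milk_number(text):
    """Return (word, number) for the first match, or None if there is none."""
    m = MILK_NUMBER.search(text)
    return (m.group(1), m.group(2)) if m else None


def all_milk_numbers(text):
    """Return one (word, number) pair per match in the text."""
    return [(m.group(1), m.group(2)) for m in MILK_NUMBER.finditer(text)]


text = "Milk for human consumption may be taken only from cattle from 80 hours after the last treatment."
print(first_milk_number(text))  # ('Milk', '80')
```

`finditer` is used for the "all occurrences" case because it yields match objects, so the groups can be picked out by number without the extra outer group that `findall` would otherwise return.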
1
/(?<!\p{L})([Mm]ilk)(?!\p{L})\D*(\d+)/

This matches the following strings, with the match and the contents of the two capture groups noted.

"The Milk99"             # "Milk99"     1:"Milk" 2:"99" 
"The milk99 is white"    # "milk99"     1:"milk" 2:"99"
"The 8 milk is 99"       # "milk is 99" 1:"milk" 2:"99"
"The 8milk is 45 or 73"  # "milk is 45" 1:"milk" 2:"45"

The following strings are not matched.

"The Milk is white"
"The OJ is 99"
"The milkman is 37"
"Buttermilk is 99"
"MILK is 99"

This regular expression could be made self-documenting by writing it in free-spacing mode:

/
(?<!\p{L}) # the following match is not preceded by a Unicode letter
([Mm]ilk)  # match 'M' or 'm' followed by 'ilk' in capture group 1
(?!\p{L})  # the preceding match is not followed by a Unicode letter
\D*        # match zero or more characters other than digits
(\d+)      # match one or more digits in capture group 2 
/x         # free-spacing regex definition mode

\D* could be replaced with .*?, where the ? makes the match non-greedy. If the greedy variant (.*) were used, the second capture group for "The 8milk is 45 or 73" would contain only "3".

To match "MILK is 99", change ([Mm]ilk) to (?i)(milk).
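Since the question is about Python, here is a hedged sketch of how this pattern might be adapted there. Python's built-in `re` module does not support `\p{L}`, so the sketch substitutes `[^\W\d_]` (a word character that is neither a digit nor an underscore) as an approximation of "Unicode letter"; that substitution, and the names `LETTER`, `PATTERN` and `find_milk`, are assumptions made for illustration. Case-insensitivity is requested through the `flags` argument instead of a mid-pattern `(?i)`.

```python
import re

# Approximation of \p{L}: a word character that is not a digit or underscore.
LETTER = r"[^\W\d_]"
PATTERN = rf"(?<!{LETTER})([Mm]ilk)(?!{LETTER})\D*(\d+)"


def find_milk(s, flags=0):
    """Return (word, number) for the first match in s, or None."""
    m = re.search(PATTERN, s, flags)
    return (m.group(1), m.group(2)) if m else None


print(find_milk("The 8milk is 45 or 73"))         # ('milk', '45')
print(find_milk("The milkman is 37"))             # None
print(find_milk("MILK is 99", re.IGNORECASE))     # ('MILK', '99')
```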

Cary Swoveland
  • This a good solution. is it possible to include a case where it will stop searching once it encounters a period? eg : "Milk for human consumption may be taken only from cattle from after the last treatment. Meat can be taken in 4 days." the regex given will return 4. can we make it stop at the period and not take the next line into consideration as milk does not come in that line? – darkCoffy Mar 02 '20 at 13:12
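One possible way to handle the case raised in the comment above is to stop the filler part of the pattern from crossing a period. The sketch below is an assumption-laden illustration, not part of the original answer: it swaps `\D*` for `[^\d.]*`, uses `\b` as a simpler word boundary, and the names `SAME_SENTENCE` and `milk_in_sentence` are hypothetical. It treats every period as a sentence end, so abbreviations or decimal numbers would need extra handling.

```python
import re

# [^\d.]* skips anything except digits and periods, so the search
# cannot run past the end of the sentence that contains "milk".
SAME_SENTENCE = re.compile(r"\b(milk)\b[^\d.]*(\d+)", re.IGNORECASE)


def milk_in_sentence(s):
    """Return (word, number) if a number follows 'milk' before the next period."""
    m = SAME_SENTENCE.search(s)
    return (m.group(1), m.group(2)) if m else None


no_number = ("Milk for human consumption may be taken only from cattle "
             "from after the last treatment. Meat can be taken in 4 days.")
print(milk_in_sentence(no_number))  # None
```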
0

You should try this one

(Milk).*?(\d+)

Depending on your language, you can also make the search case-insensitive. For example, in JS: /(Milk).*?(\d+)/i, where the final i makes the search case-insensitive.

Note the *?, which is the most important part! This is a lazy quantifier. In other words, it reads any character, but as soon as it can stop and match the next part of the pattern successfully, it does. Here, as soon as a digit can be read, it is read. A plain * is greedy: it would consume as much of the line as possible, leaving only the last digit of the last number after Milk for the second group.
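The difference between the lazy and greedy forms can be checked in Python with a small experiment like the following. This is only an illustrative sketch using the sentence from the question; the variable names are made up. It also shows that in Python's `re` module `.` does not match a newline unless `re.DOTALL` is set, so by default the word and the number have to be on the same line.

```python
import re

text = "Milk for human consumption may be taken only from cattle from 80 hours after the last treatment."

lazy = re.search(r"(Milk).*?(\d+)", text)
greedy = re.search(r"(Milk).*(\d+)", text)

print(lazy.group(2))    # 80
print(greedy.group(2))  # 0  (only the last digit is left for the group)

two_lines = "Milk is white\nJuice costs 5"
same_line_only = re.search(r"(Milk).*?(\d+)", two_lines)
across_lines = re.search(r"(Milk).*?(\d+)", two_lines, re.DOTALL)

print(same_line_only)         # None
print(across_lines.group(2))  # 5
```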

David Amar
  • But it does not test whether the number and the word milk are in the same line in a multi-line input. – L483 Feb 28 '20 at 13:50