
Hello, I'm trying to use a regex to search through a Markdown file for a date, and I only want a match if a specific string appears before the next date.

This is what I have right now, and it definitely doesn't work: `(\d{2}\/\d{2}\/\d{2})(string)?(^(\d{2}\/\d{2}\/\d{2}))`

In this instance it should match, since the string comes before the next date:

01/20/20

string

01/21/20

Here it shouldn't match since the string is after the next date:

01/20/20

this isn't the phrase you're looking for

01/21/20

string

Any help on this would be greatly appreciated.

  • Do you mean like this? `\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string.*?\d{2}\/\d{2}\/\d{2}` https://regex101.com/r/FREPRt/1 – The fourth bird Jan 12 '20 at 14:58

2 Answers


One approach here would be to use a tempered dot to ensure that the regex engine does not cross over the ending date while trying to find the string after the starting date. For example:

import re

inp = """01/20/20

string                  # <-- this is matched

01/21/20

01/20/20

01/21/20

string"""               # <-- this is not matched

matches = re.findall(r'01/20/20(?:(?!\b01/21/20\b).)*?(\bstring\b).*?\b01/21/20\b', inp, flags=re.DOTALL)
print(matches)

This prints ['string'], a single match: only the first occurrence sits between the starting and ending dates.
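To see what the tempered dot contributes, here is a minimal sketch (not part of the original answer) that compares it with a plain lazy dot on the question's second sample. The names DATE, lazy_dot and tempered_dot are made up for illustration, and the dates are generalized to \d{2}/\d{2}/\d{2} as an assumption.

```python
import re

# Hypothetical helper name: a generic two-digit date, as in the question.
DATE = r"\d{2}/\d{2}/\d{2}"

# Plain lazy dot: free to run across the next date to reach "string".
lazy_dot = DATE + r".*?string"

# Tempered dot: each character is consumed only if no date starts there.
tempered_dot = DATE + r"(?:(?!" + DATE + r").)*?string"

text = "01/20/20\n\nthis isn't the phrase you're looking for\n\n01/21/20\n\nstring"

lazy_matches = re.findall(lazy_dot, text, flags=re.DOTALL)
tempered_matches = re.findall(tempered_dot, text, flags=re.DOTALL)

# The lazy version starts at 01/20/20 and crosses 01/21/20 to find "string".
print(lazy_matches[0][:8])      # 01/20/20
# The tempered version can only match from the date that "string" follows.
print(tempered_matches[0][:8])  # 01/21/20
```

The tempered version still reports a match here, but it is anchored at 01/21/20, the date that the string actually follows.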

Tim Biegeleisen

You could match a date-like pattern, then use a tempered greedy token, (?:(?!\d{2}\/\d{2}\/\d{2}).)*, to reach string without crossing another date first.

Once string has matched, a non-greedy dot .*? matches up to the first occurrence of the next date.

\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string.*?\d{2}\/\d{2}\/\d{2}

Regex demo | Python demo

For example (using re.DOTALL so that the dot also matches a newline):

import re

regex = r"\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}"

test_str = ("01/20/20\n\n"
    "string\n\n"
    "01/21/20\n\n"
    "01/20/20\n\n"
    "this isn't the phrase you're looking for\n\n"
    "01/21/20\n\n"
    "string")

print(re.findall(regex, test_str, re.DOTALL))

Output

['01/20/20\n\nstring\n\n01/21/20']

If the string must not occur twice between the dates, you could use

\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}|string).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}

Regex demo
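As a rough illustration of the difference, the sketch below runs both variants on a block where string appears twice between the dates. The names DATE, allow_repeats and exactly_once are hypothetical, and the sample inputs are invented for the comparison.

```python
import re

# Hypothetical helper name: the date pattern from the answer, without escaped slashes.
DATE = r"\d{2}/\d{2}/\d{2}"

# First tempered token only refuses to cross a date, so earlier repeats are skipped over.
allow_repeats = re.compile(
    rf"{DATE}(?:(?!{DATE}).)*string(?:(?!string|{DATE}).)*{DATE}", re.DOTALL
)

# First tempered token also refuses to cross "string", so only one occurrence is allowed.
exactly_once = re.compile(
    rf"{DATE}(?:(?!{DATE}|string).)*string(?:(?!string|{DATE}).)*{DATE}", re.DOTALL
)

once = "01/20/20\nstring\n01/21/20"
twice = "01/20/20\nstring\nstring\n01/21/20"

print(allow_repeats.findall(twice))  # one match covering the whole block
print(exactly_once.findall(twice))   # []
print(exactly_once.findall(once))    # one match covering the whole block
```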

Note that if you don't want the string or the dates to match as part of a larger word, you could add word boundaries (\b) around them.
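A possible way to add those boundaries is sketched below. The names DATE, WORD and pattern are illustrative only, and the sample inputs are made up to show that "substring" no longer counts as a hit.

```python
import re

# Hypothetical helper names; \b keeps each piece from matching inside a longer token.
DATE = r"\b\d{2}/\d{2}/\d{2}\b"
WORD = r"\bstring\b"

pattern = re.compile(
    rf"{DATE}(?:(?!{DATE}).)*{WORD}(?:(?!{WORD}|{DATE}).)*{DATE}", re.DOTALL
)

# "substring" contains "string", but not as a whole word.
print(pattern.findall("01/20/20\nsubstring\n01/21/20"))  # []

# A standalone "string" between two dates still matches.
print(pattern.findall("01/20/20\nstring\n01/21/20"))
```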

The fourth bird
  • @TimBiegeleisen It is just the answer for my comment, which looks like your answer afterwards :) But you will get my vote anyway – The fourth bird Jan 12 '20 at 15:29
  • @The Fourth Bird That does match correctly, but two things: first, how do I get it to return all instances of this match? Second, how do I get it to return only the initial date once it has confirmed a match? – Captain Po-Po Jan 12 '20 at 15:31
  • @CaptainPo-Po Fourth's answer already matches all instances of the match +1. For the second requirement, you only need to change the capture group in the call to `re.findall`. – Tim Biegeleisen Jan 12 '20 at 15:35
  • @CaptainPo-Po You could indeed use a capturing group for the first date https://regex101.com/r/khGQ8G/1 Using re.findall will return only the capturing group. If you want the match and the groups, you could use re.finditer. This is an example of the auto generated code by regex101 https://ideone.com/2M19fp – The fourth bird Jan 12 '20 at 15:47
  • Okay, yeah, I see that now. I've been tinkering and I've run into an error when running this: `TypeError: findall() missing 1 required positional argument: 'string'`. I've replaced the phrase "string" with the actual string I need. Is one of those a command of some sort? – Captain Po-Po Jan 12 '20 at 16:23
  • @CaptainPo-Po [re.findall](https://docs.python.org/3/library/re.html#re.findall) has this function definition `re.findall(pattern, string, flags=0)` so the second parameter is the actual string. – The fourth bird Jan 12 '20 at 17:33
  • Will this method catch a date if it exists in a larger string? – Captain Po-Po Jan 12 '20 at 17:52
  • Okay, That's odd then. I'm not getting any matches even though there should be at least one – Captain Po-Po Jan 12 '20 at 17:55
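The points raised in the comments can be pulled together in a small sketch: a capturing group around the first date, re.findall called with the text as its second argument, and re.finditer when the whole match is wanted as well. The names DATE, pattern, text and first_dates are hypothetical, and the sample text is invented for illustration.

```python
import re

# Hypothetical helper name for the date pattern used in the answers.
DATE = r"\d{2}/\d{2}/\d{2}"

# Only the opening date is wrapped in a capturing group.
pattern = rf"({DATE})(?:(?!{DATE}).)*string(?:(?!string|{DATE}).)*{DATE}"

text = (
    "01/20/20\n\nstring\n\n01/21/20\n\n"
    "notes\n\n"
    "01/22/20\n\nstring\n\n01/23/20"
)

# Argument order is (pattern, string, flags): the text to search goes second.
# Passing only the pattern raises the "missing ... 'string'" TypeError.
first_dates = re.findall(pattern, text, flags=re.DOTALL)
print(first_dates)  # ['01/20/20', '01/22/20']

# finditer yields match objects, giving both the group and the full match.
for m in re.finditer(pattern, text, flags=re.DOTALL):
    print(m.group(1), m.span())
```

With a single capturing group, re.findall returns just that group for every match, which is why only the opening dates come back.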