I have a regex that detects a date of birth in a given paragraph of text.

import re

dob = re.compile(r'(?:\bbirth\b|\bbirth(?:day|date)).{0,20}\n? \b((?:(?<!\:)(?<!\:\d)[0-3]?\d(?:st|nd|rd|th)?\s+(?:of\s+)?(?:jan\.?|january|feb\.?|february|mar\.?|march|apr\.?|april|may|jun\.?|june|jul\.?|july|aug\.?|august|sep\.?|september|oct\.?|october|nov\.?|november|dec\.?|december)|(?:jan\.?|january|feb\.?|february|mar\.?|march|apr\.?|april|may|jun\.?|june|jul\.?|july|aug\.?|august|sep\.?|september|oct\.?|october|nov\.?|november|dec\.?|december)\s+(?<!\:)(?<!\:\d)[0-3]?\d(?:st|nd|rd|th)?)(?:\,)?\s*(?:\d{4})?|\b[0-3]?\d[-\./][0-3]?\d[-\./]\d{2,4})\b',re.IGNORECASE | re.MULTILINE)

data = " Hi This is Goku and my birthday is on 6th Aug but to be clear it is on 1994-08-06."

l = dob.findall(data)

print(l)

Output: ['6th Aug ']

I want to add one more feature: if a date in the format YYYY-MM-DD is present in the text, it should also be returned as a date of birth.

(where YYYY --> 19XX-20XX , MM --> 01-12 , DD --> 01-31)

For example:

data = " Hi This is Goku and my birthday is on 6th Aug but to be clear it is on 1994-08-06."

Then the output should be

output: ['6th Aug ', '1994-08-06']

Where in the regex can I add a part so that it also detects the YYYY-MM-DD format?

  • So basically a regex that detect any dates in a string? – Artog Sep 11 '19 at 11:10
  • Not exactly, I need the output as i mentioned above. – Jin Kazama Sep 11 '19 at 11:14
  • What about this string: `Hi my name is August von Spiff the third, but people call me Aug 3rd. I'm gonna celebrate my birthday on May 4th (2019-05-04), but the actual day is on the 7th of may`? – Artog Sep 11 '19 at 11:29
  • 2019-05-04 this should be detected. And the output should be ['May 4th', '2019-05-04'] – Jin Kazama Sep 11 '19 at 11:32
  • @JinKazama but August von Spiff's birthday is May 7th so why do you want to extract the wrong data? – MonkeyZeus Sep 11 '19 at 12:28
  • It's not about extracting the wrong or the right data (I am not performing any machine learning here ;) ). May 7th won't be extracted because I only consider up to 20 characters of text before the date (see `.{0,20}` in the regex). – Jin Kazama Sep 11 '19 at 12:36
  • Honestly, you have created a very complex regex. May I ask why is it that you cannot figure out how to add the simple `YYYY-MM-DD` format? – MonkeyZeus Sep 11 '19 at 12:55
  • @MonkeyZeus I tried every way I could think of, but I was not able to get the results (including the YYYY-MM-DD format). That's why I asked a question on Stack Overflow, hoping an expert would answer. And I don't think the regex is complex for someone who is good at regex :) – Jin Kazama Sep 11 '19 at 13:10
  • If you wrote that from scratch then I would say that you're at least a "little" good at regex. If you haven't stumbled upon regex visualizers yet then I would like to introduce you to https://regex101.com/ and on the left side you can select Python as your regex flavor. – MonkeyZeus Sep 11 '19 at 13:14
  • At regex101 you should enable the `/x` modifier so that you can break out your regex across multiple lines without telling the engine to search for new lines. See my answer at https://stackoverflow.com/a/57698506/2191572 for an example of how much better `/x` makes things in terms of readability. – MonkeyZeus Sep 11 '19 at 13:16
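In Python the `/x` modifier mentioned above corresponds to the `re.VERBOSE` flag. The following is a minimal sketch of how a pattern can be split across lines and commented with it; the name `iso_after_on` and the choice to anchor on the word "on" are purely illustrative and are not part of the question's regex. Note that in verbose mode unescaped whitespace in the pattern is ignored, so a literal space (the question's pattern contains one after `\n?`) has to be written as `[ ]` or escaped.

```python
import re

# Hypothetical, simplified pattern: the word "on", one space, then YYYY-MM-DD.
iso_after_on = re.compile(r"""
    \bon[ ]                       # literal space; bare spaces are ignored here
    (
        (?:19|20)\d{2}            # YYYY: 1900-2099
        -
        (?:0[1-9]|1[0-2])         # MM: 01-12
        -
        (?:0[1-9]|[12]\d|3[01])   # DD: 01-31
    )
    \b
""", re.VERBOSE | re.IGNORECASE)

data = " Hi This is Goku and my birthday is on 6th Aug but to be clear it is on 1994-08-06."
print(iso_after_on.findall(data))  # ['1994-08-06']
```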

1 Answer

This will detect YYYY-MM-DD (more precisely, any run of digit groups joined by hyphens):

re.search(r'([0-9]+-)+[0-9]+', data).group()

output:

'1994-08-06'
Derek Eden
  • I just need some modifications to my regex that would give the output I showed in the question. Your code is not useful here. Please read my question carefully. – Jin Kazama Sep 11 '19 at 11:45
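One possible way to get the output requested in the question is to add the YYYY-MM-DD form as a second top-level alternative, outside the `birth` keyword prefix, so that it matches anywhere in the text. The sketch below rebuilds the question's pattern from named pieces for readability; the names `MONTH`, `DAY`, `KEYWORD_DATE`, `PREFIX`, `ISO` and `find_dobs` are introduced here for illustration only. Because the combined pattern has two capture groups, `findall` would return tuples, so the sketch uses `finditer` and keeps whichever group matched.

```python
import re

# Fragments copied from the question's pattern.
MONTH = (r'(?:jan\.?|january|feb\.?|february|mar\.?|march|apr\.?|april|may|'
         r'jun\.?|june|jul\.?|july|aug\.?|august|sep\.?|september|'
         r'oct\.?|october|nov\.?|november|dec\.?|december)')
DAY = r'(?<!\:)(?<!\:\d)[0-3]?\d(?:st|nd|rd|th)?'

# Date shapes accepted after the "birth" keyword (unchanged from the question).
KEYWORD_DATE = (r'(?:' + DAY + r'\s+(?:of\s+)?' + MONTH
                + r'|' + MONTH + r'\s+' + DAY + r')'
                + r'(?:\,)?\s*(?:\d{4})?'
                + r'|\b[0-3]?\d[-\./][0-3]?\d[-\./]\d{2,4}')

# Keyword prefix from the question, including its literal space.
PREFIX = r'(?:\bbirth\b|\bbirth(?:day|date)).{0,20}\n? \b'

# New part (an assumption about the intended ranges):
# YYYY in 1900-2099, MM in 01-12, DD in 01-31.
ISO = r'(?:19|20)\d{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])'

dob = re.compile(
    PREFIX + r'(' + KEYWORD_DATE + r')\b'   # group 1: date near the keyword
    + r'|\b(' + ISO + r')\b',               # group 2: YYYY-MM-DD anywhere
    re.IGNORECASE | re.MULTILINE,
)

def find_dobs(text):
    """Return matched dates in order of appearance as a flat list."""
    return [m.group(1) or m.group(2) for m in dob.finditer(text)]

data = " Hi This is Goku and my birthday is on 6th Aug but to be clear it is on 1994-08-06."
print(find_dobs(data))  # ['6th Aug ', '1994-08-06']
```

The ISO alternative does not check calendar validity (for example, 2019-02-31 would still match); if that matters, the matched string could be validated afterwards with `datetime.date.fromisoformat`.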