Need help debugging Regex

I have a string column in a pandas DataFrame that contains dates in the formats listed below. Each string contains exactly one such date.

Semicolons are used here only to delimit the dates; they are not present in the actual strings.
04/20/2009; 04/20/09; 4/20/09; 4/3/09; 011/14/83;
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010

My job is to extract these using regex. Here is the pattern I came up with.

my_pattern = r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?[, -./]{0,2}(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)[, -./]{1,2}(\d{2,4}))|(\d{4})"

sample_series.str.extract(my_pattern, expand=False)

So far, it works for every date format except "Jan 27, 1983": there it matches the month name and the day, but not the year. I am relatively new to regex, and I suspect my pattern design is poor. I need help figuring out what is wrong with my regex and how I could debug or improve it. Thanks.

Here is the sample data to make the problem reproducible.

import pandas as pd

sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
       '.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
       '4-13-89 Communication with referring physician?: Not Done\n',
       '7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 @ 12 AM\n',
       '.  Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
       '1-14-81 Communication with referring physician?: Done\n',
       '. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
       '09/14/2000 CPT Code: 90792: With medical services\n',
       '. Sep 2015- Transferred to Memorial Hospital from above.  Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
       'Born and raised in Fowlerville, IN.  Parents divorced when she was young, states that it was a "bad" divorce.  Received her college degree from Allegheny College in 2003.  Past verbal, emotional, physical, sexual abuse: No\n']
sample_series = pd.Series(sample_list)
    Take a look at the post [Regex to validate date format dd/mm/yyyy with Leap Year Support](https://stackoverflow.com/questions/15491894/regex-to-validate-date-format-dd-mm-yyyy-with-leap-year-support/65999465#65999465) for a rich set of regex supporting quite a lot of date formats. You can try modifying the regex there. – SeaBean Sep 30 '21 at 06:53

1 Answer

From your data:

>>> import pandas as pd 

>>> sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
       '.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
       '4-13-89 Communication with referring physician?: Not Done\n',
       '7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 @ 12 AM\n',
       '.  Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
       '1-14-81 Communication with referring physician?: Done\n',
       '. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
       '09/14/2000 CPT Code: 90792: With medical services\n',
       '. Sep 2015- Transferred to Memorial Hospital from above.  Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
       'Born and raised in Fowlerville, IN.  Parents divorced when she was young, states that it was a "bad" divorce.  Received her college degree from Allegheny College in 2003.  Past verbal, emotional, physical, sexual abuse: No\n']
>>> sample_series = pd.Series(sample_list)
>>> df = sample_series.to_frame()
>>> df
    0
0   .Got back to U.S. Jan 27, 1983.\n
1   .On 21 Oct 1983 patient was discharged from Sc...
2   4-13-89 Communication with referring physician...
3   7intake for follow up treatment at Anson Gener...
4   . Pt diagnosed in Apr 1976 after he presented...
5   1-14-81 Communication with referring physician...
6   . Went to Emerson, in Newfane Alaska. Started ...
7   09/14/2000 CPT Code: 90792: With medical servi...
8   . Sep 2015- Transferred to Memorial Hospital f...
9   Born and raised in Fowlerville, IN. Parents d...

We can use the datefinder package to find the dates in each row:

>>> import datefinder
>>> def find_date(row):
...     # Collect every datetime that datefinder detects in this row's text.
...     return list(datefinder.find_dates(row[0]))
...
>>> df["Vals"] = df.apply(find_date, axis=1)
>>> df
    0                                                   Vals
0   .Got back to U.S. Jan 27, 1983.\n                   [1983-01-27 00:00:00]
1   .On 21 Oct 1983 patient was discharged from Sc...   [1983-10-21 00:00:00]
2   4-13-89 Communication with referring physician...   [1989-04-13 00:00:00]
3   7intake for follow up treatment at Anson Gener...   []
4   . Pt diagnosed in Apr 1976 after he presented...    [1976-04-30 00:00:00, 2021-09-02 00:00:00, 202...
5   1-14-81 Communication with referring physician...   [1981-01-14 00:00:00]
6   . Went to Emerson, in Newfane Alaska. Started ...   [2002-09-30 00:00:00]
7   09/14/2000 CPT Code: 90792: With medical servi...   [2000-09-14 00:00:00]
8   . Sep 2015- Transferred to Memorial Hospital f...   [2015-09-30 00:00:00]
9   Born and raised in Fowlerville, IN. Parents d...    [2003-09-30 00:00:00]
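
Note that in the output above row 3 comes back empty and row 4 contains extra matches, so a regex may still be worth having. On the regex question itself, a short diagnostic with the standard-library `re` module suggests why "Jan 27, 1983" loses its year. The search finds a match that starts at the `.` before `Jan`: the optional first group is skipped, the leading `[, -./]{0,2}` consumes `. `, `Jan` is taken by the second month alternative, and `27` ends up in the year group `(\d{2,4})`. Because the engine returns the leftmost match, the better match starting at `Jan` is never tried. The sketch below reproduces this and then shows one possible rewrite as an ordered list of simpler alternatives. The names `MONTH`, `ORDINAL`, `ALTERNATIVES`, `DATE_RE`, and `extract_date` are illustrative and not from the original post, and the `011/14/83` variant is deliberately not covered.

```python
import re

# The pattern from the question, unchanged.
my_pattern = r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?[, -./]{0,2}(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)[, -./]{1,2}(\d{2,4}))|(\d{4})"

text = ".Got back to U.S. Jan 27, 1983.\n"
m = re.search(my_pattern, text)
# Inspect where the match starts and which groups were filled.
print(m.span(), repr(m.group(0)), m.groups())

# One possible rewrite: simple alternatives, most specific first.
# All names below are illustrative and not part of the original post.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
ORDINAL = r"(?:st|nd|rd|th)?"
ALTERNATIVES = [
    r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}",                   # 04/20/2009, 4-13-89
    MONTH + r"[ -]\d{1,2}" + ORDINAL + r",?[ -]\d{4}",  # Mar 20, 2009; Mar-20-2009
    r"\d{1,2} " + MONTH + r",? \d{4}",                  # 20 Mar 2009; 20 March, 2009
    MONTH + r" \d{4}",                                  # Feb 2009
    r"\d{1,2}/\d{4}",                                   # 6/2008
    r"\d{4}",                                           # 2009
]
# (?<!\w) stops a match from starting mid-word or mid-number;
# (?!\d) stops a year from being cut out of a longer number.
DATE_RE = re.compile(r"(?<!\w)(" + "|".join(ALTERNATIVES) + r")(?!\d)")


def extract_date(s):
    """Return the first date-like substring of s, or None if there is none."""
    match = DATE_RE.search(s)
    return match.group(1) if match else None


print(extract_date(text))
```

With pandas, `sample_series.str.extract(DATE_RE.pattern, expand=False)` should give one extracted string per row, since the rewritten pattern has a single capture group. Like the original, this sketch accepts any lowercase letters after a month prefix, so it is a starting point to test against the real data rather than a validated date parser.
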
tlentali