I am trying to extract a date from a string using regex in Python 3 (using Anaconda & Jupyter Notebooks on Win10). This seems like a simple task but the pattern I'm trying to find doesn't seem to want to extract the date properly using the string that I need to work with. The pattern appears to work properly for other strings however. Here's my code:

# Trying to extract this text sequence from the test string: 'April 2, 2020 at 12:30PM'.
import re

# Adjacent string literals inside parentheses are joined into one string.
test = (
    'Hampden\n \nUnknown\n \nYes\n \nFemale\n \n30s\n \nSuffolk\n \nYes\n \nYes\n '
    '\n\nPAGE2\nMASSACHUSETTS DEPARTMENT OF PUBLIC HEALTH\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n '
    '\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nData are cumulative and current as of '
    '\nApril \n2\n, 2020 at 12:30PM.\n \n*Other commercial and clinical laboratories continue to come on '
    'line. As laboratory testing results are \nprocessed and the source verified, they will be integrated '
    'into this daily report\n.\n \nL\naboratory\n \nTotal\n \nPatients \nPositive\n \nTotal Patients '
    'Tested\n \nMA State Public Health Laboratory\n \n912\n \n7355\n \nARUP\n \nLaboratories\n \n17\n '
    '\n187\n \nBaystate Medical Center\n \n5\n \n12\n \nBedford \nResearch Foundation\n \n30\n \n219\n '
    '\nBeth Israel Deaconess Medical Center\n \n686\n \n3976\n \nBioReference Laboratories\n \n9\n \n43\n '
    '\nBoston Medical Center\n \n236\n \n647\n \nBROAD Institute CRSP\n \n348\n \n1784\n \nCenters for '
    'Disease Control and Prevention\n \n1\n \n11\n \n\n \n11\n \n195\n \nGenesys Diagnostics\n \n106\n '
    '\n145\n \nLabCorp\n \n715\n \n4836\n \nMayo Clinic Labs\n \n353\n \n1812\n \nPartners Healthcare\n '
    '\n740\n \n4112\n \nQuest Laboratories\n \n4161\n \n27606\n \nSouth Shore Hospital\n \n7\n \n34\n '
    '\nTufts Medical Center\n \n463\n \n2154\n \nUMASS Memorial Medical Center\n \n32\n \n163\n '
    '\nViracor\n \n80\n \n1150\n \nOther\n \n54\n \n167\n \nTotal\n \nPatients Tested*\n \n8966\n '
    '\n56608\n \n'
)

newtest = test.replace("\n", "")
result1 = re.findall(r"Data are cumulative and current as of (.+)\.", newtest, re.IGNORECASE)
print(result1)

# This returns "April 2, 2020 at 12:30PM. *Other commercial and clinical laboratories continue to 
# come on line. As 
# laboratory testing results are processed and the source verified, they will be integrated into this 
# daily report".
# Why does it not find the . after the PM and stop extracting?

# When I attempt to run the pattern again on result1, again, it doesn't seem to find the . after the 
# PM (it extracts until the next period) and
# returns: "April 2, 2020 at 12:30PM. *Other commercial and clinical laboratories continue to come on 
# line".
# What is special about that . after PM that keeps regex from seeing it???
result2 = re.findall(r"(.+)\.", result1[0], re.IGNORECASE)
print(result2)

# If I use the same pattern on a subset of the test string, it sees the . after the PM and returns
# the correct result.  Why???
subtest = 'Data are cumulative and current as of April 2, 2020 at 12:30PM. *Other commercial'
result3 = re.findall(r"Data are cumulative and current as of (.+)\.", subtest, re.IGNORECASE)
print(result3)

I'm guessing I must be doing something stupid but I haven't been able to figure out what it is. Can anyone provide some guidance? I could change the pattern to this (r"Data are cumulative and current as of (.+[P|A]M)") and it works properly on all strings but would still like to know why it seems to be having problems seeing the . after the PM. Appreciate any help.

Sealyons

1 Answer

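With my limited knowledge of regex, I can see that your current pattern is greedy: '.+' matches one or more characters, as many as possible, giving back only as needed. It therefore runs to the end of the string and backtracks only as far as the last '.', not the first. Nothing is special about the '.' after PM; it is simply not the last one in the string. To stop at the first '.', make the quantifier non-greedy (lazy) by adding '?' after the '+':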

result1 = re.findall(r"Data are cumulative and current as of (.+?)\.", newtest, re.IGNORECASE)

print(result1)

['April 2, 2020 at 12:30PM']
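To see the difference in isolation, here is a minimal sketch on a shortened, made-up sentence (the text below is only an illustration, not the original report). It runs the same prefix with a greedy and a lazy quantifier side by side:

```python
import re

# Shortened, hypothetical sample with three periods after the prefix.
text = (
    "Data are cumulative and current as of April 2, 2020 at 12:30PM. "
    "*Other labs come on line. As results are processed, they will be "
    "integrated into this daily report. Total"
)

prefix = r"Data are cumulative and current as of "

# Greedy: '.+' grabs everything, then backtracks to the LAST period.
greedy = re.findall(prefix + r"(.+)\.", text)

# Lazy: '.+?' expands one character at a time and stops at the FIRST period.
lazy = re.findall(prefix + r"(.+?)\.", text)

print(greedy)
print(lazy)
```

The greedy result ends at "daily report", while the lazy result is just the date and time. This also explains why the subset string in the question appeared to work: it contains only one period, so the first and last period are the same one.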

There is an excellent discussion on greedy and lazy regex here
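As a side note on the alternative pattern mentioned in the question: inside a character class, '|' is a literal character rather than alternation, so '[P|A]M' also accepts '|M'. A small sketch, assuming you want to anchor on AM/PM instead of the period (the sample string is a shortened stand-in):

```python
import re

# Shortened, hypothetical sample.
text = "current as of April 2, 2020 at 12:30PM. *Other"

# '[AP]M' matches 'AM' or 'PM'; no '|' is needed inside the brackets.
match = re.search(r"as of (.+?[AP]M)", text)
date_text = match.group(1) if match else None
print(date_text)

# Inside [...], '|' is just another literal character.
pipe_matches_loose = bool(re.fullmatch(r"[P|A]M", "|M"))
pipe_matches_strict = bool(re.fullmatch(r"[AP]M", "|M"))
print(pipe_matches_loose, pipe_matches_strict)
```

Both patterns give the same result on this data; the stricter class just avoids an accidental match on a stray '|'.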

ManojK