pandas - series.str.extract is dropping the first character of the capture group

Question

I need to extract the dates from the following Series:

0 03/25/93 Total time of visit (in minutes):\n
1 6/18/85 Primary Care Doctor:\n
2 sshe plans to move as of 7/8/71 In-Home Servic...
3 7 on 9/27/75 Audit C Score Current:\n
4 2/6/96 sleep studyPain Treatment Pain Level (N...
5 .Per 7/06/79 Movement D/O note:\n
6 4, 5/18/78 Patient's thoughts about current su...
7 10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8 3/7/86 SOS-10 Total Score:\n
9 (4/10/71)Score-1Audit C Score Current:\n
10 (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...
11 4/09/75 SOS-10 Total Score:\n
12 8/01/98 Communication with referring physician...
13 1/26/72 Communication with referring physician...
14 5/24/1990 CPT Code: 90792: With medical servic...
15 1/25/2011 CPT Code: 90792: With medical servic...
16 4/12/82 Total time of visit (in minutes):\n
17 1; 10/13/1976 Audit C Score, Highest/Date:\n

I am trying to do so with the following regex:

df.str.extract('.*(\d{1,4}/\d{1,4}/\d{1,4}).*')

But why is it dropping the first digit of the first number, even though I am specifying {1,4}?

For example, from rows 7 and 17 it should extract '10/24/89' and '10/13/1976' respectively, instead of '0/24/89' and '0/13/1976'.

I also tried adding '?:' at the beginning of the capture group, but that does not work.

Thanks beforehand!

I think you are facing a RegEx "greedy" vs "lazy" problem. https://stackoverflow.com/q/2301285/6361531 — Scott Boston, Dec 15 '20 at 02:57
Wow! That was a fast response, Scott. I only had to add a '?' after the asterisk. Thank you very much! df.str.extract('.*?(\d{1,4}/\d{1,4}/\d{1,4}).*') — erick giraldo, Dec 15 '20 at 03:02
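
To make the greedy vs. lazy difference from the comments concrete, here is a minimal sketch. It assumes the original pattern had a leading `.*`, as the comment about "the asterisk" suggests. The two sample strings are taken from rows 7 and 17 of the question; the variable names are only illustrative.

```python
import pandas as pd

# Two rows from the question where the leading digit was lost.
s = pd.Series([
    "10/24/89 CPT Code: 90801",
    "1; 10/13/1976 Audit C Score",
])

date = r"(\d{1,4}/\d{1,4}/\d{1,4})"

# Greedy: ".*" grabs as much as it can and only gives back the minimum
# needed for the group to match, so the group starts at the last
# possible digit ("0") instead of the first ("1").
greedy = s.str.extract(r".*" + date + r".*")[0]

# Lazy: ".*?" consumes as little as possible, so the group starts at
# the first position where a full date can be matched.
lazy = s.str.extract(r".*?" + date + r".*")[0]

print(greedy.tolist())  # ['0/24/89', '0/13/1976']
print(lazy.tolist())    # ['10/24/89', '10/13/1976']
```

Since `str.extract` searches anywhere in the string, the leading and trailing `.*` are not strictly needed; the bare capture group alone should find the same dates here.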

score 0 · Answer 1 · answered Dec 15 '20 at 03:05


I would put word boundaries around the dates, i.e. use this pattern:

\b(\d+/\d+/\d+)\b

Updated code (note the raw string, so that Python does not interpret \b as a backspace character):

df['date'] = df['col'].str.extract(r'\b(\d+/\d+/\d+)\b')

Here is a regex demo showing that the above pattern works correctly.
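
As a quick check of the word-boundary approach, the following sketch runs the pattern on three rows from the question. The column names `col` and `date` follow the answer's code; the DataFrame itself is built here only for illustration. One caveat worth stressing: the pattern should be written as a raw string, because in a normal Python string literal `\b` is the backspace character rather than a regex word boundary.

```python
import pandas as pd

# Sample rows from the question: a date at the start, a date wrapped
# in parentheses, and a date preceded by another number.
df = pd.DataFrame({"col": [
    "03/25/93 Total time of visit (in minutes):",
    "(4/10/71)Score-1Audit C Score Current:",
    "1; 10/13/1976 Audit C Score, Highest/Date:",
]})

# r"..." keeps \b as a regex word boundary; "\b" without the r prefix
# would be a single backspace character (\x08).
pattern = r"\b(\d+/\d+/\d+)\b"

# expand=False returns a Series for a single capture group.
df["date"] = df["col"].str.extract(pattern, expand=False)

print(df["date"].tolist())  # ['03/25/93', '4/10/71', '10/13/1976']
```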

answered Dec 15 '20 at 03:05

Tim Biegeleisen
