I am working on a large batch of text strings, trying to match date times and convert them to MM-DD-YYYY format using the strptime() function.

However, some 5-digit serial numbers appear in the texts (e.g., 90481), and they have misled my .findall() call into treating them as date times. How can I exclude them with a ^()-type condition?

What they have in common is that they are all 5 digits long, so I tried ^(?!\d{5}), but it didn't work. What's the best way to handle this set of numbers?

Thank you.

Note1: I have read this post, but can't seem to get it.

Note2: about the date formats that someone asked about in the comment section

There are many date formats in the data frame I am working on, for example:

05/10/2001; 05/10/01; 5/10/09; 6/2/01
May-10-2001; May 10, 2010; March 25, 2001; Mar. 25, 2001; Mar 25 2001;
25 Mar 2001; 25 March 2001; 25 Mar. 2001; 25 March, 2001
Mar 25th, 2001; Mar 25th, 2001; Mar 12nd, 2001
Feb 2001; Sep 2001; Oct 2001
5/2001; 11/2001
2001; 2015

So I have a rather long .findall(r' ') call, but the main point is to keep those 5-digit serial numbers from being selected.

Sincerely,

Chris T.
  • How does your findall work in the first place? Please post the full regex. – Willem Van Onsem Aug 05 '17 at 16:10
  • I'll add that to the original question thread. – Chris T. Aug 05 '17 at 16:11
  • If you could explain in plain English what exactly you need to match, I would be better able to help you. Date times can be written in many formats, so not knowing exactly what you are working with makes it hard. – kpie Aug 05 '17 at 16:12
  • The regex doesn't seem to match 90481 – marvel308 Aug 05 '17 at 16:15
  • I have added my (rather simple) code to the original thread. I am just trying to avoid those 5-digit serial numbers so that Python won't treat them as date times. – Chris T. Aug 05 '17 at 16:15

1 Answer

You could use \b in your regex so that a match cannot start or end in the middle of a longer number. Place one at the start and one at the end, and make sure they fall outside the scope of the | (OR) operation by wrapping the rest in a non-capturing group.

I removed some months to keep it short:

\b(?:\d{1,2}\/\d{1,2}\/\d{2,4}|(?:Jan|Feb|Mar|Apr|   |Nov|Dec)[a-z]*-\d{2}-\d{2,4})\b
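As a rough illustration of the effect, here is a minimal Python sketch. It is not the asker's actual regex: the month list is spelled out in full, the year-only branch (`\d{4}`) is an assumption added to show how a 5-digit serial number could be partially matched, and the sample text is made up.

```python
import re

# Month abbreviations written out in full for this sketch.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Illustrative alternatives only. The year-only branch (\d{4}) is an
# assumption; it is what lets "9048" be picked out of "90481".
ALTERNATIVES = (
    r"\d{1,2}/\d{1,2}/\d{2,4}"
    r"|(?:" + MONTHS + r")[a-z]*-\d{2}-\d{2,4}"
    r"|\d{4}"
)

# Same alternatives, without and with word boundaries around the group.
UNBOUNDED = re.compile(ALTERNATIVES)
BOUNDED = re.compile(r"\b(?:" + ALTERNATIVES + r")\b")

# Hypothetical sample text containing a 5-digit serial number.
text = "Serial 90481: seen 5/10/09, again May-10-2001, follow-up in 2015."

unbounded_matches = UNBOUNDED.findall(text)
bounded_matches = BOUNDED.findall(text)

print(unbounded_matches)  # ['9048', '5/10/09', 'May-10-2001', '2015']
print(bounded_matches)    # ['5/10/09', 'May-10-2001', '2015']
```

Because the group is non-capturing, `findall` returns the whole matched string for each hit rather than a tuple of groups.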
trincot
  • This works perfectly! Thank you so much. Do you mind me asking how (and why) \b( )\b works in this context? – Chris T. Aug 05 '17 at 16:42
  • `\b` matches a break between a sequence of alphanumerical characters and non-alphanumerical characters (it does not match a character, just the fact that there is a break in the sequence). So when the first character of your match is supposed to be a digit, the first `\b` requires that there is no digit (or letter or underscore) preceding that matched character. A similar thing happens at the end. – trincot Aug 05 '17 at 16:46
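
The boundary behaviour described in that last comment can be checked with a small Python sketch. The four-digit pattern and the sample strings are made up for illustration and are not taken from the thread.

```python
import re

# A four-digit number that must not touch another digit, letter or underscore.
year = re.compile(r"\b\d{4}\b")

# Hypothetical sample strings.
samples = ["in 2001.", "(2001)", "90481", "x2001", "id_2001", "2001_a"]
results = {s: year.findall(s) for s in samples}

for s, found in results.items():
    print(repr(s), found)

# \b consumes no characters: the match is empty and sits at position 0.
first_boundary = re.search(r"\b", "2001").span()
print(first_boundary)  # (0, 0)
```

Only "in 2001." and "(2001)" produce a match. In the other strings the four digits are adjacent to a digit, a letter, or an underscore, so at least one of the two `\b` assertions fails.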