Capture date from Excel sheet with RegEx

Question

I know capturing a date is usually a simple enough RegEx task, but I need this to be so specific that I'm struggling.

1 SUSTAINABLE HARVEST SECTOR | QUOTA LISTING JUN 11 2013 2 QUOTA 3 TRADE ID AVAILABLE STOCK AMOUNT PRICE 4 130196 COD GBW 10,000 $0.60 5 130158 HADDOCK GBE 300 $0.60

That is what the beginning of my Excel spreadsheet looks like, and what 100's more look like, with the date and the data changing but the format staying the same.

My thoughts were to capture everything that follows LISTING up until the newline... then place the non numbers (JUN) in my Trade Month column, place the first captured number (11) in my Trade Day column, and place the last captured number (2013) in my Trade Year column... but I can't figure out how to. Here's what I have so far:

pattern = re.compile(r'Listing(.+?)(?=\n)')
df = pd.read_excel(file_path)
        print("df is:", df)
        a = pattern.findall(str(df))
        print("a:", a)

but that returns nothing. Any help solving this problem, which I know is probably super simple, is appreciated. Thanks.

score 1 · Accepted Answer · answered Mar 01 '17 at 20:47

1

Make your expression case insensitive (ie LISTING != Listing):

pattern = re.compile(r'Listing(.+?)(?=\n)', re.IGNORECASE)

Besides, a lookahead for a newline in this situation comes down to the equal expression:

pattern = re.compile(r'Listing(.+)', re.IGNORECASE)

See your working pattern here.

answered Mar 01 '17 at 20:47

Jan

42,290
8
54
79

Ok so that does what I asked perfectly, and I'll 'check off' your answer when I can in 7minutes. But while I have you here: now that I have captured `a: [' JUN 11 2013 Unnamed: 1 \\']` is it easy enough to now isolate and capture/export the `JUN`, `11`, and `2013` pieces individually? – theprowler Mar 01 '17 at 20:51
2

Instead of regex, you can use `datetime.strptime` which will give you a datetime object, which understands what parts are months etc. See http://stackoverflow.com/questions/466345/converting-string-into-datetime – Alan Mar 01 '17 at 20:53
1

@theprowler: It is [**possible**](https://regex101.com/r/pzQVuT/2) but as Alan points out, there are better ways to achieve this. – Jan Mar 01 '17 at 20:55
Ohhhhh damn I didn't even know you could do that. I thought since I didn't have a full month name that I couldn't do something like that since trying `parser.parse()` from `dateutil` failed. Thanks guys, I appreciate you helping a beginner like myself – theprowler Mar 01 '17 at 21:01
Umm real quick: I got `ValueError: time data "[' JUN 11 2013 Unnamed: 1 \\\\']" does not match format '%b %d %Y %I:%M%p'` when trying that `strptime()` function. Should I fix the RegEx to capture less, or alter the `%`s as the second argument? – theprowler Mar 01 '17 at 21:05
1

@theprowler: Where does the `Unnamed` come from? Please post a follow-up question and link from here to it. Comments section is kind of nasty. – Jan Mar 01 '17 at 21:18
I just posted the new question: http://stackoverflow.com/questions/42558338/trouble-using-datetime-strptime Also, I'm not sure what `Unnamed` is or why it gets captured, it's not in the excel file – theprowler Mar 02 '17 at 14:41

Capture date from Excel sheet with RegEx

1 Answers1