
I've got a file that has a ton of text in it. Some of it looks like this:

X-DSPAM-Processed: Fri Jan  4 18:10:48 2008
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39771

Author: louis@media.berkeley.edu
Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008)
New Revision: 39771

Modified:
bspace/site-manage/sakai_2-4-x/site-manage-tool/tool/src/bundle/sitesetupgeneric.properties
bspace/site-manage/sakai_2-4-x/site-manage-tool/tool/src/java/org/sakaiproject/site/tool/SiteAction.java
Log:
BSP-1415 New (Guest) user Notification

I need to pull out only dates that follow this pattern:

2008-01-04 18:08:50 -0500

Here's what I tried:

import re

text = open('mbox-short.txt')
for line in text:
    dates = re.compile('\d{4}(?P<sep>[-/])\d{2}(?P=sep)\d{2}\s\d{2}:\d{2}:]\d{2}\s[-/]\d{4}')
    print(dates)

text.close()

The return I got was hundreds of:

\d{4}(?P<sep>[-/])\d{2}(?P=sep)\d{2}\s\d{2}:\d{2}:]\d{2}\s[-/]\d{4}
Brad Solomon
ArchivistG
  • re.compile only compiles the pattern; to search, use dates.search(line) – Andrew Mar 03 '18 at 20:21
  • That didn't seem to work, or I'm not sure what to replace. I'm a beginner in programming, taking a Python class where they don't teach, they just have us "do." I need the output to just be the date strings in a list. – ArchivistG Mar 03 '18 at 20:28
  • @ArchivistG. Do all such dates appear on lines that begin with "Date:"? Because if they do, there is no need to use regexps at all: simple string manipulation is adequate. – ekhumoro Mar 03 '18 at 21:27
  • @ArchivistG. In fact, an even better solution would be to use the [mailbox](https://docs.python.org/3.6/library/mailbox.html#mbox) module, which can parse mbox files. No point trying to reinvent the wheel. – ekhumoro Mar 03 '18 at 21:34

2 Answers

Three things:

First, the regex itself:

regex = re.compile(r'\b\d{4}[-/]\d{2}[-/]\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}\b')

Second, you need to call regex.findall(file), where file is a string holding the file's contents (for example, the result of f.read()):

>>> regex.findall(file)
['2008-01-04 18:08:50 -0500']

re.compile() produces a compiled regular expression object. findall is one of several methods of this object that let you do the actual searching/matching/finding.
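As a rough sketch of the difference between the pattern object and its methods, the snippet below runs `search`, `findall`, and `finditer` on the same input. The `sample` string is an assumption: it is made up from lines shown in the question and stands in for the real file contents.

```python
import re

# Compiling only builds a pattern object; nothing has been searched yet.
regex = re.compile(r'\b\d{4}[-/]\d{2}[-/]\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}\b')

# Hypothetical sample text, adapted from the lines shown in the question.
sample = (
    "Author: louis@media.berkeley.edu\n"
    "Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008)\n"
    "New Revision: 39771\n"
)

first = regex.search(sample)        # first match object, or None if no match
all_dates = regex.findall(sample)   # list of every matching string
spans = [m.span() for m in regex.finditer(sample)]  # (start, end) of each match

print(first.group(0))   # 2008-01-04 18:08:50 -0500
print(all_dates)        # ['2008-01-04 18:08:50 -0500']
print(spans)
```

Printing the pattern object itself, as in the question, only shows the pattern; one of these methods has to be called to get matches.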

Lastly: you're currently using a named capturing group, (?P<sep>[-/]). From your question ("I need to pull out only dates that follow this pattern"), it doesn't seem like you need it. You want to extract the entire match, not capture the separators, which is what capturing groups are designed for. Your pattern also contains a stray ] after the second colon, and the [-/] before the UTC offset should be [-+].
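To illustrate why the capturing group matters here: when a pattern contains a group, `findall` returns the group's text instead of the whole match. The sketch below assumes a repaired version of the question's pattern (stray `]` removed, `[-+]` for the offset) and uses one line from the question as input.

```python
import re

line = "Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008)"

# With one capturing group, findall() returns only that group's text.
grouped = re.compile(
    r'\d{4}(?P<sep>[-/])\d{2}(?P=sep)\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}'
)
print(grouped.findall(line))   # ['-']

# Without groups, findall() returns the whole matched string.
plain = re.compile(r'\d{4}[-/]\d{2}[-/]\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}')
print(plain.findall(line))     # ['2008-01-04 18:08:50 -0500']

# If requiring the same separator twice is important, the group can stay:
# finditer() gives match objects, and group(0) is always the full match.
full_matches = [m.group(0) for m in grouped.finditer(line)]
print(full_matches)            # ['2008-01-04 18:08:50 -0500']
```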

Full code block:

>>> import re
>>> regex = re.compile(r'\b\d{4}[-/]\d{2}[-/]\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}\b')
>>> with open('mbox-short.txt') as f:
...     print(regex.findall(f.read()))
...     
['2008-01-04 18:08:50 -0500']
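If you would rather go through the file line by line, as in the question, and keep the results in a list for later use, something along these lines should work. The helper name `extract_dates` and the `sample_lines` list are assumptions made for illustration; with the real data you would pass the open file object instead.

```python
import re

DATE_RE = re.compile(r'\b\d{4}[-/]\d{2}[-/]\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}\b')

def extract_dates(lines):
    """Collect every matching date string from an iterable of lines."""
    dates = []
    for line in lines:
        # findall() returns a (possibly empty) list for each line.
        dates.extend(DATE_RE.findall(line))
    return dates

# Hypothetical stand-in for the file. With the real data:
#     with open('mbox-short.txt') as f:
#         dates = extract_dates(f)
sample_lines = [
    "X-DSPAM-Processed: Fri Jan  4 18:10:48 2008\n",
    "Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008)\n",
    "New Revision: 39771\n",
]

dates = extract_dates(sample_lines)
print(dates)   # ['2008-01-04 18:08:50 -0500']
```

Compiling the pattern once, outside the loop, avoids rebuilding it for every line.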
Brad Solomon
  • Done. Without asking another formal question, how can I print the outcome, but also store the result into a list for later use? – ArchivistG Mar 03 '18 at 21:35
  • Sure--if you want to just store it for use in the same Python session, just use `my_variable = regex.findall(f.read())`, because `findall()` returns a list. To make it accessible in another session, check out the [pickle module](https://stackoverflow.com/questions/25464295/how-to-pickle-a-list). – Brad Solomon Mar 03 '18 at 21:45
  • Allowing a minus sign or a *slash* before the time zone is almost certainly wrong. You want to allow plus or minus; `[-+]`. I realize you simply copied this from the OP's code but since this is now the accepted answer, it should probably correct this error, too. – tripleee Mar 30 '18 at 17:07
  • RFC5322 defines the `Date:` header format and a few more standard headers. – tripleee Mar 31 '18 at 06:43

Here's another solution.

import re
numberExtractRegex = re.compile(r'(\d\d\d\d[-]\d\d[-]\d\d\s\d\d[:]\d\d[:]\d\d\s[-]\d\d\d\d)')
print(numberExtractRegex.findall('Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008), Date: 2010-01-04 18:08:50 -0500 (Fri, 04 Jan 2010)'))
wolfbagel
  • The unidiomatic parts by themselves are not worth a downvote, but this regex only permits negative UTC offsets in the time zone. The use of character classes where they are not necessary should also dissuade anyone from trusting that this does what it's supposed to. – tripleee Mar 30 '18 at 17:11
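A small sketch of the point about UTC offsets: a pattern that only accepts `-` before the offset silently skips dates with a positive offset, while `[-+]` accepts both. The `text` string is a made-up input with one date of each kind.

```python
import re

# Accepts only a minus sign before the four-digit offset.
negative_only = re.compile(r'\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s-\d{4}')

# Accepts either sign.
either_sign = re.compile(r'\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}')

# Hypothetical input: one negative and one positive UTC offset.
text = "Date: 2008-01-04 18:08:50 -0500, Date: 2008-01-05 09:12:07 +0100"

print(negative_only.findall(text))
# ['2008-01-04 18:08:50 -0500']
print(either_sign.findall(text))
# ['2008-01-04 18:08:50 -0500', '2008-01-05 09:12:07 +0100']
```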