Extract substring from filename in Python?

Question

I have a directory full of files that have date strings as part of the filenames:

file_type_1_20140722_foo.txt
file_type_two_20140723_bar.txt
filetypethree20140724qux.txt

I need to get these date strings from the filenames and save them in an array:

['20140722', '20140723', '20140724']

But they can appear at various places in the filename, so I can't just use substring notation and extract it directly. In the past, the way I've done something similar to this in Bash is like so:

date=$(echo $file | egrep -o '[[:digit:]]{8}' | head -n1)

But I can't use Bash for this because it sucks at math (I need to be able to add and subtract floating point numbers). I've tried glob.glob() and re.match(), but both return empty sets:

>>> dates = [file for file in sorted(os.listdir('.')) if re.match("[0-9]{8}", file)]
>>> print dates
[]

I know the problem is it's looking for complete file names that are eight digits long, but I have no idea how to make it look for substrings instead. Any ideas?

Use `re.search` instead of `match`, and put the digits inside parentheses to get a match group. — Tom Zych, Jul 22 '14 at 18:46
@Batman no, because the numbers are sometimes offset by underscores, and sometimes jammed up next to text. — Jonathan E. Landrum, Jul 22 '14 at 18:47
@TomZych that doesn't give the substring, just the files that have that substring matching the pattern (all of them). — Jonathan E. Landrum, Jul 22 '14 at 18:49

unutbu · Accepted Answer · 2018-04-07T20:43:43.480

6

>>> import re
>>> import os
>>> [date for file in os.listdir('.') for date in re.findall(r"(\d{8})", file)]
['20140722', '20140723']

Note that if a filename has a 9-digit substring, then only the first 8 digits will be matched. If a filename contains a 16-digit substring, there will be 2 non-overlapping matches.
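
The note above can be checked with a small sketch. The two "report" filenames are made up for illustration, and the lookaround pattern is only one possible way to require exactly eight digits; it is an assumption added here, not part of the original answer.

```python
import re

# Hypothetical filenames, chosen only to illustrate the note above.
nine_digits = "report_201407221_foo.txt"
sixteen_digits = "report_2014072220140723_foo.txt"

# \d{8} takes the first eight digits of a longer run.
first_eight = re.findall(r"\d{8}", nine_digits)

# A 16-digit run yields two non-overlapping matches.
two_matches = re.findall(r"\d{8}", sixteen_digits)

# Assumption: if only runs of exactly eight digits should count,
# lookarounds can reject a digit on either side of the match.
exact = re.compile(r"(?<!\d)\d{8}(?!\d)")
strict_nine = exact.findall(nine_digits)
strict_ok = exact.findall("file_type_1_20140722_foo.txt")

print(first_eight)   # ['20140722']
print(two_matches)   # ['20140722', '20140723']
print(strict_nine)   # []
print(strict_ok)     # ['20140722']
```

Whether the stricter pattern is wanted depends on how the files are named; for the filenames in the question, both patterns give the same result.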

edited Apr 07 '18 at 20:43

answered Jul 22 '14 at 18:56

unutbu

Just a note to newcomers to Python... make sure you import the regular expressions engine with `import re`. :) I couldn't upvote because I exhausted my daily vote limit. hehehe – Leniel Maccaferri Apr 07 '18 at 20:23
@LenielMacaferi: Thanks for the improvement. – unutbu Apr 07 '18 at 20:44

score 2 · Answer 2 · answered Jul 22 '14 at 18:49

Your regular expression looks good, but you should be using re.search instead of re.match so that it will search for that expression anywhere in the string:

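import re

filename = "file_type_1_20140722_foo.txt"  # example name from the question
r = re.compile(r"[0-9]{8}")
m = r.search(filename)  # search scans the whole string, not just the start
if m:
    print(m.group(0))  # the matched eight digits, not the full filename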
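
To collect one date per filename, as the question asks, the same search can be run in a loop. This is a minimal sketch that assumes the three example filenames from the question stand in for the directory listing; in practice the list would come from `os.listdir('.')`.

```python
import re

# Example filenames from the question; use os.listdir('.') in practice.
filenames = [
    "file_type_1_20140722_foo.txt",
    "file_type_two_20140723_bar.txt",
    "filetypethree20140724qux.txt",
]

pattern = re.compile(r"[0-9]{8}")

dates = []
for name in filenames:
    m = pattern.search(name)  # first run of eight digits, anywhere in the name
    if m:
        dates.append(m.group(0))

print(dates)  # ['20140722', '20140723', '20140724']
```

Names with no eight-digit run are skipped, so the list can be shorter than the directory listing.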

Andrew Johnson


This gives the full file name, not the substrings – Jonathan E. Landrum Jul 22 '14 at 18:53
I missed the group() part, my bad – Jonathan E. Landrum Jul 22 '14 at 19:14

score 1 · Answer 3 · answered Jul 22 '14 at 18:54

re.match matches from the beginning of the string. re.search matches the pattern anywhere. Or you can try this:

import os
import re

extract_dates = re.compile(r"[0-9]{8}").findall
dates = [found[0] for found in sorted(
    extract_dates(filename) for filename in os.listdir('.')) if found]
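
The difference between `re.match` and `re.search` described above can be seen on one of the question's filenames. This is a small sketch; the variable names are illustrative only.

```python
import re

name = "file_type_1_20140722_foo.txt"  # example filename from the question

# match only tries at position 0, where the name starts with "f", so it fails.
at_start = re.match(r"[0-9]{8}", name)

# search scans forward until the pattern fits somewhere in the string.
anywhere = re.search(r"[0-9]{8}", name)

print(at_start)           # None
print(anywhere.group(0))  # 20140722
```

This is why the list comprehension in the question came back empty: none of the filenames begin with eight digits.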
