How to count the number of matching files in Python?

Question

I have files that are named as follows:

file-001-001.dat
file-001-002.dat
file-001-003.dat
file-001-004.dat
file-001-005.dat
file-002-001.dat
file-002-002.dat
file-002-003.dat
file-002-004.dat
file-003-001.dat
file-003-002.dat
file-003-003.dat
file-003-004.dat
file-003-005.dat
file-003-006.dat
file-003-007.dat
file-003-008.dat
file-999-010.dat

I am trying to count the files that share the same first number, e.g. the code should report 5 files starting with 001, 4 with 002, ... and 1 with 999.

I have managed to do it with this code, which counts the files in the 'file_count' folder:

from collections import Counter
import numpy as np
import os
import re
data_folders = []
data_files = []
for root, directories, files in sorted(os.walk('./file_count')):
    files = sorted([f for f in files if os.path.splitext(f)[1] in ('.dat,')])
    for file in files:
        data_folders.append(root)
        data_files.append((re.findall(r"[-+]?\d*\.\d+|\d+", file)[-2].zfill(3), re.findall(r"[-+]?\d*\.\d+|\d+", \
            file)[-1].zfill(3), os.path.join(root, file)))
data_folders = np.unique(data_folders)
data_files = sorted(data_files)
a = np.array(data_files)
print a[:, 0]
c = Counter(a[:, 0])
print c['001']

Is there a simpler and more efficient way to do this? Is there a built-in function that can solve it?

Small comment: you probably meant `...in ('.dat',)`, with the comma after the closing quote. Otherwise it is treated as a string instead of a tuple and incorrectly matches files with '.d' or '.da' extensions - lemonhead, Aug 30 '15 at 06:28
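
The difference the comment describes can be checked directly. This is a minimal sketch written for Python 3; the list of extensions is made up for illustration and is not taken from the question.

```python
# Without a comma after the closing quote, the parentheses do nothing:
# this is a plain string, so `in` performs a substring test.
as_string = ('.dat,')

# With the comma after the closing quote this is a one-element tuple,
# so `in` compares against whole elements.
as_tuple = ('.dat',)

# Illustrative extensions only (hypothetical, not from the question).
extensions = ['.dat', '.da', '.d', '', '.txt']

matched_by_string = [e for e in extensions if e in as_string]
matched_by_tuple = [e for e in extensions if e in as_tuple]

print(matched_by_string)
print(matched_by_tuple)
```

Note that the string version also accepts the empty extension, so files with no extension at all would pass the filter.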

score 1 · Answer 1 · edited Nov 01 '20 at 07:27

You can use os.listdir(), which returns the file names as a list of strings.

Next, use re.match and a list comprehension to get the list of number strings you want to group by.

>>> stt = 'file-001-003.dat'
>>> import re
>>> k = re.match(r'.*?-(\d*?)-.*',stt)
>>> k.group(1)
'001'

Finally, use itertools.groupby on the sorted list to count the identical number strings.

See this SO question for groupby: How to count the frequency of the elements in an unordered list?
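
Putting the three steps together, one possible sketch is shown below (written for Python 3). The function name count_by_first_number is hypothetical. The keys are sorted first because itertools.groupby only merges adjacent equal items. In practice the names would come from os.listdir('./file_count'); a hard-coded list is used here so the example runs without that folder.

```python
import itertools
import re


def count_by_first_number(names):
    """Return {first_number: count} for names like 'file-001-003.dat'."""
    keys = []
    for name in names:
        m = re.match(r'.*?-(\d*?)-.*', name)  # same pattern as above
        if m:  # skip names that do not follow the pattern
            keys.append(m.group(1))
    keys.sort()  # groupby needs equal keys to be adjacent
    return {k: len(list(g)) for k, g in itertools.groupby(keys)}


# Hypothetical sample input, deliberately unsorted and with one stray file.
names = ['file-002-001.dat', 'file-001-001.dat',
         'file-001-002.dat', 'file-999-010.dat', 'notes.txt']
counts = count_by_first_number(names)
print(counts)
```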

score 1 · Answer 2 · answered Aug 30 '15 at 06:34

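The following approach should work, provided files is sorted so that names with the same first number are adjacent (itertools.groupby only groups consecutive items):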

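import itertools, os, re
files = sorted(os.listdir('./file_count'))  # assumes the files sit directly in ./file_count
for k, g in itertools.groupby(files, key=lambda x: re.search(r'-(\d+)-', x).group(1)):
    print("%s %d" % (k, len(list(g))))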

For the files listed in the question, this would display:

001 5
002 4
003 8
999 1

Martin Evans


score 0 · Answer 3 · answered Aug 30 '15 at 06:27

As you added the R tag to your question (not sure why), here is a possible R solution:

table(sub('file-([0-9]{3})-[0-9]{3}\\.dat', '\\1', list.files()))

If you also have some other files in the directory, then pass that regular expression as the pattern argument of list.files to list only the related files.

score 0 · Answer 4 · answered Aug 30 '15 at 06:46

How about something like

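import collections
import os
import re

# Raw string, with the dot before "dat" escaped so it only matches a literal "."
file_regex = re.compile(r'^file-(\d{3})-(\d{3})\.dat$')
matches = [file_regex.match(f) for root, dirs, files in \
    os.walk("./file_count") for f in files if file_regex.match(f)]
# Count how many matched files share each first number
c = collections.Counter(match.groups()[0] for match in matches)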
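
Once the Counter is built, the per-number counts can be read out as sketched below (written for Python 3). The hard-coded names list stands in for the os.walk results so that the example runs without the folder; it and the loop variable names are illustrative only.

```python
import collections
import re

file_regex = re.compile(r'^file-(\d{3})-(\d{3})\.dat$')

# Hypothetical stand-in for the file names found by os.walk.
names = ['file-001-001.dat', 'file-001-002.dat',
         'file-003-001.dat', 'readme.txt']

# Keep only the names that match, and count them by first number.
c = collections.Counter(
    m.group(1) for m in map(file_regex.match, names) if m
)

# Print the counts in order of the first number.
for first_number, count in sorted(c.items()):
    print(first_number, count)

# A Counter returns 0 for keys it has never seen, so no KeyError here.
missing = c['002']
```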
