I have files that are named as follows:

  • file-001-001.dat
  • file-001-002.dat
  • file-001-003.dat
  • file-001-004.dat
  • file-001-005.dat

  • file-002-001.dat
  • file-002-002.dat
  • file-002-003.dat
  • file-002-004.dat

  • file-003-001.dat
  • file-003-002.dat
  • file-003-003.dat
  • file-003-004.dat
  • file-003-005.dat
  • file-003-006.dat
  • file-003-007.dat
  • file-003-008.dat

  • file-999-010.dat

I am trying to count how many files share the same first number, e.g. the code should report 5 files starting with 001, 4 with 002, ... and 1 with 999.

I have managed to get it done using this code, which counts the files in the 'file_count' folder:

from collections import Counter
import numpy as np
import os
import re
data_folders = []
data_files = []
for root, directories, files in sorted(os.walk('./file_count')):
    files = sorted([f for f in files if os.path.splitext(f)[1] in ('.dat,')])
    for file in files:
        data_folders.append(root)
        data_files.append((re.findall(r"[-+]?\d*\.\d+|\d+", file)[-2].zfill(3), re.findall(r"[-+]?\d*\.\d+|\d+", \
            file)[-1].zfill(3), os.path.join(root, file)))
data_folders = np.unique(data_folders)
data_files = sorted(data_files)
a = np.array(data_files)
print a[:, 0]
c = Counter(a[:, 0])
print c['001']

Is there a simpler and more efficient way to do this? Is there a built-in function that can solve it?

Tom Kurushingal
  • A small comment: you probably meant `...in ('.dat',)`, with the comma after the closing quote. Otherwise the value is treated as a string instead of a tuple, and the `in` test incorrectly matches files with a '.d' or '.da' extension. – lemonhead Aug 30 '15 at 06:28
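
The difference the comment describes can be checked directly. The following is only a small sketch; the names `as_string`, `as_tuple` and `passes` are made up for illustration and do not appear in the question's code.

```python
# A parenthesised string without a trailing comma is still a plain string,
# so `in` performs a substring test instead of a membership test.
as_string = ('.dat,')   # the str '.dat,'
as_tuple = ('.dat',)    # a one-element tuple

def passes(ext, allowed):
    """Return True if ext passes the `in` test against allowed."""
    return ext in allowed

print(passes('.dat', as_string))  # True: '.dat' is a substring of '.dat,'
print(passes('.d', as_string))    # True: so is '.d'
print(passes('.d', as_tuple))     # False: '.d' is not an element of the tuple
print(passes('', as_string))      # True: the empty string is a substring of any string
```

The last line shows that, with the string form, files without any extension would pass the filter as well.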

4 Answers

You can use os.listdir(), which returns the file names as a list of strings.

Next, use re.match in a list comprehension to extract the number strings you want to group by.

>>> stt = 'file-001-003.dat'
>>> import re
>>> k = re.match(r'.*?-(\d*?)-.*',stt)
>>> k.group(1)
'001'

Finally, use itertools.groupby to count the identical number strings. Note that groupby only groups adjacent items, so sort the list first.

See this SO question for groupby: How to count the frequency of the elements in an unordered list?
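
Putting the three steps together might look like the sketch below. The function name `count_by_prefix` and the sample list are assumptions made for illustration; the regular expression is the one from the snippet above.

```python
import itertools
import re

def count_by_prefix(names):
    """Count file names by the first number found between two hyphens."""
    pattern = re.compile(r'.*?-(\d*?)-.*')
    # Keep only the names that match, and extract the first number.
    keys = [m.group(1) for m in map(pattern.match, names) if m]
    # groupby only merges adjacent equal items, so sort first.
    keys.sort()
    return {k: len(list(g)) for k, g in itertools.groupby(keys)}

# Hypothetical sample input; names that do not match are skipped.
names = ['file-001-001.dat', 'file-002-001.dat', 'file-001-002.dat', 'notes.txt']
print(count_by_prefix(names))
```

For a real directory, the list passed in would come from os.listdir('./file_count') instead of the hard-coded sample.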

Guy Avraham

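The following approach should work, provided the file names are sorted so that names with the same first number are adjacent:

import itertools, os, re
files = sorted(os.listdir('./file_count'))  # groupby only merges adjacent items
for k, g in itertools.groupby(files, key=lambda x: re.search(r'-(\d+)-', x).group(1)):
    print('%s %d' % (k, len(list(g))))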

This would display:

001 5
002 4
003 8
999 1
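
One thing to keep in mind with itertools.groupby is that it starts a new group whenever the key changes, so the order of the input matters. The sketch below illustrates this with a hypothetical, deliberately unsorted list; the helper name `prefix` is made up for the example.

```python
import itertools
import re

def prefix(name):
    """Return the first number between hyphens, e.g. '001'."""
    return re.search(r'-(\d+)-', name).group(1)

# Hypothetical input in which the two '001' files are not adjacent.
unsorted_names = ['file-001-001.dat', 'file-002-001.dat', 'file-001-002.dat']

# Without sorting, '001' shows up as two separate groups of one file each.
runs = [(k, len(list(g)))
        for k, g in itertools.groupby(unsorted_names, key=prefix)]

# After sorting, each prefix forms exactly one group.
sorted_runs = [(k, len(list(g)))
               for k, g in itertools.groupby(sorted(unsorted_names), key=prefix)]

print(runs)
print(sorted_runs)
```
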
Martin Evans

As you added the R tag to your question (not sure why), here is a possible R solution:

table(sub('file-([0-9]{3})-[0-9]{3}\\.dat', '\\1', list.files()))

If you also have some other files in the directory, then pass that regular expression as the pattern argument of list.files to list only the related files.

daroczig

How about something like this:

import collections, os, re

# Raw string, with the dot before "dat" escaped so it only matches a literal "."
file_regex = re.compile(r'^file-(\d{3})-(\d{3})\.dat$')
matches = [file_regex.match(f) for root, dirs, files in \
    os.walk("./file_count") for f in files if file_regex.match(f)]
c = collections.Counter(match.group(1) for match in matches)
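
To read the counts back out, something along these lines could be used. This is a sketch only: the function `count_prefixes` and the sample names are assumptions, and a hard-coded list stands in for the os.walk traversal so the example runs without any files on disk.

```python
import collections
import re

file_regex = re.compile(r'^file-(\d{3})-(\d{3})\.dat$')

def count_prefixes(names):
    """Count matching file names by their first three-digit number."""
    matches = (file_regex.match(n) for n in names)
    return collections.Counter(m.group(1) for m in matches if m)

# Hypothetical sample; 'fileXdat' does not match and is ignored.
c = count_prefixes(['file-001-001.dat', 'file-001-002.dat',
                    'file-999-010.dat', 'fileXdat'])

# Print one "prefix count" line per prefix, in sorted order.
for prefix, n in sorted(c.items()):
    print(prefix, n)

# A Counter returns 0 for keys it has never seen.
print(c['002'])
```
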
lemonhead