How to programmatically count the number of files in an archive using python

Question

In the program I maintain it is done as in:

# count the files in the archive
length = 0
command = ur'"%s" l -slt "%s"' % (u'path/to/7z.exe', srcFile)
ins, err = Popen(command, stdout=PIPE, stdin=PIPE,
                 startupinfo=startupinfo).communicate()
ins = StringIO.StringIO(ins)
for line in ins: length += 1
ins.close()

Is it really the only way? I can't seem to find any other command, but it seems a bit odd that I can't just ask for the number of files.

What about error checking? Would it be enough to modify this to:

proc = Popen(command, stdout=PIPE, stdin=PIPE,
             startupinfo=startupinfo)
out = proc.stdout
# ... count
returncode = proc.wait()
if returncode:
    raise Exception(u'Failed reading number of files from ' + srcFile)

or should I actually parse the output of Popen?

EDIT: interested in 7z, rar, zip archives (that are supported by 7z.exe) - but 7z and zip would be enough for starters

For zip, tar check https://docs.python.org/2/library/zipfile.html and https://docs.python.org/2/library/tarfile.html — Loïc Faure-Lacroix, Jun 29 '15 at 20:19
@LoïcFaure-Lacroix: Thanks - edited. I definitely need 7z... — Mr_and_Mrs_D, Jun 29 '15 at 20:21
Maybe check this out? https://github.com/fancycode/pylzma/blob/master/py7zlib.py py7zlib should be able to read the archive. After that, you could use something similar to zipfile or tarfile to extract the names inside (py7zlib.Archive7z.getnames). — Alex Huszagh, Jun 29 '15 at 20:25

score 15 · Accepted Answer · edited May 23 '17 at 11:47

To count the number of archive members in a zip archive in Python:

#!/usr/bin/env python
import sys
from contextlib import closing
from zipfile import ZipFile

with closing(ZipFile(sys.argv[1])) as archive:
    count = len(archive.infolist())
print(count)

It may use zlib, bz2, lzma modules if available, to decompress the archive.
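Note that infolist() also returns directory entries, if the archive stores any. A minimal sketch of counting only non-directory members, assuming Python 3.6+ (where ZipInfo.is_dir() is available); the function name and the in-memory sample archive are made up for illustration:

```python
import io
from zipfile import ZipFile

def count_regular_files_zip(source):
    """Count zip members that are not directory entries.

    `source` may be a path or a seekable binary file object.
    """
    with ZipFile(source) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())

# Build a small in-memory archive (hypothetical content) to try it out.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "hello")
    archive.writestr("docs/", "")          # explicit directory entry
    archive.writestr("docs/b.txt", "world")
buffer.seek(0)
print(count_regular_files_zip(buffer))
```

With a real archive, pass its path instead of the in-memory buffer.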

To count the number of regular files in a tar archive:

#!/usr/bin/env python
import sys
import tarfile

with tarfile.open(sys.argv[1]) as archive:
    count = sum(1 for member in archive if member.isreg())
print(count)

It may support gzip, bz2 and lzma compression depending on version of Python.
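As a rough illustration of the transparent compression handling, here is a sketch that builds a small gzip-compressed tar in memory and counts its regular files without naming the compression when reading. The helper names and the sample members are hypothetical, and Python 3 is assumed:

```python
import io
import tarfile

def count_regular_files_tar(fileobj):
    """Count regular files in a tar archive read from a binary file object."""
    # "r:*" lets tarfile detect the compression (if any) by itself.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        return sum(1 for member in archive if member.isreg())

def make_sample_tar_gz():
    """Return an in-memory .tar.gz with one directory and two files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        folder = tarfile.TarInfo("docs")
        folder.type = tarfile.DIRTYPE      # directory member, not a regular file
        archive.addfile(folder)
        for name, data in [("a.txt", b"hello"), ("docs/b.txt", b"world")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer

print(count_regular_files_tar(make_sample_tar_gz()))
```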

You could find a 3rd-party module that provides similar functionality for 7z archives.

To get the number of files in an archive using the 7z utility:

import os
import re  # needed for re.search() below
import subprocess

def count_files_7z(archive):
    s = subprocess.check_output(["7z", "l", archive], env=dict(os.environ, LC_ALL="C"))
    return int(re.search(br'(\d+)\s+files,\s+\d+\s+folders$', s).group(1))

Here's a version that may use less memory if there are many files in the archive:

import os
import re
from subprocess import Popen, PIPE, CalledProcessError

def count_files_7z(archive):
    command = ["7z", "l", archive]
    p = Popen(command, stdout=PIPE, bufsize=1, env=dict(os.environ, LC_ALL="C"))
    with p.stdout:
        for line in p.stdout:
            if line.startswith(b'Error:'): # found error
                error = line + b"".join(p.stdout)
                raise CalledProcessError(p.wait(), command, error)
    returncode = p.wait()
    assert returncode == 0
    return int(re.search(br'(\d+)\s+files,\s+\d+\s+folders', line).group(1))

Example:

import sys

try:
    print(count_files_7z(sys.argv[1]))
except CalledProcessError as e:
    getattr(sys.stderr, 'buffer', sys.stderr).write(e.output)
    sys.exit(e.returncode)

To count the number of lines in the output of a generic subprocess:

from functools import partial
from subprocess import Popen, PIPE, CalledProcessError

p = Popen(command, stdout=PIPE, bufsize=-1)
with p.stdout:
    read_chunk = partial(p.stdout.read, 1 << 15)
    count = sum(chunk.count(b'\n') for chunk in iter(read_chunk, b''))
if p.wait() != 0:
    raise CalledProcessError(p.returncode, command)
print(count)

It supports unlimited output.

Could you explain why bufsize=-1 (as opposed to bufsize=1 in your previous answer: stackoverflow.com/a/30984882/281545)

bufsize=-1 means use the default I/O buffer size instead of bufsize=0 (unbuffered), which is the default on Python 2, so it is a performance boost there. It is the default on recent Python 3 versions. On some earlier Python 3 versions, where the default had not yet been changed to bufsize=-1, you might get a short read (lose data).

This answer reads in chunks and therefore the stream is fully buffered for efficiency. The solution you've linked is line-oriented. bufsize=1 means "line buffered". There is minimal difference from bufsize=-1 otherwise.

and also what the read_chunk = partial(p.stdout.read, 1 << 15) buys us?

It is equivalent to read_chunk = lambda: p.stdout.read(1<<15) but provides more introspection in general. It is used to implement wc -l in Python efficiently.
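The chunked counting idiom can be tried without a subprocess. A small sketch on an in-memory stream (the function name and sample bytes are made up for illustration); iter(callable, sentinel) keeps calling read_chunk until it returns b'', i.e. until EOF:

```python
import io
from functools import partial

def count_newlines(stream, chunk_size=1 << 15):
    """Count b'\\n' bytes in a binary stream, reading fixed-size chunks."""
    read_chunk = partial(stream.read, chunk_size)
    # iter() with a sentinel stops as soon as read_chunk() returns b'' (EOF).
    return sum(chunk.count(b"\n") for chunk in iter(read_chunk, b""))

print(count_newlines(io.BytesIO(b"one\ntwo\nthree\n")))
```

Only one chunk is held in memory at a time, so the size of the output does not matter.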

Hey thanks! Could you explain why bufsize=-1 (as opposed to bufsize=1 in your previous answer: http://stackoverflow.com/a/30984882/281545) - and also what the `read_chunk = partial(p.stdout.read, 1 << 15)` buys us? Really this `bufsize` is a mystery to me (and to my google attempts). Meanwhile since I already have `7z.exe` bundled (and I would like to have the exact error displayed) I think I will go with my answer (except if I did anything blatantly stupid) — Mr_and_Mrs_D, Jun 30 '15 at 14:05
@Mr_and_Mrs_D: you should probably ask about the error handling in `7z.exe` as a separate question: include the following: does `7z` provide a rich set of exit codes to indicate various errors like e.g., [`zip` utility does](http://linux.die.net/man/1/zip)? Does `7z` print its error messages to stderr or does it mix them with the archive member list in the stdout? — jfs, Jun 30 '15 at 15:40
Will do when I find some time and be sure to mention you - thanks :) - the exit codes: http://sevenzip.osdn.jp/chm/cmdline/exit_codes.htm — Mr_and_Mrs_D, Jun 30 '15 at 15:44
@Mr_and_Mrs_D: I've added code example that shows how to get number of files using 7z utility while collecting the error message if necessary. — jfs, Jun 30 '15 at 17:34
E-xce-llent (and that's the gorilla Hettinger speaks of in the video - I was matching with regexes and all that instead of simply parsing the last line - it occurred to me of course but was busy with my regexes). I was going to use mine (no time to test) but _I simply can't resist correctness_ - will use the error checking version (no time to check check_output) - last question - would I need to `-scsUTF-8 -sccUTF-8` and use u'' or should I take it as it is ? Quick tests suggest that unicode names in the archive do not make a difference but still... — Mr_and_Mrs_D, Jun 30 '15 at 23:27
@Mr_and_Mrs_D: all the code should work as is i.e., no `-scsUTF-8 -sccUTF-8` is necessary. Note: `check_output()`-based version may use more memory than `count_files_7z()` with `Popen()` but the error handling is the same -- you can run the example with both `count_files_7z()` implementations -- though the 2nd variant does not store the output until an error has been encountered (that is why it uses less memory). — jfs, Jun 30 '15 at 23:37
Hi :) I just saw you added the `LC_ALL="C"` in env - why is that ? — Mr_and_Mrs_D, May 08 '17 at 17:24
@Mr_and_Mrs_D: otherwise you might get the messages in another language (depending on your locale) and the regex that uses English words "files", "folders" may fail. — jfs, May 08 '17 at 18:11
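To make the dependence on the English wording concrete, here is a sketch that isolates the summary-line parsing. The sample line has the shape quoted elsewhere on this page; the function name and the explicit ValueError are assumptions added for illustration, not part of the answer's code:

```python
import re

# Same pattern as in count_files_7z(): it relies on the English words.
SUMMARY_RE = re.compile(br"(\d+)\s+files,\s+\d+\s+folders")

def parse_file_count(last_line):
    """Extract the file count from a '<n> files, <m> folders' summary line."""
    match = SUMMARY_RE.search(last_line)
    if match is None:
        # This is what happens if the summary is worded differently.
        raise ValueError("no summary found in %r" % (last_line,))
    return int(match.group(1))

print(parse_file_count(b"        3534900       325332  75 files, 29 folders\n"))
```

Raising a clear error is friendlier than the AttributeError that `re.search(...).group(1)` produces when nothing matches.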

score 1 · Answer 2 · edited May 23 '17 at 11:54

Since I already have 7z.exe bundled with the app and I surely want to avoid a third party lib, while I do need to parse rar and 7z archives I think I will go with:

regErrMatch = re.compile(u'Error:', re.U).match # needs more testing
r"""7z list command output is of the form:
   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2015-06-29 21:14:04 ....A       <size>               <filename>
where ....A is the attribute value for normal files, ....D for directories
"""
reFileMatch = re.compile(ur'(\d|:|-|\s)*\.\.\.\.A', re.U).match

def countFilesInArchive(srcArch, listFilePath=None):
    """Count all regular files in srcArch (or only the subset in
    listFilePath)."""
    # https://stackoverflow.com/q/31124670/281545
    command = ur'"%s" l -scsUTF-8 -sccUTF-8 "%s"' % ('compiled/7z.exe', srcArch)
    if listFilePath: command += u' @"%s"' % listFilePath
    proc = Popen(command, stdout=PIPE, startupinfo=startupinfo, bufsize=-1)
    length, errorLine = 0, []
    with proc.stdout as out:
        for line in iter(out.readline, b''):
            line = unicode(line, 'utf8')
            if errorLine or regErrMatch(line):
                errorLine.append(line)
            elif reFileMatch(line):
                length += 1
    returncode = proc.wait()
    # format the archive name into the message (the '%s' was left unfilled)
    if returncode or errorLine: raise StateError(u'%s: Listing failed\n' % srcArch +
        u'7z.exe return value: ' + str(returncode) +
        u'\n' + u'\n'.join([x.strip() for x in errorLine if x.strip()]))
    return length

Error checking as in Python Popen - wait vs communicate vs CalledProcessError by @JFSebastien

My final(ish) version, based on the accepted answer - unicode may not be needed, kept it for now as I use it everywhere. Also kept the regex (which I may expand, I have seen things like re.compile(u'^(Error:.+|.+ Data Error?|Sub items Errors:.+)',re.U)). Will have to look into check_output and CalledProcessError.

def countFilesInArchive(srcArch, listFilePath=None):
    """Count all regular files in srcArch (or only the subset in
    listFilePath)."""
    command = [exe7z, u'l', u'-scsUTF-8', u'-sccUTF-8', srcArch]
    if listFilePath: command += [u'@%s' % listFilePath]
    proc = Popen(command, stdout=PIPE, stdin=PIPE, # stdin needed if listFilePath
                 startupinfo=startupinfo, bufsize=1)
    errorLine = line = u''
    with proc.stdout as out:
        for line in iter(out.readline, b''): # consider io.TextIOWrapper
            line = unicode(line, 'utf8')
            if regErrMatch(line):
                errorLine = line + u''.join(out)
                break
    returncode = proc.wait()
    msg = u'%s: Listing failed\n' % srcArch.s
    if returncode or errorLine:
        msg += u'7z.exe return value: ' + str(returncode) + u'\n' + errorLine
    elif not line: # should not happen
        msg += u'Empty output'
    else: msg = u''
    if msg: raise StateError(msg) # consider using CalledProcessError
    # number of files is reported in the last line - example:
    #                                3534900       325332  75 files, 29 folders
    return int(re.search(ur'(\d+)\s+files,\s+\d+\s+folders', line).group(1))

Will edit this with my findings.

you could use `for line in out:` here or better `for line in io.TextIOWrapper(out, encoding='utf-8'):` (to decode bytes to Unicode and to enable the universal newlines mode). Don't use `if len(container)`, use `if container` instead (empty containers are False in Python). `line.startswith('Error:')` could be used instead of the `regErrMatch` regex. Are you sure `7z` prints its errors to stdout (it is unfortunate)? Please, [follow pep-8 naming conventions unless you have a specific reason not to](https://www.python.org/dev/peps/pep-0008/#naming-conventions). — jfs, Jun 30 '15 at 16:19
Yes 7z prints its output in stdout (...) - TextIOWrapper I will have a look. regErrMatch: I may need to elaborate on the regular expression for the errors. PEP8 - it's legacy code, slowly PEP8 'ing it (see also: https://www.youtube.com/watch?v=wf-BqAjZb8M - although 79 chars, I am fully in agreement) — Mr_and_Mrs_D, Jun 30 '15 at 16:26
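A minimal sketch of the io.TextIOWrapper suggestion, run on an in-memory byte stream instead of proc.stdout. The sample output and the function name are made up for illustration; real 7z output may differ:

```python
import io

def find_error_lines(binary_stream):
    """Decode a UTF-8 byte stream line by line and collect 'Error:' lines."""
    # The default newline=None enables universal newlines: '\r\n' becomes '\n'.
    text = io.TextIOWrapper(binary_stream, encoding="utf-8")
    return [line.rstrip("\n") for line in text if line.startswith("Error:")]

# Hypothetical listing output with Windows line endings and a non-ASCII name.
sample = "Listing archive: d\u00e9mo.7z\r\nError: cannot open\r\n".encode("utf-8")
print(find_error_lines(io.BytesIO(sample)))
```

With a subprocess, the same wrapper would go around proc.stdout.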
