
I am using the Popen function from the subprocess module to execute a command-line tool:

subprocess.Popen(args, bufsize=0, executable=None, stdin=None, stdout=None, stderr=None, preexec_fn=None, close_fds=False, shell=False, cwd=None, env=None, universal_newlines=False, startupinfo=None, creationflags=0)

The tool I am using takes a list of files that it then processes. In some cases, this list of files can be very long. Is there a way to find the max length that the args parameter can be? With a large number of files being passed to the tool, I am getting the following error:

Traceback (most recent call last):
  File "dump_output_sopuids.py", line 68, in <module>
    uid_map = create_sopuid_to_path_dict_dcmdump(dicom_files)
  File "dump_output_sopuids.py", line 41, in create_sopuid_to_path_dict_dcmdump
    dcmdump_output = subprocess.Popen(cmd,stdout=subprocess.PIPE).communicate(0)[0]
  File "c:\python26\lib\subprocess.py", line 621, in __init__
    errread, errwrite)
  File "c:\python26\lib\subprocess.py", line 830, in _execute_child
    startupinfo)
WindowsError: [Error 206] The filename or extension is too long

Is there a general way to find this max length? I found the following article on MSDN: "Command prompt (Cmd.exe) command-line string limitation", but I don't want to hard-code the value. I would rather get the value at run time so I can break up the command into multiple calls.

I am using Python 2.6 on Windows XP 64.

Edit: adding code example

paths = ['file1.dat','file2.dat',...,'fileX.dat']
cmd = ['process_file.exe','+p'] + paths
cmd_output = subprocess.Popen(cmd,stdout=subprocess.PIPE).communicate(0)[0]

The problem occurs because each actual entry in the paths list is usually a very long file path AND there are several thousand of them.

I don't mind breaking up the command into multiple calls to process_file.exe. I am looking for a general way to get the max length that args can be so I know how many paths to send in for each run.

Jesse Vogt
  • could you provide an example value of what you provide for args? – gurney alex Mar 04 '10 at 17:56
  • I'm quite late to the party but I want to add that I got the same error due to my PATH environment variable becoming too long after adding many entries. – RedX Dec 21 '15 at 12:58

2 Answers


If you're passing shell=False, then Cmd.exe does not come into play.

On Windows, subprocess uses the CreateProcess function from the Win32 API to create the new process. The documentation for this function states that the second argument, lpCommandLine (which is built by subprocess.list2cmdline), has a max length of 32,768 characters, including the Unicode terminating null character. If lpApplicationName is NULL, the module name portion of lpCommandLine is limited to MAX_PATH characters.

Given your example, I suggest providing a value for executable (args[0]) and using args for the first parameter. If my reading of the CreateProcess documentation and of the subprocess module source code is correct, this should solve your problem.

[edit: removed the args[1:] bit after getting my hands on a windows machine and testing]
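As a rough sketch of the batching the question asks about, the snippet below splits a list of paths so that each resulting command line stays within the 32,768-character limit quoted above. It measures the command line with subprocess.list2cmdline, the helper named above, so the quotes and spaces it adds are counted. The names WINDOWS_CMDLINE_MAX and chunk_for_createprocess are hypothetical, and the limit is still a constant taken from the CreateProcess documentation rather than a value queried from the OS.

```python
import subprocess

# Assumption: the documented CreateProcess limit for lpCommandLine,
# including the terminating null character.
WINDOWS_CMDLINE_MAX = 32768

def chunk_for_createprocess(base_cmd, paths, limit=WINDOWS_CMDLINE_MAX):
    """Yield lists of paths so that base_cmd + chunk fits within `limit`."""
    def fits(args):
        # +1 leaves room for the terminating null character
        return len(subprocess.list2cmdline(base_cmd + args)) + 1 <= limit

    chunk = []
    for path in paths:
        if fits(chunk + [path]):
            chunk.append(path)
            continue
        if not fits([path]):
            raise ValueError("argument does not fit on its own: %r" % path)
        # The current chunk is full: emit it and start a new one
        yield chunk
        chunk = [path]
    if chunk:
        yield chunk
```

Each yielded chunk would then be passed to a separate Popen call, for example `subprocess.Popen(['process_file.exe', '+p'] + chunk, stdout=subprocess.PIPE)`, with the outputs collected afterwards.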

Jay Conrod
gurney alex
  • I am not sure if I follow your suggestion regarding using args[1:] for the first parameter. I have updated my question with a code example. +1 for the link and tip on CreateProcess – Jesse Vogt Mar 04 '10 at 18:55
  • I tried this but am still hitting a limit: subprocess.Popen(cmd[1:] + paths,executable=cmd[0],stdout=subprocess.PIPE). For now I am using 32000 as the limit for the command length and calling my command multiple times and collecting all the output. I would like to be able to not have 32000 hard coded in but get that value from the environment. – Jesse Vogt Mar 04 '10 at 20:09
  • Well, as mentioned in the doc I quoted, the 32768 limit is hard coded in the CreateProcess primitive (that's the upper limit for 16-bit signed integers, i.e. 2**15). As list2cmdline will add quotes and spaces when building the command line, you will hit that limit before sum([len(a) for a in args]) reaches 2**15. Isn't there a way of using wildcards to pass your arguments to the executable? (Wildcards are generally processed by the executable under Windows.) – gurney alex Mar 05 '10 at 08:11
  • Good point. I was taking spaces into account but had not thought of quotes. I was using 32000 as my hard coded limit which must have left enough room for the spaces. In practice I saw ~190 files fitting for each run (total of about 4k files). Some are in the same directory but most are spread across multiple directories. For now I am just leaving the hard coded limit in. – Jesse Vogt Mar 05 '10 at 12:35
  • Since there has not been too much activity on this question, I am going to accept your answer since you were able to find a good limit. I would have liked to have been able to find a way to more generally get the limit from the OS but this will work. Thanks for the help! – Jesse Vogt Mar 05 '10 at 12:36
  • Well, the limit is hard coded on Windows to 2**15, and that's probably the case on 64-bit versions of that OS too. On POSIX systems, no limits, except your RAM: Popen uses execvp or execvpe, which use a NULL-terminated array of char* for arguments without size constraints. – gurney alex Mar 05 '10 at 14:15
  • @gurneyalex "No limits under POSIX" is not true; see my answer now. – tripleee Mar 02 '19 at 16:51

For Unix-like platforms, the kernel constant ARG_MAX is defined by POSIX. It is required to be at least 4096 bytes, though on modern systems, it's probably a megabyte or more.

On many systems, getconf ARG_MAX will reveal its value at the shell prompt.
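The same value can be read at run time from Python. This is a minimal sketch, assuming a platform where os.sysconf exists (it does not on Windows); the helper name get_arg_max is hypothetical.

```python
import os

def get_arg_max():
    """Return ARG_MAX in bytes, or None if it cannot be determined."""
    if not hasattr(os, "sysconf"):
        # e.g. Windows, where os.sysconf is not available
        return None
    try:
        value = os.sysconf("SC_ARG_MAX")
    except (ValueError, OSError):
        # The name is unknown or not supported on this system
        return None
    # A non-positive result means the limit is indeterminate
    return value if value > 0 else None

if __name__ == "__main__":
    print(get_arg_max())
```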

The shell utility xargs conveniently allows you to break up a long command line. For example, if

python myscript.py *

fails in a large directory because the list of files expands to a value whose length in bytes exceeds ARG_MAX, you can work around it with something like

printf '%s\0' * |
xargs -0 python myscript.py

(The option -0 is a GNU extension, but really the only completely safe way to unambiguously pass a list of file names which could contain newlines, quoting characters, etc.) Maybe also explore

find . -maxdepth 1 -type f -exec python myscript.py {} +

The way these work around the restriction is that they divide up the argument list if it's too long, and run myscript.py multiple times on as many arguments as they can fit onto the command line at a time. Depending on what myscript.py does, this can be exactly what you want, or catastrophically wrong. (For example, if it sums the numbers in the files you pass in, you will get multiple results for each set of arguments that it processed.)

Conversely, to pass a long list of arguments to subprocess.Popen() and friends, try something like

import subprocess

p = subprocess.Popen(['xargs', '-0', 'command'],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    stderr=subprocess.PIPE, universal_newlines=True)
# text mode, so communicate() accepts a str on Python 3
out, err = p.communicate('\0'.join(long_long_argument_list))

... where in most scenarios you should probably avoid raw Popen() and let a wrapper function like run() or check_call() do most of the work:

r = subprocess.run(['xargs', '-0', 'command'],
    input='\0'.join(long_long_argument_list),
    stdout=subprocess.PIPE,  # without this, r.stdout is None
    universal_newlines=True)
out = r.stdout

subprocess.run() supports text=True in 3.7+ as the new name of universal_newlines=True. Python versions older than 3.5 don't have run(), so there you need to fall back to the legacy functions check_output, check_call, or (rarely) call.

If you wanted to reimplement xargs in Python, you could start with something like this.

import os

def arg_max_args(args):
    """
    Split up the list in `args` into a list of lists
    where each list contains at most ARG_MAX bytes
    (including room for a terminating null byte for each
    entry)
    """
    # ARG_MAX also has to cover the environment, which this demo ignores
    arg_max = os.sysconf("SC_ARG_MAX")
    result = []
    sublist = []
    count = 0
    for arg in args:
        # count bytes, not characters
        argl = len(os.fsencode(arg)) + 1
        # `sublist and` avoids emitting an empty list for an oversized first entry
        if sublist and count + argl > arg_max:
            result.append(sublist)
            sublist = [arg]
            count = argl
        else:
            sublist.append(arg)
            count += argl
    if sublist:
        result.append(sublist)
    return result

Like the real xargs, you'd run a separate subprocess on each sublist returned by this function.

A proper implementation should raise an error if any one argument is larger than ARG_MAX but this is just a quick demo.
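For illustration, running one subprocess per sublist could look roughly like the sketch below. It assumes the batches have already been computed (for instance by the function above); run_in_batches is a hypothetical helper name, and the demo command is a stand-in Python one-liner rather than a real tool.

```python
import subprocess
import sys

def run_in_batches(command, batches):
    """Run `command` once per batch of arguments; return each run's stdout."""
    outputs = []
    for batch in batches:
        # check_output raises CalledProcessError if a run exits non-zero
        outputs.append(
            subprocess.check_output(list(command) + list(batch),
                                    universal_newlines=True))
    return outputs

if __name__ == "__main__":
    # Stand-in command: prints how many arguments it received
    demo = [sys.executable, "-c", "import sys; print(len(sys.argv) - 1)"]
    print(run_in_batches(demo, [["a", "b"], ["c"]]))
```

Whether the per-batch outputs can simply be concatenated depends on the tool, as noted earlier for commands that aggregate over all of their arguments.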

tripleee
  • Perhaps see also https://stackoverflow.com/a/51950538/874188 for a number of issues and amplifications around `subprocess` on U*x platforms. – tripleee Mar 02 '19 at 16:46
  • How is it possible that `xargs` itself manages to call `command` with all those args, while Python cannot? Does this mean the limitation lies in Python, not in the system itself? – Pedro A Aug 17 '22 at 05:04
  • `xargs` splits up the command line into smaller chunks. You could reimplement the same logic in Python, of course; but why should you when there is already a tool which does this. – tripleee Aug 17 '22 at 05:09
  • oh, cool, could you please show a sketch on how this would be done in python (editing the answer)? not that I want to reinvent the wheel, only that I think it would be very informative, to learn and understand... because I can't even imagine how it can be possible for one to create a process with less arguments and pass the rest afterwards... Thank you very much!! – Pedro A Aug 17 '22 at 14:30
  • @PedroA See updated answer now. – tripleee Aug 18 '22 at 05:32
  • Thank you very much, I see now! So it creates multiple subprocesses if needed. It won't help my particular use case though, because what I need is just one process, because I'm passing lots of flags to the command - something that can't be split, obviously. – Pedro A Aug 18 '22 at 14:43
  • In fact, I'm a little scared that xargs does this by default. I never thought it could decide to create multiple processes. I understand why it would work for things like `rm a b`, because it is equivalent to `rm a && rm b`, but surely there are multiple cases in which this fails. – Pedro A Aug 18 '22 at 14:45
  • If you need to pass in more data than will fit on a single command line to a single command, don't use the command line; have the tool read a configuration file, or simply stdin. There is no way to pass more information than `ARG_MAX` across an `exec` boundary, and besides, avoiding it is probably a more ergonomic as well as less memory-intensive design. – tripleee Aug 18 '22 at 15:03
  • Yeah, that's what I thought. Makes sense. Just to confirm: every kind of subprocess generation uses an "exec boundary", right? In my case, I can't change the tool (it's `docker run` with lots of `-e` flags). I will have to find another way. Thank you very much for the help!! – Pedro A Aug 18 '22 at 15:20
  • Yup, just to confirm your understanding; the Unix process creation system calls are `fork` (to clone the current process) and `exec` (to replace the code being run in the clone). – tripleee Aug 18 '22 at 15:26