I am currently running sed in a Python subprocess, but I am receiving the error:

"OSError: [Errno 7] Argument list too long: 'sed'"

The Python code is:

subprocess.run(['sed', '-i',
                '-e', 's/#/pau/g',
                *glob.glob('label_POS/label_phone_align/dump/*')], check=True)

Where the /dump/ directory has ~13,000 files in it. I have been told that I need to run the command for subsets of the argument list, but I can't find how to do that.

  • Instead of reinventing the wheel, you can just run `xargs` to invoke `sed` for you. Provide the filenames to the subprocess' stdin pipe instead of as command-line arguments. – Useless Jan 17 '20 at 16:06
  • Alternatively, ditch `sed` and use pure Python: https://stackoverflow.com/a/31499114/2790838 – Markus Jan 17 '20 at 16:19

2 Answers

Whoever told you that probably meant that you need to split up the glob and run multiple separate commands:

files = glob.glob('label_POS/label_phone_align/dump/*')
i = 0
scale = 100
# process in units of 100 filenames until we have them all
while scale*i < len(files):
    subprocess.run(['sed', '-i',
            '-e', 's/#/pau/g',
            *files[scale*i:scale*(i+1)]], check=True)
    i += 1

Because sed -i edits the files in place, there should be no output to amalgamate after the fact. The limit is not specific to sed: the operating system caps the total size of the arguments it will pass to a new process, and 13,000 paths apparently exceed it. You can keep changing scale until it doesn't error.
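
The batching loop above can also be factored into a small helper. This is only a sketch: chunked is a hypothetical name, the file names are placeholders, and the batch size of 100 is the same arbitrary starting point as scale above.

```python
def chunked(items, size):
    """Yield successive slices of items, each holding at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Example: 250 placeholder names split into batches of at most 100
names = [f"file{n}.lab" for n in range(250)]
batch_sizes = [len(batch) for batch in chunked(names, 100)]
print(batch_sizes)  # prints [100, 100, 50]
```

Each batch could then be passed to subprocess.run(['sed', '-i', '-e', 's/#/pau/g', *batch], check=True), exactly as in the loop above.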

Green Cloak Guy

Please scroll down to the end of this answer for the solution I recommend for your specific problem. There's a bit of background here for context and/or future visitors grappling with other "argument list too long" errors.

The exec() system call has a size limit: you cannot pass more than ARG_MAX bytes as arguments (together with the environment) to a process. On modern systems, the value of this system constant can usually be queried with the getconf ARG_MAX command.

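import glob
import subprocess

# Ask the system for the maximum size of the argument list, in bytes
arg_max = subprocess.run(['getconf', 'ARG_MAX'],
    text=True, check=True, capture_output=True
    ).stdout.strip()
arg_max = int(arg_max)
# Note: the environment usually counts against this limit too,
# so lower arg_max a little if a batch still fails

cmd = ['sed', '-i', '-e', 's/#/pau/g']
files = glob.glob('label_POS/label_phone_align/dump/*')
while files:
    # Size of the fixed part of the command, one terminator per argument
    size = sum(len(x) for x in cmd) + len(cmd)
    count = 0
    for name in files:
        size += 1 + len(name)
        if size > arg_max:
            break
        count += 1
    # Always pass at least one file so the loop cannot stall
    count = max(count, 1)
    subprocess.run(cmd + files[:count], check=True)
    files = files[count:]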
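
Filling each batch right up to ARG_MAX can still fail, because on most systems the environment counts against the same limit. As a rough sketch (arg_budget is a hypothetical helper, and the 2048-byte headroom is an assumption borrowed from the margin POSIX describes for xargs), a more conservative budget could be computed without spawning getconf:

```python
import os


def arg_budget(headroom=2048):
    """Estimate how many bytes are left for command-line arguments."""
    # Same value that getconf ARG_MAX reports, on systems that support it
    arg_max = os.sysconf('SC_ARG_MAX')
    # Each environment entry is stored as NAME=value plus a terminating NUL
    env_size = sum(len(os.fsencode(key)) + len(os.fsencode(value)) + 2
                   for key, value in os.environ.items())
    return arg_max - env_size - headroom


print(arg_budget())
```

The result could stand in for arg_max in the loop above. Some kernels also charge for the pointer to each argument, so treat it as an estimate rather than an exact bound.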

Of course, the xargs command already does exactly this for you.

import subprocess
import glob

subprocess.run(
    ['xargs', '-r', '-0', 'sed', '-i', '-e', 's/#/pau/g'],
    input=b'\0'.join([x.encode() for x in glob.glob('label_POS/label_phone_align/dump/*') + ['']]),
    check=True)

Simply removing the long path might be enough in your case, though. You are repeating label_POS/label_phone_align/dump/ in front of every file name in the argument array.

import glob
import subprocess
import os

path = 'label_POS/label_phone_align/dump'
files = [os.path.basename(file)
    for file in glob.glob(os.path.join(path, '*'))]
subprocess.run(
    ['sed', '-i', '-e', 's/#/pau/g', *files],
    cwd=path, check=True)
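
To judge whether dropping the directory prefix is likely to help, the size of the argument array can be estimated beforehand. A minimal sketch, assuming each argument costs its encoded length plus one terminating byte; argv_bytes is a hypothetical helper and the file names are made-up placeholders:

```python
import os


def argv_bytes(args):
    """Approximate bytes an argument list occupies: each string plus its NUL."""
    return sum(len(os.fsencode(arg)) + 1 for arg in args)


prefix = 'label_POS/label_phone_align/dump/'
# Placeholder names standing in for the real directory listing
names = [f'utt{n:05d}.lab' for n in range(13000)]

with_prefix = argv_bytes([prefix + name for name in names])
without_prefix = argv_bytes(names)
print(with_prefix - without_prefix)  # 33 bytes saved per file
```

Comparing without_prefix against the system's ARG_MAX shows whether a single sed invocation has a chance of fitting once the prefix is gone.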

Ultimately, perhaps prefer a pure Python solution.

import glob
import fileinput

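# With inplace=True, standard output is redirected into the file being read
for line in fileinput.input(glob.glob('label_POS/label_phone_align/dump/*'), inplace=True):
    # line already ends with a newline, so suppress the one print() would add
    print(line.replace('#', 'pau'), end='')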
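
If the files are small enough to read whole, a variant that works on raw bytes avoids both fileinput and any guess about text encoding. This is a sketch under that assumption, with replace_in_dir as a hypothetical helper name; it rewrites only files that actually contain the byte sequence:

```python
from pathlib import Path


def replace_in_dir(directory, old=b'#', new=b'pau'):
    """Replace old with new in every regular file directly inside directory.

    Returns the number of files that were rewritten.
    """
    rewritten = 0
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue
        data = path.read_bytes()
        # Skip the write entirely when there is nothing to replace
        if old in data:
            path.write_bytes(data.replace(old, new))
            rewritten += 1
    return rewritten
```

Calling replace_in_dir('label_POS/label_phone_align/dump') would cover the directory from the question. Note that iterdir() also visits hidden files, which the * glob pattern skips.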
tripleee
  • Apologies for the delay, but the code takes hours to run and I can't feasibly try it more than twice a day. Your first suggestion using `arg_max` raised a `TypeError: __init__() got an unexpected keyword argument 'text'`, so I changed `text=True` to `universal_newlines=True` as per a comment on this post: https://stackoverflow.com/q/52663518/11035198 and am running it now. If you have any other suggestions, let me know. –  Jan 20 '20 at 13:29
  • Also, the path was not the issue: I ran the code that removed the long path and received the same error. –  Jan 20 '20 at 13:30
  • `text=True` was introduced in Python 3.7; like you discovered, `universal_newlines=True` should work with older versions of Python (but is somewhat misleadingly named, as well as cumbersome to type). You can simulate a run by replacing the value of `arg_max` with a small number like 200. But the final code is probably what you should switch to; it completely removes the subprocess and so works around the problem entirely (and should be more efficient to boot). – tripleee Jan 20 '20 at 13:41
  • "The path was not the issue" isn't really correct. If you are lucky, removing the 13,000 copies of the path will reduce the argument array's size enough to avoid the `ARG_MAX` error, but obviously that also depends on how long the names of the individual files are. On my Mac `ARG_MAX` is 262,144 so with that, we could accommodate 13,000 files whose average file name length is 20 characters in a single `sed` invocation. On Linux I think you should find that the limit is significantly higher. Conversely, repeating `label_POS/label_phone_align/dump/` for every file already consumes 429,000 bytes. – tripleee Jan 20 '20 at 13:45