
I am trying to execute a Python script on all text files in a folder:

for fi in sys.argv[1:]:

And I get the following error

-bash: /usr/bin/python: Argument list too long

The way I call this Python function is the following:

python functionName.py *.txt

The folder has around 9000 files. Is there some way to run this function without having to split my data into more folders, etc.? Splitting the files would not be very practical, because I will have to run the function on even more files in the future... Thanks

EDIT: Based on the selected correct reply and the comments of the replier (Charles Duffy), what worked for me is the following:

printf '%s\0' *.txt | xargs -0 python ./functionName.py

because I don't have a valid shebang.
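
For reference, a minimal sketch of the shebang route described in Charles Duffy's comments below (the processing body is a placeholder, not my actual code):

#!/usr/bin/env python
import sys

for fi in sys.argv[1:]:
    ...  # process each text file here

With that first line in place, the script can be marked executable (chmod +x functionName.py) and invoked directly by find or xargs.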

  • This is not caused by Python itself but by the OS you use. Here is a related link on that topic: http://stackoverflow.com/questions/5533704/python-sys-argv-limitations But anyway, this is not best practice; try something like iced suggested. – Igl3 Mar 10 '15 at 13:42
  • On a different point -- Python *modules* should have `.py` extensions. *Executables* written in Python shouldn't have any extension -- executables define commands, and you don't run `ls.elf` -- but instead should use a shebang to indicate their interpreter (`#!/usr/bin/env python` or such) and be marked executable (`chmod +x functionName`). – Charles Duffy Mar 10 '15 at 13:56
  • ...if you use setuptools, it'll automatically build and install wrapper executables for you that invoke the functions you want to be runnable; these wrappers, properly, are executable commands with no extensions. – Charles Duffy Mar 10 '15 at 13:57
  • Related: [Does "argument list too long" restriction apply to shell builtins?](https://stackoverflow.com/questions/47443380/does-argument-list-too-long-restriction-apply-to-shell-builtins) – codeforester Nov 22 '17 at 21:20

4 Answers

6

This is an OS-level problem (limit on command line length), and is conventionally solved with an OS-level (or, at least, outside-your-Python-process) solution:

find . -maxdepth 1 -type f -name '*.txt' -exec ./your-python-program '{}' +

...or...

printf '%s\0' *.txt | xargs -0 ./your-python-program

Note that this runs your-python-program once per batch of files found, where the batch size is dependent on the number of names that can fit in ARG_MAX; see the excellent answer by Marcus Müller if this is unsuitable.
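
If the program has no shebang line or execute bit yet, the interpreter can also be named explicitly, as discussed in the comments below; a variant of the first command:

find . -maxdepth 1 -type f -name '*.txt' -exec python ./your-python-program '{}' +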

Charles Duffy
  • When I try the first I am getting the following error: `sudo find . -maxdepth 1 -type f -name '*.txt' -exec ./removeHtmlAndUnuuencode.py '{}' +` -- `find: ./functionName.py: Permission denied` `find: ./functionName.py: Permission denied` – adrCoder Mar 10 '15 at 13:51
  • When I try the second I am getting the following error: `xargs: ./functionName.py: Permission denied` `-bash: printf: write error: Broken pipe` – adrCoder Mar 10 '15 at 13:52
  • The "permission denied" means what it says; you need to `chmod +x ./functionName.py`, and be sure it starts with a shebang (`#!/usr/bin/env python`, or `#!/usr/bin/env python2`, as appropriate for your OS). Or `-exec python ./your-python-program {} +` to avoid the need for a shebang at all. – Charles Duffy Mar 10 '15 at 13:52
  • File "./functionName.py", line 68, in with open("strip_" + fi, "w") as f: IOError: [Errno 2] No such file or directory: 'strip_./0000950144-08-001292.txt' the initial file is called 0000950144-08-001292.txt' and the processed should be called strip_0000950144-08-001292.txt but why does it try to add ./ and then fails ? – adrCoder Mar 10 '15 at 13:58
  • The "why" is "because that's how find works". Use the second formulation (with xargs) if your program can't deal with that. – Charles Duffy Mar 10 '15 at 13:59
  • I upvoted you for all the effort and what you're writing (many of which I can not understand :) ), hope we can solve this issue – adrCoder Mar 10 '15 at 13:59
  • from: can't read /var/mail/bs4 ./functionName.py: line 2: import: command not found ./functionName.py: line 3: import: command not found ./functionName.py: line 6: syntax error near unexpected token `(' ./functionName.py: line 6: `def unuuencode (iterator, collector=None, ignore_length_errors=False):' from: can't read /var/mail/bs4 is the error I get when I try to use the second approach.. – adrCoder Mar 10 '15 at 14:01
  • That looks like you don't have a valid shebang line in your script, so it's being treated as a shell script instead of a Python script. Make it `xargs -0 python ./your-script` if you don't want to fix that immediately (I've described what a valid Python shebang looks like elsewhere). – Charles Duffy Mar 10 '15 at 14:02
  • Awesome. Seems to be working (first 2 files are already produced so I suppose it should be ok now). Thanks @Charles Duffy! Could you please tell what "a valid shebang line in my script" would be? Thanks again ! – adrCoder Mar 10 '15 at 14:06
  • A shebang line would be a first line starting with the characters `#!` and then the name of the interpreter to use to run the script (like `/usr/bin/python`). Thus, `#!/usr/bin/python` is an example of a valid shebang. – Charles Duffy Mar 10 '15 at 14:09
  • Ok perfect. I didn't know about this before. It seems like the path to where the compiler for Python is found so that when you call the function it knows which tool to use in order to properly execute the program. (If I understand correctly) – adrCoder Mar 10 '15 at 14:10
  • Thank you so very much for the flattery, by the way :) – Marcus Müller Mar 10 '15 at 17:32
2

No. That is a kernel limitation for the length (in bytes) of a command line.

Typically, you can determine that limit by doing

getconf ARG_MAX

which, at least for me, yields 2097152 (bytes), which means about 2MB.
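
The same limit can also be read from within Python, should you want to check it programmatically; a small illustrative snippet (assuming a Unix-like OS):

import os
print(os.sysconf('SC_ARG_MAX'))  # same value that `getconf ARG_MAX` reports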

I recommend using Python to work through the folder yourself, i.e. giving your Python program the ability to work with directories instead of individual files, or to read file names from a file.

The former can easily be done using os.walk(...), whereas the second option is (in my opinion) the more flexible one. Use the argparse module to give your Python program an easy-to-use command line syntax, then add an argument of type argparse.FileType (see the reference documentation), and argparse will automatically understand the special filename -, meaning standard input. So you could, instead of

for fi in sys.argv[1:]:

do

for fi in opts.file_to_read_filenames_from.read().split(chr(0)):

which would even allow you to do something like

find . -iname '*.txt' -type f -print0 | ./my_python_program.py --file-to-read-filenames-from -
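
A minimal, self-contained sketch of that approach (the option name and the per-file processing are placeholders, not taken from the original program):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--file-to-read-filenames-from',
                    type=argparse.FileType('r'), default='-',
                    help="file containing NUL-delimited filenames; '-' means stdin")
opts = parser.parse_args()

# find ... -print0 terminates every name with NUL, so drop empty entries
for fi in filter(None, opts.file_to_read_filenames_from.read().split(chr(0))):
    with open(fi) as f:
        pass  # process each text file here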
Marcus Müller
  • `printf '%s\n' *.txt` is safer than `ls *.txt` -- see http://mywiki.wooledge.org/ParsingLs – Charles Duffy Mar 10 '15 at 13:53
  • Even better than that, though, would be `printf '%s\0' *.txt` and interpreting the NUL-delimited stream that generates -- otherwise, filenames with literal newlines (yes, they're legal in UNIX) will throw you off. – Charles Duffy Mar 10 '15 at 13:54
  • (`readlines()` is the wrong tool for the job for the same reason). – Charles Duffy Mar 10 '15 at 13:54
  • (This is kibitzing, though -- I like this answer more than most here; it'll have my upvote with some minor improvements). – Charles Duffy Mar 10 '15 at 14:01
  • @CharlesDuffy: totally on your side here. Wait, I'll fix it – Marcus Müller Mar 10 '15 at 14:28
  • @CharlesDuffy: better now? – Marcus Müller Mar 10 '15 at 14:30
  • Much! In a case where my answer isn't suitable (for a tool that needs to operate on all files in a single invocation, for instance), or if one just wants the added efficiency of only doing a single fork/exec rather than one per (slightly-less-than-)`ARG_MAX`-sized batch of files, this is absolutely the right way to go. – Charles Duffy Mar 10 '15 at 15:29
  • Maybe you'd want to add that piece of information (i.e. the program being invoked on every single file in isolation) to your answer more explicitly, btw. – Marcus Müller Mar 10 '15 at 17:08
  • Pardon? My answer doesn't execute once per file -- it runs the program on every _batch_, where batch size is dependent on the number of filenames that can fit inside `ARG_MAX`. But yes, that's worth mentioning, and I've made an appropriate edit. – Charles Duffy Mar 10 '15 at 17:10
  • (Using `find ... -exec ... {} ';'` would be once-per-file, but the newer [POSIX-standardized in 2006, IIRC] `-exec ... {} +` behavior removes that limitation). – Charles Duffy Mar 10 '15 at 17:13
  • @CharlesDuffy: Ahhh nice, didn't know about ` {} +`, sorry. – Marcus Müller Mar 10 '15 at 17:27
1

Don't do it this way. Pass mask to your python script (e.g. call it as python functionName.py "*.txt") and expand it using glob (https://docs.python.org/2/library/glob.html).

iced
  • Sure, one _can_ do that, but it's not conventional/standard behavior on UNIX by any means. `ls` takes a list of filenames, not a wildcard (for instance); same for `tar`, and... well, pretty much every other standard tool. – Charles Duffy Mar 10 '15 at 13:49
  • Also, good luck passing a file with a glob in its name that matches other files, without having your program operate against those other files as well. – Charles Duffy Mar 10 '15 at 13:50
  • `ls` will not work for him in this case either. – iced Mar 10 '15 at 13:50
  • correct, `ls` will have the same failure -- but there's a standard way to solve this for `ls`, and that standard way will work here as well. – Charles Duffy Mar 10 '15 at 13:51
  • There is no "standard" way to do this except making `ls` part of the shell (so it can hook in before glob expansion). Or expanding the mask and piping the file list to stdin, of course. – iced Mar 10 '15 at 14:03
  • `xargs` is a POSIX-standardized tool. `find -exec {} +` was added to POSIX in 2006, so that too is literally standardized. – Charles Duffy Mar 10 '15 at 14:05
  • Also, `pax` is a POSIX-standardized tool, and that reads lists of filenames on stdin, another mechanism for doing this literally encoded in the standard. – Charles Duffy Mar 10 '15 at 14:06
  • ...and no, a shell couldn't work around this in any way other than how `xargs` and `find` already do -- the argument length limit is an OS limitation (on the size of the buffer used for both environment variables and command-line arguments), not a shell limitation. That limit is present even directly calling `execv()`-family functions from C. – Charles Duffy Mar 10 '15 at 14:08
  • There is no `execv()` if `ls` is just a function inside the shell (and sometimes it is, depending on the shell). – iced Mar 10 '15 at 17:45
  • Sure, but this limitation doesn't exist at all if it's a function inside the shell being called. – Charles Duffy Mar 10 '15 at 17:48
  • Oh! I get where you're going now. :) – Charles Duffy Mar 10 '15 at 17:53
1

I would think about using the glob module. With this module you invoke your program like:

python functionName.py "*.txt"

Then the shell will not expand *.txt into file names. Your Python program will receive *.txt in its argument list, and you can pass it to glob.glob():

import glob, sys

for fi in glob.glob(sys.argv[1]):
    ...
Michał Niklas
  • `Traceback (most recent call last): File "functionName.py", line 64, in for fi in glob.glob(sys.argv[1:]): File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/glob.py", line 27, in glob return list(iglob(pathname)) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/glob.py", line 38, in iglob if not has_magic(pathname): File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/glob.py", line 95, in has_magic return magic_check.search(s) is not None TypeError: expected string or buff` – adrCoder Mar 10 '15 at 13:46
  • ^ That's the error I am getting when I try to do what you're saying – adrCoder Mar 10 '15 at 13:47
  • Yes. I edited it now with the assumption that only one file mask will be in use. – Michał Niklas Mar 10 '15 at 13:58