4

I have been trying to get a list of files matching a glob pattern in a command line argument (sys.argv[1]) recursively using glob.glob and os.walk. The problem is, bash (and many other shells it seems) auto-expand glob patterns into filenames.

How do standard unix programs (e.g. grep -R) do this then? I realize they're not in python, but if this is happening at the shell level, that shouldn't matter, right? Is there a way for a script to tell the shell to not auto-expand glob patterns? It looks like set -f will disable globbing, but I'm not sure how to run this early enough, so to speak.

I've seen "Use a Glob() to find files recursively in Python?", but that doesn't cover actually getting the glob patterns from command line arguments.

Thanks!

Edit:

The grep-like Perl script ack accepts a Perl regex as one of its arguments. Thus, ack .* prints out every line of every file. But .* should expand to all hidden files in a directory. I tried reading the script, but I don't know Perl; how can it do this?

Bryan Head

3 Answers

6

The shell performs glob expansion before it even thinks of invoking the command. Programs such as grep don't do anything to prevent globbing: they can't. You, as the caller of these programs, must tell the shell that you want to pass the special characters such as * and ? to the program, and not let the shell interpret them. You do that by putting them inside quotes:

grep -E 'ba(na)* split' *.txt

(This looks for ba split, bana split, etc., in all files called <something>.txt.) In this case, either single quotes or double quotes will do the trick. Between single quotes, the shell expands nothing. Between double quotes, $, ` and \ are still interpreted. You can also protect a single character from shell expansion by preceding it with a backslash. It's not only wildcard characters that need to be protected; for example, above, the space in the pattern is in quotes so it's part of the argument to grep and not an argument separator. Alternative ways to write the snippet above include

grep -E "ba(na)* split" *.txt
grep -E ba\(na\)\*\ split *.txt
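As a rough illustration of the quoting rules above, Python's shlex module can stand in for the shell's word splitting and quote removal. This is only a sketch: shlex.split does not perform glob expansion or variable substitution, so it shows what the three spellings have in common (the same pattern argument), not what a real shell would do with the unquoted *.txt.

```python
import shlex

# Three spellings of the same command line from the answer above.
commands = [
    r"grep -E 'ba(na)* split' *.txt",
    r'grep -E "ba(na)* split" *.txt',
    r"grep -E ba\(na\)\*\ split *.txt",
]

for command in commands:
    # shlex.split applies POSIX-style quote and backslash removal,
    # but performs no glob expansion, so *.txt stays literal here.
    print(shlex.split(command))
```

All three lines split into the same four words, with the pattern and its embedded space kept together as a single argument.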

With most shells, if an argument contains wildcards but the pattern doesn't match any file, the pattern is left unchanged and passed to the underlying command. So a command like

grep b[an]*a *.txt

has a different effect depending on what files are present in the current directory. If no file name there matches b[an]*a (for example, if no name begins with b), the command searches for the pattern b[an]*a in the files whose names match *.txt. If the current directory contains files named baclava, bnm and hello.txt, the command expands to grep baclava hello.txt (bnm is not matched because it does not end in a), so it searches for the pattern baclava in hello.txt. Needless to say, it's a bad idea to rely on this in scripts; on the command line it can occasionally save typing, but it's risky.
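The "leave the pattern alone when nothing matches" rule can be sketched in Python with fnmatch. The function name expand_args and the file lists are made up for illustration, and fnmatch is only an approximation of shell globbing (for instance, its * also matches a leading dot), so treat this as a model of the idea rather than of any particular shell.

```python
from fnmatch import fnmatchcase


def expand_args(args, filenames):
    """Mimic a Bourne-style shell: replace each wildcard argument with its
    sorted matches, or leave it unchanged when nothing matches."""
    expanded = []
    for arg in args:
        has_wildcard = any(ch in arg for ch in "*?[")
        matches = sorted(name for name in filenames if fnmatchcase(name, arg))
        if has_wildcard and matches:
            expanded.extend(matches)
        else:
            # No wildcard, or no file matched: pass the word through as typed.
            expanded.append(arg)
    return expanded


command = ["grep", "b[an]*a", "*.txt"]
print(expand_args(command, ["hello.txt"]))
print(expand_args(command, ["baclava", "bnm", "hello.txt"]))
```

Running it with the two directory listings shows the same command line turning into two different grep invocations.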

When you run ack .* in a directory containing no dot files, the shell runs ack . .. (in shells where .* matches the directory entries . and .., as bash traditionally does). ack then prints all non-empty lines (the pattern . matches any one character) in all files under .. (the parent of the current directory), recursively. Contrast this with ack '.*', which searches for the pattern .* (which matches anything) in the current directory and its subdirectories, because ack searches the current directory when you don't pass any file name argument.

Gilles 'SO- stop being evil'
1

When it comes to grep, it simply accepts a list of filenames and doesn't do the glob expansion itself. If you really need to pass a pattern as an argument, it has to be quoted on the command line with single quotes. But before you do that, consider letting the shell do the job it was designed for.
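If you go the "let the shell expand it" route, the Python script only needs to accept a list of filenames, the way grep does. Here is a minimal sketch using argparse; the function names and the plain substring search are illustrative assumptions, not anything from the question.

```python
import argparse


def parse_args(argv):
    """Parse a pattern followed by zero or more already-expanded filenames."""
    parser = argparse.ArgumentParser(description="Search files for a pattern.")
    parser.add_argument("pattern", help="text to search for")
    parser.add_argument("files", nargs="*",
                        help="files to search (expanded by the shell)")
    return parser.parse_args(argv)


def search(pattern, files):
    """Yield (filename, line_number, line) for each line containing pattern."""
    for filename in files:
        with open(filename, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if pattern in line:
                    yield filename, number, line.rstrip("\n")
```

A script built on this would call parse_args(sys.argv[1:]) and be invoked as, say, myprogram foo *.txt, with the shell supplying the file list.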

Adam Byrtek
  • Ah, I see, that's a good point. Been using grep for years and never noticed that it doesn't actually do anything with glob like patterns (ditto for other unix commands). Ah well, thanks! – Bryan Head May 23 '11 at 22:41
  • That's in line with the Unix philosophy that each tool should have a separate responsibility. – Adam Byrtek May 23 '11 at 23:08
1

Yes, set -f, you're on the right track.

It sounds like you are going to call your python program from a shell.

Any time you use a shell to issue a command, it scans the command line and processes wildcards, command substitution and a whole bunch of other things.

So you have to turn off globbing before you run the program on the command line:

set -f
echo *
*

myprogram *.txt

will pass the string '*.txt' to your program. Then you can use the internal globbing to get your files.
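For the "internal globbing" part, here is one possible sketch on the Python side, combining os.walk with fnmatch as the question suggests. The names find_files and main are made up for illustration, and the pattern is matched against file names only, not against whole paths.

```python
import fnmatch
import os


def find_files(root, pattern):
    """Return paths under root whose file name matches the glob pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def main(argv):
    # argv[1] is expected to hold the unexpanded pattern, e.g. *.txt.
    for path in find_files(os.curdir, argv[1]):
        print(path)
```

A script would finish by calling main(sys.argv) after importing sys. On Python 3.5 and later, glob.glob with recursive=True and a ** pattern is another option.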

OR you can do essentially the same thing by creating a wrapper script

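 #!/bin/bash
 # Disable pathname expansion inside this wrapper only.
 set -f
 # Quote the expansion so each argument is passed through unchanged.
 myProgram "${@}"

where "${@}" are the arguments you pass in when you start the wrapper script, either from the command line, crontab or via exec(...) from another process. Note that set -f here only affects the wrapper itself: if the wrapper is started from an interactive shell, that shell has already expanded any unquoted patterns, so the patterns must still be quoted (or set -f must already be in effect) in the calling shell.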
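To illustrate the "from another process" case: when a program is started with an explicit argument list and no shell in between, nothing expands the pattern. The sketch below uses Python's subprocess module, with a one-line Python child standing in for myProgram (a hypothetical stand-in, not the real program).

```python
import subprocess
import sys

# Hypothetical stand-in for myProgram: a child that prints its first argument.
child = [sys.executable, "-c", "import sys; print(sys.argv[1])", "*.txt"]

# With an argument list and no shell=True, no shell parses the command,
# so the child receives the literal string *.txt.
result = subprocess.run(child, capture_output=True, text=True, check=True)
print(result.stdout.strip())
```

This is the same reason quoting works on the command line: globbing is something the shell does, not the program or the operating system.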

I hope this helps.

shellter
  • Do you mean explicitly run `set -f` at the shell first and then run the program? I supposed wrapping the Python program in a bash script that calls `set -f` first would not work... Ah well, thanks! – Bryan Head May 23 '11 at 22:42
  • Just tried that, and, alas, the program still got the expanded filenames. The same thing occurred when I replaced `myProgram` with `echo ${@}`; it printed the filenames, not the glob. – Bryan Head May 23 '11 at 22:55
  • D'oh. Yes, ${@} gets the arguments from the command line ($1, $2 ... $n), meaning those values are already expanded. So, as a previous comment (which I don't see now) said, you need to send in the arguments wrapped in single quotes, i.e. `myWrapper ... '*'` ... Good luck! – shellter May 24 '11 at 02:51