189

I have a directory with a bunch of files inside: eee2314, asd3442 ... and eph.

I want to exclude all files that start with eph with the glob function.

How can I do it?

Stefan van den Akker
Anastasios Andronidis

12 Answers

290

The pattern rules for glob are not regular expressions. Instead, they follow standard Unix path expansion rules. There are only a few special characters: two different wild-cards, and character ranges are supported [from pymotw: glob – Filename pattern matching].

So you can exclude some files with a pattern.
For example, to exclude manifest files (files starting with _) with glob, you can use:

files = glob.glob('files_path/[!_]*')
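
As a rough sketch of how the `[!_]` class behaves, the snippet below applies the same wildcard syntax through `fnmatch.filter` to an in-memory list, so no real directory is needed. The file names are hypothetical and chosen only for illustration. It also shows why chaining negated classes is not the same as "does not start with eph".

```python
import fnmatch

# Hypothetical file names, chosen only for illustration.
names = ["_manifest", "eee2314", "asd3442", "eph", "eph_file"]

# [!_] matches exactly one character that is not an underscore,
# so names starting with "_" are dropped.
no_manifests = fnmatch.filter(names, "[!_]*")

# Chaining classes does NOT mean "does not start with eph": each class
# rejects its own position independently, so "eee2314" is lost as well.
not_eph_attempt = fnmatch.filter(names, "[!e][!p][!h]*")

print(no_manifests)
print(not_eph_attempt)
```

The second pattern keeps `_manifest` and `asd3442` only, which is why a single-character exclusion works well for a prefix like `_` but not for a multi-character prefix.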
Kenly
  • 20
    This must be at official documentation, please somebody add this to https://docs.python.org/3.5/library/glob.html#glob.glob – Vitaly Zdanevich Jul 12 '16 at 06:40
  • 18
    Note that glob patterns can't directly fulfill the requirement set out by the OP: to exclude only files that start with `eph` but can start with anything else. `[!e][!p][!h]` will filter out files that start with `eee` for example. – Martijn Pieters Jan 08 '19 at 13:23
  • 2
    @VitalyZdanevich it is in the documentation for fnmatch: https://docs.python.org/3/library/fnmatch.html#module-fnmatch – Wasi Master Aug 19 '21 at 07:52
108

You can subtract one set from another and cast the result back to a list:

list(set(glob("*")) - set(glob("eph*")))
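
A minimal sketch of the same idea wrapped in a helper. The function name `glob_excluding` and the demo file names are made up for this example. Because sets are unordered, the result is sorted to get a predictable list back.

```python
import glob
import os
import tempfile


def glob_excluding(directory, include="*", exclude="eph*"):
    """Return sorted paths matching `include` but not `exclude` in `directory`."""
    # glob.escape() protects any wildcard characters in the directory name itself.
    base = glob.escape(directory)
    included = set(glob.glob(os.path.join(base, include)))
    excluded = set(glob.glob(os.path.join(base, exclude)))
    # Set difference drops every path that also matched the exclude pattern.
    return sorted(included - excluded)


# Demo in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("eee2314", "asd3442", "eph", "eph334"):
        open(os.path.join(tmp, name), "w").close()
    kept = [os.path.basename(p) for p in glob_excluding(tmp)]

print(kept)
```

Note that this still lists the directory twice, once per pattern.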
Community
neutrinus
  • 7
    Really interesting solution! But in my case it is going to be extremely slow to read the directory twice. Also, if the folder is large and sits on a network directory, it is going to be slow again. But in any case, really handy. – Anastasios Andronidis Feb 03 '14 at 18:56
  • Your operating system should cache filesystem requests so not so bad :) – neutrinus Feb 04 '14 at 14:26
  • Tried this myself, I just got TypeError: unsupported operand type(s) for -: 'list' and 'list' – Tom Busby Jul 17 '14 at 16:21
  • 1
    @TomBusby Try converting them to sets: `set(glob("*")) - set(glob("eph*"))` (and notice * at the end of "eph*") – Jaszczur Sep 10 '14 at 13:48
  • 2
    Just as a side note, glob returns lists and not sets, but this kind of operation only works on sets, hence why [neutrinus](https://stackoverflow.com/users/1216074/neutrinus) cast it. If you need it to remain a list, simply wrap the entire operation in a cast: `list(set(glob("*")) - set(glob("eph*")))` – Nathan Smith Aug 10 '17 at 21:48
  • Genius answer!! – Philippe Remy Jun 21 '20 at 01:37
  • really annoying that `glob` doesn't take `pathlib` objects. Now your beautiful answer has so many parentheses it makes me puke. `tasks_folder = set(glob(str(data_path / '*'))) - set(glob(str(data_path / "f_avg"))) ` – Charlie Parker Jul 20 '20 at 16:53
59

You can't exclude patterns with the glob function, globs only allow for inclusion patterns. Globbing syntax is very limited (even a [!..] character class must match a character, so it is an inclusion pattern for every character that is not in the class).

You'll have to do your own filtering; a list comprehension usually works nicely here:

import os
from glob import glob
files = [fn for fn in glob('somepath/*.txt')
         if not os.path.basename(fn).startswith('eph')]
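
If reading the directory only once matters, one possible alternative (not part of the original answer) is to iterate over `os.scandir` and apply both the inclusion pattern and the exclusion test per entry. This is only a sketch; the helper name `iter_files_excluding_prefix` and the demo file names are assumptions made for the example.

```python
import fnmatch
import os
import tempfile


def iter_files_excluding_prefix(directory, pattern="*", prefix="eph"):
    """Yield paths of entries in `directory` matching `pattern`, skipping `prefix`."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name.startswith(prefix):
                continue  # excluded by name, never collected
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry.path


# Demo in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("eee2314.txt", "asd3442.txt", "eph.txt", "notes.md"):
        open(os.path.join(tmp, name), "w").close()
    found = sorted(
        os.path.basename(p) for p in iter_files_excluding_prefix(tmp, "*.txt")
    )

print(found)
```

Unlike `glob`, this sketch does not hide names that begin with a dot, so add that check yourself if you rely on it.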
Martijn Pieters
  • 3
    Use ``iglob`` here to avoid storing the full list in memory – Eugene Pankov Oct 24 '14 at 08:14
  • 5
    @Hardex: internally, `iglob` produces lists *anyway*; all you do is lazily evaluate the filter. It won't help to reduce the memory footprint. – Martijn Pieters Oct 24 '14 at 08:38
  • @Hardex: if you use a glob in the *directory name* then you'd have a point, then at most one `os.listdir()` result is kept in memory as you iterate. But `somepath/*.txt` has to read all filenames in one directory in memory, then reduce that list down to only those that match. – Martijn Pieters Oct 24 '14 at 08:42
  • you're right, it's not that important, but in stock CPython, ``glob.glob(x) = list(glob.iglob(x))``. Not much of an overhead but still good to know. – Eugene Pankov Oct 24 '14 at 10:45
  • Doesn't this iterate twice?. Once through the files to get the list and the second through the list itself? If so, is it not possible to do it in one iteration? – Ridhuvarshan Nov 09 '18 at 12:41
  • @Ridhuvarshan: No, the list comprehension does just the one iteration. But if all you are going to do with the `files` list is iterate, then you could just as well make it a generator expression. – Martijn Pieters Nov 09 '18 at 14:38
  • Correct me if I am wrong, but shouldn't this be `glob.glob()`? At least that's how I got it working. – Felix Phl Jan 27 '20 at 09:54
  • @FelixPhl: depends on how you import it. If you use `from glob import glob` then the global name `glob` in your module is the function. If you use `import glob` then the global name is the module, and you need to use `glob.glob()`. – Martijn Pieters Jan 27 '20 at 11:46
16

Compared with glob, I recommend pathlib. Filtering one pattern is very simple.

from pathlib import Path

p = Path(YOUR_PATH)
filtered = [x for x in p.glob("**/*") if not x.name.startswith("eph")]

And if you want to filter a more complex pattern, you can define a function to do that, just like:

def not_in_pattern(x):
    return (not x.name.startswith("eph")) and not x.name.startswith("epi")


filtered = [x for x in p.glob("**/*") if not_in_pattern(x)]

Using that code, you can filter out all files whose names start with eph or epi.

aschultz
Scott Ming
12

Late to the game, but you could alternatively just apply Python's filter to the result of a glob:

files = glob.iglob('your_path_here')
files_i_care_about = filter(lambda x: not x.startswith("eph"), files)

or replacing the lambda with an appropriate regex search, etc...

EDIT: I just realized that if you're using full paths the startswith won't work, so you'd need a regex

In [10]: a
Out[10]: ['/some/path/foo', 'some/path/bar', 'some/path/eph_thing']

In [11]: list(filter(lambda x: not re.search('/eph', x), a))
Out[11]: ['/some/path/foo', 'some/path/bar']
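
The regex `'/eph'` also matches a directory whose name starts with eph anywhere in the path. A sketch of a variant that tests only the final path component with `os.path.basename` is shown below. The sample paths, including `/eph_dir/keep_me`, are hypothetical.

```python
import os

# Hypothetical paths; the last one sits inside a directory that starts with "eph".
paths = ["/some/path/foo", "some/path/bar", "some/path/eph_thing", "/eph_dir/keep_me"]


def name_starts_with_eph(path):
    # Compare only the final path component, not the directories above it.
    return os.path.basename(path).startswith("eph")


# In Python 3, filter() is lazy, so wrap it in list() to materialise the result.
kept = list(filter(lambda p: not name_starts_with_eph(p), paths))

print(kept)
```

With the regex version, `/eph_dir/keep_me` would be dropped as well, which may or may not be what you want.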
K Raphael
6

How about skipping the particular files while iterating over all the files in the folder? The code below skips all Excel files that start with 'eph':

import glob
import re

for file in glob.glob('*.xlsx'):
    # Raw string so that \. reaches the regex engine as an escaped dot
    if re.match(r'eph.*\.xlsx', file):
        continue
    # do your stuff here
    print(file)

This way you can use more complex regex patterns to include/exclude a particular set of files in a folder.
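
As one illustration of a more complex pattern, inclusion and exclusion can be folded into a single regex with a negative lookahead. This is a sketch on a hypothetical list of names rather than a real folder; the constant name `KEEP` is an assumption for the example.

```python
import re

# Keep *.xlsx names that do not start with "eph".
# (?!eph) is a negative lookahead: the match fails if "eph" comes next.
KEEP = re.compile(r"(?!eph).*\.xlsx")

# Hypothetical file names, chosen only for illustration.
names = ["eee2314.xlsx", "eph.xlsx", "eph_report.xlsx", "asd3442.xlsx", "notes.txt"]

# fullmatch() requires the whole name to fit the pattern.
kept = [n for n in names if KEEP.fullmatch(n)]

print(kept)
```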

Azhar Ansari
4

More generally, to exclude files that don't match some shell-style pattern, you could use the fnmatch module:

import fnmatch
from glob import glob

file_list = glob('somepath')
# Build a new list instead of popping while iterating, which would skip entries
file_list = [ii for ii in file_list
             if fnmatch.fnmatch(ii, 'bash_regexp_with_exclude')]

The above first generates a list from the given path and then keeps only the files that satisfy the pattern with the desired constraint.
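
A sketch of the reverse direction, dropping every path whose base name matches any of several shell-style patterns. The helper name `exclude_by_patterns` and the sample paths are hypothetical.

```python
import fnmatch
import os


def exclude_by_patterns(paths, exclude_patterns):
    """Return the paths whose base name matches none of `exclude_patterns`."""
    return [
        p for p in paths
        if not any(
            # fnmatchcase() compares case-sensitively on every platform.
            fnmatch.fnmatchcase(os.path.basename(p), pat)
            for pat in exclude_patterns
        )
    ]


# Hypothetical paths, chosen only for illustration.
paths = ["data/eee2314", "data/asd3442", "data/eph", "data/eph334", "data/epi9"]
kept = exclude_by_patterns(paths, ["eph*", "epi*"])

print(kept)
```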

Lord Henry Wotton
4

Suppose you have this directory structure:

.
├── asd3442
├── eee2314
├── eph334
├── eph_dir
│   ├── asd330
│   ├── eph_file2
│   ├── exy123
│   └── file_with_eph
├── eph_file
├── not_eph_dir
│   ├── ephXXX
│   └── with_eph
└── not_eph_rest

You can use full globs to filter full path results with pathlib and a generator for the top level directory:

i_want=(fn for fn in Path(path_to).glob('*') if not fn.match('**/*/eph*'))

>>> list(i_want)
[PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'), PosixPath('/tmp/test/not_eph_rest'), PosixPath('/tmp/test/not_eph_dir')]

The pathlib method match uses globs to match a path object; the glob '**/*/eph*' is any full path that leads to a file with a name starting with 'eph'.

Alternatively, you can use the .name attribute with name.startswith('eph'):

i_want=(fn for fn in Path(path_to).glob('*') if not fn.name.startswith('eph'))

If you want only files, no directories:

i_want=(fn for fn in Path(path_to).glob('*') if fn.is_file() and not fn.match('**/*/eph*'))
# [PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'), PosixPath('/tmp/test/not_eph_rest')]
 

The same method works for recursive globs:

i_want=(fn for fn in Path(path_to).glob('**/*') 
           if fn.is_file() and not fn.match('**/*/eph*'))

# [PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'),
#  PosixPath('/tmp/test/not_eph_rest'), PosixPath('/tmp/test/eph_dir/asd330'),
#  PosixPath('/tmp/test/eph_dir/file_with_eph'), PosixPath('/tmp/test/eph_dir/exy123'),
#  PosixPath('/tmp/test/not_eph_dir/with_eph')]
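
The recursive version above still returns files that live inside eph_dir. If those should be skipped too, one option is to test every path component below the root. This is a sketch of a stricter variant, not the behaviour of the code above; the helper name `has_eph_component` is made up, and the demo recreates only part of the tree in a temporary directory.

```python
from pathlib import Path
import tempfile


def has_eph_component(path, root):
    """True if any path component below `root` starts with 'eph'."""
    return any(part.startswith("eph") for part in path.relative_to(root).parts)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Recreate a small part of the tree shown above.
    (root / "eph_dir").mkdir()
    (root / "not_eph_dir").mkdir()
    for rel in ("eee2314", "eph334", "eph_dir/asd330",
                "not_eph_dir/ephXXX", "not_eph_dir/with_eph"):
        (root / rel).touch()

    kept = sorted(
        fn.relative_to(root).as_posix()
        for fn in root.glob("**/*")
        if fn.is_file() and not has_eph_component(fn, root)
    )

print(kept)
```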
dawg
0

As mentioned by the accepted answer, you can't exclude patterns with glob, so the following is a method to filter your glob result.

The accepted answer is probably the best pythonic way to do things but if you think list comprehensions look a bit ugly and want to make your code maximally numpythonic anyway (like I did) then you can do this (but note that this is probably less efficient than the list comprehension method):

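import glob

import numpy as np

data_files = glob.glob("path_to_files/*.fits")

# The exclusion globs must use the same directory prefix as data_files,
# otherwise the paths never compare equal and nothing is removed.
light_files = np.setdiff1d(data_files, glob.glob("path_to_files/*BIAS*"))
light_files = np.setdiff1d(light_files, glob.glob("path_to_files/*FLAT*"))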

(In my case, I had some image frames, bias frames, and flat frames all in one directory and I just wanted the image frames)

Ryan Farber
0

If the position of the character isn't important, for example to exclude manifest files whose names contain _ anywhere, you can combine glob with re (regular expression operations):

import glob
import re
for file in glob.glob('*.txt'):
    if re.match(r'.*\_.*', file):
        continue
    else:
        print(file)

Or, more elegantly, with a list comprehension:

filtered = [f for f in glob.glob('*.txt') if not re.match(r'.*\_.*', f)]

for mach in filtered:
    print(mach)
Milovan Tomašević
0

To exclude an exact word, you can implement a custom regex directive that you strip from the pattern (replace with an empty string) before passing it to glob.

#!/usr/bin/env python3
import glob
import re

# glob (or fnmatch) does not support exact word matching. This is custom directive to overcome this issue
glob_exact_match_regex = r"\[\^.*\]"
path = "[^exclude.py]*py"  # [^...] is a custom directive, that excludes exact match

# Process custom directive
try:  # Try to parse the exact-match directive
    exact_match = re.findall(glob_exact_match_regex, path)[0].replace('[^', '').replace(']', '')
except IndexError:
    exact_match = None
else:  # Remove custom directive
    path = re.sub(glob_exact_match_regex, "", path)
paths = glob.glob(path)
# Implement custom directive
if exact_match is not None:  # Exclude all paths with specified string
    paths = [p for p in paths if exact_match not in p]

print(paths)
0

import glob
import re

# This is a path that should be excluded
EXCLUDE = "/home/koosha/Documents/Excel"

files = glob.glob("/home/koosha/Documents/**/*.*", recursive=True)
for file in files:
    # re.escape() makes the path match literally instead of as a regex
    if re.search(re.escape(EXCLUDE), file):
        continue
    print(file)