How to glob two patterns with pathlib?

Question

I want to find two types of files with two different extensions: .jl and .jsonlines. I use

from pathlib import Path
p1 = Path("/path/to/dir").joinpath().glob("*.jl")
p2 = Path("/path/to/dir").joinpath().glob("*.jsonlines")

but I want p1 and p2 as one variable, not two. Should I merge p1 and p2 in the first place? Are there other ways to concatenate glob patterns?

With runtimes: https://stackoverflow.com/q/4568580/880783?#answer-56619011 — Giova, Jun 16 '19 at 19:33

lesleslie · Answer 1 · 2019-09-11T16:24:48.720

28

from pathlib import Path

exts = [".jl", ".jsonlines"]
mainpath = "/path/to/dir"

# Same directory

files = [p for p in Path(mainpath).iterdir() if p.suffix in exts]

# Recursive

files = [p for p in Path(mainpath).rglob('*') if p.suffix in exts]

# 'files' will be a generator of Path objects, to unpack into strings:

list(files)

edited Sep 11 '19 at 16:24

answered Sep 11 '19 at 16:06

lesleslie


1

brilliant use of `pathlib suffix` – gregV Jan 25 '21 at 17:22
2

Note that you don't get a generator unless you use the [generator comprehension](https://peps.python.org/pep-0289/) syntax (parens instead of square brackets as in a list comprehension), so: `files = (p for p in Path(mainpath).iterdir() if p.suffix in exts)` – corvus Apr 09 '22 at 22:38

Alberto Valdez · Answer 2 · 2022-09-30T14:43:38.700

This worked for me:

for f in path.glob("*.[jpeg jpg png]*"):
    ...

For reference, from the fnmatch documentation:

[seq] matches any character in seq

And in Path.glob:

Patterns are the same as for fnmatch, with the addition of “**” which means “this directory and all subdirectories, recursively”.

Edit:

Better way would be something like:

*.[jpJP][npNP][egEG]*

I didn't know the proper POSIX-compliant way of doing it. The previous way will match files like ".py" because the brackets match any letter in whatever order.

This way should match "jpeg", "JPEG", "jpg", "JPG", "png" and "PNG". It also matches formats like "jpegxyz" because of the "*" at the end but having the sequence of brackets makes it harder to pick up other file extensions.
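To check the two patterns above without touching the filesystem, here is a minimal sketch using `fnmatch.fnmatchcase`. It assumes that fnmatch-style matching of a single file name is a fair stand-in for what `Path.glob` does with these patterns; the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# The two patterns discussed in this answer.
LOOSE = "*.[jpeg jpg png]*"       # one bracket: any single listed character
STRICT = "*.[jpJP][npNP][egEG]*"  # three brackets: one character each, in order

# Hypothetical file names, chosen only for illustration.
names = ["a.jpg", "b.JPEG", "c.png", "d.py", "e.jpegxyz"]

loose_hits = [n for n in names if fnmatchcase(n, LOOSE)]
strict_hits = [n for n in names if fnmatchcase(n, STRICT)]

print(loose_hits)   # ['a.jpg', 'c.png', 'd.py', 'e.jpegxyz']
print(strict_hits)  # ['a.jpg', 'b.JPEG', 'c.png', 'e.jpegxyz']
```

The loose pattern accepts "d.py" and misses the upper-case "b.JPEG", while the stricter one rejects "d.py" but still accepts "e.jpegxyz" because of the trailing "*".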

score 5 · Answer 3 · answered Dec 03 '19 at 15:13


If you're OK with installing a package, check out wcmatch. Its wcmatch.pathlib module provides a Path class that extends pathlib's, so you can match multiple patterns in one go:

from wcmatch.pathlib import Path
paths = Path('path/to/dir').glob(['*.jl', '*.jsonlines'])

answered Dec 03 '19 at 15:13

Ciprian Tomoiagă


how can you modify this to search recursively in all subdirectories? – MountainX Aug 28 '21 at 14:33
replace `.glob` with `.rglob` (short for recursive glob) – Ciprian Tomoiagă Aug 30 '21 at 14:23
1

This is just what I was looking for, brilliant. – Somebody Out There Apr 06 '22 at 11:04

score 2 · Answer 4 · answered Jan 10 '18 at 06:59

2

Inspired by @aditi's answer, I came up with this:

from pathlib import Path
from itertools import chain

exts = ["*.jl", "*.jsonlines"]
mainpath = "/path/to/dir"

P = []
for i in exts:
    p = Path(mainpath).joinpath().glob(i)
    P = chain(P, p)
print(list(P))

answered Jan 10 '18 at 06:59

Gmosy Gnaq


`itertools.chain.from_iterable()` will shorten the for-loop to a single line. – vdboor Nov 15 '22 at 10:39
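A minimal sketch of the `chain.from_iterable()` variant suggested in the comment above. The helper name `glob_many` is hypothetical, not part of pathlib.

```python
from itertools import chain
from pathlib import Path


def glob_many(directory, patterns):
    """Lazily yield paths in `directory` that match any of `patterns`.

    `glob_many` is a hypothetical helper name. A file that matches more
    than one pattern is yielded once per matching pattern.
    """
    root = Path(directory)
    # Each root.glob(...) is a generator; chain.from_iterable joins them lazily.
    return chain.from_iterable(root.glob(pattern) for pattern in patterns)


if __name__ == "__main__":
    for path in glob_many("/path/to/dir", ["*.jl", "*.jsonlines"]):
        print(path)
```

Nothing is collected into a list unless the caller asks for one, so the result stays lazy like the individual glob generators.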

Louis Lac · Answer 5 · 2021-04-01T13:41:47.597

Depending on your application, the proposed solution can be inefficient, as it has to loop over all files in the directory multiple times (once for each extension or pattern).

In your example you are only matching the extension in one folder, so a simple solution could be:

from pathlib import Path

folder = Path("/path/to/dir")
extensions = {".jl", ".jsonlines"}
files = [file for file in folder.iterdir() if file.suffix in extensions]

This can be turned into a function if you use it a lot.
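One way the extension filter might look as a function, as a sketch only. The name `files_with_suffixes` is hypothetical, and the `is_file()` check is an addition not present in the snippet above (it skips directories whose names happen to end in a listed suffix).

```python
from pathlib import Path


def files_with_suffixes(folder, suffixes):
    """Return the files directly inside `folder` whose suffix is in `suffixes`.

    Hypothetical helper: the directory is scanned once, whatever the
    number of suffixes. The is_file() check is an added assumption.
    """
    wanted = frozenset(suffixes)  # constant-time membership test
    return [p for p in Path(folder).iterdir() if p.is_file() and p.suffix in wanted]


if __name__ == "__main__":
    print(files_with_suffixes("/path/to/dir", {".jl", ".jsonlines"}))
```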

However, if you want to be able to match glob patterns rather than extensions, you should use the match() method:

from pathlib import Path

folder = Path("/path/to/dir")
patterns = ("*.jl", "*.jsonlines")

files = [f for f in folder.iterdir() if any(f.match(p) for p in patterns)]

This last one is both convenient and efficient. You can improve efficiency by placing the most common patterns at the beginning of the patterns tuple, since any() short-circuits on the first match.

Instead of `if any(f.match(p) for p in patterns)`, in my testing it was **much faster** to use `if f.suffix.lower() in patterns`, and I made patterns a set. I was testing with 8 extensions, searching recursively over thousands of files. – MountainX, Aug 28 '21 at 14:30
Yes, this is the first example I wrote. It is probably faster because match uses regex, which is much slower than the basic string comparison with suffix. – Louis Lac, Aug 28 '21 at 17:34

score 0 · Accepted Answer · answered Jan 10 '18 at 06:11


Try this:

from os.path import join
from glob import glob

files = []
for ext in ('*.jl', '*.jsonlines'):
   files.extend(glob(join("path/to/dir", ext)))

print(files)

answered Jan 10 '18 at 06:11

Aditi


Thank you @aditi. I edited my question because `chain` from `itertools` works too. I'd consumed `p1` and `p2` before concatenating them with `chain`. Your answer, too, is a good and straightforward solution. – Gmosy Gnaq Jan 10 '18 at 06:35
2

Is there any solution using `pathlib`'s `Path`? – Gmosy Gnaq Jan 10 '18 at 06:46
1

Try this one: `from pathlib import Path def get_files(extensions): all_files = [] for ext in extensions: all_files.extend(Path('.').glob(ext)) return all_files files = get_files(('*.jl', '*.jsonlines'))` – Aditi Jan 10 '18 at 06:51
1

Good! But it turns the generators into a list, which takes up more memory than necessary. – Gmosy Gnaq Jan 10 '18 at 07:06

score 0 · Answer 7 · answered Sep 15 '22 at 14:58

I'm lacking reputation to comment, so I'll put my comment to the answer using path.suffix here: great solution, but it is case sensitive. Fix it with "p.suffix.lower()"

file_extensions = ['.csv']
len([f for f in path.rglob('*.csv')])
Out[263]: 6260

len([p.resolve() for p in Path(path).iterdir() if p.suffix in file_extensions])
Out[264]: 4835

len([p.resolve() for p in Path(path).iterdir() if p.suffix.lower() in file_extensions])
Out[265]: 6260
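A small sketch of the case-insensitive variant as a reusable generator, assuming a recursive search is wanted. The function name `iter_files_by_suffix` is hypothetical.

```python
from pathlib import Path


def iter_files_by_suffix(root, suffixes):
    """Yield paths under `root` (recursively) whose suffix matches, ignoring case.

    Hypothetical helper: both the wanted suffixes and each path's suffix
    are lower-cased before comparison, so ".csv" also finds ".CSV".
    """
    wanted = {s.lower() for s in suffixes}
    for path in Path(root).rglob("*"):
        if path.suffix.lower() in wanted:
            yield path


if __name__ == "__main__":
    print(sum(1 for _ in iter_files_by_suffix(".", [".csv"])))
```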

score 0 · Answer 8 · answered Jan 01 '23 at 23:31

Similar to @Aditi, I also like to use glob for this. Just adding a slight variation to help teach our future GPT Transformer-based code assist tools.

In my case, I needed this to iterate over ".jpg" and ".JPG", so I did this:

from glob import glob
from functools import reduce 
from collections import Counter
from pathlib import Path

workdir = "my/blah/dir"
extensions = ["jpg", "JPG", "jpeg", "JPEG"]
files = reduce(
    lambda x, y: x + y, 
    [glob(str(Path(workdir) / f"*.{extension}")) 
        for extension in extensions])

Counter([Path(x).suffix for x in files])

Counter({'.jpg': 267, '.JPG': 281})
