20

I want to find two types of files with two different extensions: .jl and .jsonlines. I use

from pathlib import Path
p1 = Path("/path/to/dir").joinpath().glob("*.jl")
p2 = Path("/path/to/dir").joinpath().glob("*.jsonlines")

but I want p1 and p2 as one variable, not two. Should I merge p1 and p2 after the fact, or are there other ways to combine several glob patterns into one call?

Gmosy Gnaq

8 Answers

28
from pathlib import Path

exts = [".jl", ".jsonlines"]
mainpath = "/path/to/dir"

# Same directory

files = [p for p in Path(mainpath).iterdir() if p.suffix in exts]

# Recursive

files = [p for p in Path(mainpath).rglob('*') if p.suffix in exts]

# 'files' is a list of Path objects; to convert them to strings:

[str(p) for p in files]
lesleslie
  • 1
    brilliant use of `pathlib suffix` – gregV Jan 25 '21 at 17:22
  • 2
    Note that you don't get a generator unless you use the [generator comprehension](https://peps.python.org/pep-0289/) syntax (parens instead of square brackets as in a list comprehension), so: `files = (p for p in Path(mainpath).iterdir() if p.suffix in exts)` – corvus Apr 09 '22 at 22:38
6

This worked for me:

for f in path.glob("*.[jpeg jpg png]*"):
    ...

As a reference fnmatch:

[seq] matches any character in seq

And in Path.glob:

Patterns are the same as for fnmatch, with the addition of “**” which means “this directory and all subdirectories, recursively”.

Edit:

Better way would be something like:

*.[jpJP][npNP][egEG]*

I didn't know the proper POSIX-compliant way of doing it. The previous pattern will also match files like ".py", because a bracket group matches a single character from the set, not a whole word, so the order of the letters inside it is irrelevant.

This way should match "jpeg", "JPEG", "jpg", "JPG", "png" and "PNG". It also matches formats like "jpegxyz" because of the "*" at the end but having the sequence of brackets makes it harder to pick up other file extensions.
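To make the difference between the two patterns concrete, here is a small sketch that checks a few made-up file names with `fnmatch.fnmatchcase`. It is used here only as a stand-in for how `Path.glob` reads bracket groups. The sample names are hypothetical, and `fnmatchcase` is always case sensitive, whereas the case handling of `Path.glob` can depend on the platform.

```python
from fnmatch import fnmatchcase

# First attempt: ONE bracket group, i.e. a single character taken from
# the set {j, p, e, g, space, n}, followed by anything.
loose = "*.[jpeg jpg png]*"
# Edited version: three bracket groups, one per character position.
tight = "*.[jpJP][npNP][egEG]*"

# Hypothetical file names, for illustration only.
names = ["a.jpg", "b.JPEG", "c.png", "d.py", "e.jpegxyz", "f.json"]

loose_hits = [n for n in names if fnmatchcase(n, loose)]
tight_hits = [n for n in names if fnmatchcase(n, tight)]

print(loose_hits)  # includes d.py and f.json, misses b.JPEG
print(tight_hits)  # drops d.py and f.json, still accepts e.jpegxyz
```

The tighter pattern still accepts any extension that merely starts with three matching characters, so filtering on `Path.suffix` is the safer choice when exact extensions matter.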

5

If you're OK with installing a package, check out wcmatch. Its wcmatch.pathlib module provides a drop-in replacement for pathlib's Path whose glob() accepts a list of patterns, so you can run multiple matches in one go:

from wcmatch.pathlib import Path
paths = Path('path/to/dir').glob(['*.jl', '*.jsonlines'])
Ciprian Tomoiagă
2

Inspired by @aditi's answer, I came up with this:

from pathlib import Path
from itertools import chain

exts = ["*.jl", "*.jsonlines"]
mainpath = "/path/to/dir"

# Lazily chain one glob generator per pattern
paths = chain.from_iterable(Path(mainpath).glob(ext) for ext in exts)
print(list(paths))
Gmosy Gnaq
2

Depending on your application, the proposed solution can be inefficient, as it has to loop over all files in the directory multiple times (once for each extension/pattern).

In your example you are only matching the extension in one folder, a simple solution could be:

from pathlib import Path

folder = Path("/path/to/dir")
extensions = {".jl", ".jsonlines"}
files = [file for file in folder.iterdir() if file.suffix in extensions]

This can be turned into a function if you use it a lot.
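A minimal sketch of such a function, under a few assumptions that are not in the snippet above: the name `files_with_suffix` and the `recursive` flag are hypothetical, the `is_file()` check is an addition to skip directories, and the result is sorted only to make the output predictable.

```python
from pathlib import Path


def files_with_suffix(folder, suffixes, recursive=False):
    """Return the files under `folder` whose suffix is in `suffixes`.

    Hypothetical helper: the name, the `recursive` flag and the
    is_file() filter are assumptions made for this sketch.
    """
    wanted = set(suffixes)  # set membership is a cheap O(1) test
    root = Path(folder)
    candidates = root.rglob("*") if recursive else root.iterdir()
    # Case-sensitive comparison, exactly like the list comprehension above.
    return sorted(p for p in candidates if p.is_file() and p.suffix in wanted)
```

It would be called as `files_with_suffix("/path/to/dir", {".jl", ".jsonlines"})`, and the directory is walked only once regardless of how many extensions are requested.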

However, if you want to be able to match glob patterns rather than extensions, you should use the match() method:

from pathlib import Path

folder = Path("/path/to/dir")
patterns = ("*.jl", "*.jsonlines")

files = [f for f in folder.iterdir() if any(f.match(p) for p in patterns)]

This last one is both convenient and efficient. You can improve efficiency further by placing the most common patterns first in patterns, since any() short-circuits on the first match.
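If building a full list is not needed, the same `match()` test can be wrapped in a generator so that paths are produced one at a time. This is only a sketch: the function name `iter_matching` is hypothetical and not part of pathlib.

```python
from pathlib import Path


def iter_matching(folder, patterns):
    """Lazily yield entries of `folder` that match any glob pattern.

    Hypothetical helper name; it relies only on Path.iterdir() and
    Path.match() from the standard library.
    """
    patterns = tuple(patterns)  # allow any iterable, iterate it many times
    for path in Path(folder).iterdir():
        # any() stops at the first pattern that matches this path
        if any(path.match(pattern) for pattern in patterns):
            yield path
```

Usage looks like `for f in iter_matching("/path/to/dir", ("*.jl", "*.jsonlines")): ...`, and nothing is read from the directory until the loop starts.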

Louis Lac
  • Instead of `if any(f.match(p) for p in patterns)`, in my testing it was **much faster** to use `if f.suffix.lower() in patterns`, with patterns made a set. I was testing with 8 extensions, searching recursively over thousands of files. – MountainX Aug 28 '21 at 14:30
  • Yes, this is the first example I wrote. It is probably faster because match uses regex, which is much slower than the basic string comparison with suffix. – Louis Lac Aug 28 '21 at 17:34
0

Try this:

from os.path import join
from glob import glob

files = []
for ext in ('*.jl', '*.jsonlines'):
    files.extend(glob(join("path/to/dir", ext)))

print(files)
Aditi
  • Thank you @aditi. I edited my question because `chain` from `itertools` works too. I'd consumed `p1` and `p2` before I concatenated them with `chain`. Your answer, too, is a good and straightforward solution. – Gmosy Gnaq Jan 10 '18 at 06:35
  • 2
    Is there any solution using `pathlib`'s `Path`? – Gmosy Gnaq Jan 10 '18 at 06:46
  • 1
    Try this one: `from pathlib import Path def get_files(extensions): all_files = [] for ext in extensions: all_files.extend(Path('.').glob(ext)) return all_files files = get_files(('*.jl', '*.jsonlines'))` – Aditi Jan 10 '18 at 06:51
  • 1
    Good! But it turns the generators into a list, which takes up more memory than necessary. – Gmosy Gnaq Jan 10 '18 at 07:06
0

I lack the reputation to comment, so I'll put my comment on the answer that uses path.suffix here: great solution, but it is case sensitive. Fix it with "p.suffix.lower()"

file_extensions = ['.csv']
len([f for f in path.rglob('*.csv')])
Out[263]: 6260

len([p.resolve() for p in Path(path).iterdir() if p.suffix in file_extensions])
Out[264]: 4835

len([p.resolve() for p in Path(path).iterdir() if p.suffix.lower() in file_extensions])
Out[265]: 6260
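The effect can be reproduced without touching the file system. The sketch below uses `PurePosixPath` and a handful of made-up names (they are assumptions for illustration, not the files counted above) to compare the case-sensitive and case-insensitive suffix tests.

```python
from pathlib import PurePosixPath

file_extensions = {".csv"}

# Hypothetical names; pure paths never access the disk.
names = [PurePosixPath(n) for n in ("a.csv", "b.CSV", "c.Csv", "d.txt")]

# suffix keeps the original case, so only the exact ".csv" matches
case_sensitive = [p.name for p in names if p.suffix in file_extensions]

# lower-casing the suffix first also accepts ".CSV" and ".Csv"
case_insensitive = [p.name for p in names if p.suffix.lower() in file_extensions]

print(case_sensitive)
print(case_insensitive)
```

For this to work, the entries in `file_extensions` must themselves be lower case.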
0

Similar to @Aditi, I also like to use glob for this. Just adding a slight variation to help teach our future GPT Transformer-based code assist tools.

In my case, I needed this to iterate over ".jpg" and ".JPG", so I did this,

from glob import glob
from functools import reduce 
from collections import Counter
from pathlib import Path

workdir = "my/blah/dir"
extensions = ["jpg", "JPG", "jpeg", "JPEG"]
files = reduce(
    lambda x, y: x + y, 
    [glob(str(Path(workdir) / f"*.{extension}")) 
        for extension in extensions])

Counter([Path(x).suffix for x in files])

Counter({'.jpg': 267, '.JPG': 281})
HeyWatchThis