256

I am working on a script to recursively go through subfolders in a main folder and build a list of a certain file type. I am having an issue with the script. It's currently set up as follows:

for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,subFolder,item))

The problem is that the subFolder variable is pulling in a list of subfolders rather than the folder where the ITEM file is located. I was thinking of running a for loop for the subfolders beforehand and joining the first part of the path, but I figured I'd double-check to see if anyone has any suggestions before doing that.

Tomerikoo
user2709514

13 Answers

298

You should be using dirpath, which you call root. The dirnames are supplied so you can prune it if there are folders that you don't wish os.walk to recurse into.

import os
result = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames if os.path.splitext(f)[1] == '.txt']
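As an aside, the pruning mentioned above works by modifying dirnames in place; a minimal sketch (the '.git' folder name is just an illustration):

import os

for dirpath, dirnames, filenames in os.walk(PATH):
    # Prune in place: os.walk will not descend into the removed folders
    dirnames[:] = [d for d in dirnames if d != '.git']
    for f in filenames:
        if f.endswith('.txt'):
            print(os.path.join(dirpath, f))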

Edit:

After the latest downvote, it occurred to me that glob is a better tool for selecting by extension.

import os
from glob import glob
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

Also a generator version

from itertools import chain
result = (chain.from_iterable(glob(os.path.join(x[0], '*.txt')) for x in os.walk('.')))

Edit2 for Python 3.4+

from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
John La Rooy
  • 3
    '*.[Tt][Xx][Tt]' glob pattern will make the search case-insensitive. – SergiyKolesnikov Aug 29 '18 at 16:50
  • @SergiyKolesnikov, Thanks, I've used that in the edit at the bottom. Note that the `rglob` is insensitive on Windows platforms - but it's not portably insensitive. – John La Rooy Aug 29 '18 at 19:15
  • 1
    @JohnLaRooy It works with `glob` too (Python 3.6 here): `glob.iglob(os.path.join(real_source_path, '**', '*.[xX][mM][lL]'))` – SergiyKolesnikov Aug 29 '18 at 19:22
  • @Sergiy: Your `iglob` does not work for files in sub-sub folders or below. You need to add `recursive=True`. – user136036 Jan 18 '20 at 16:48
  • I don't see how `glob` "is a better tool for selecting by extension". It's slow. Like half the speed of your other solution. I did a full speed analysis here: https://stackoverflow.com/a/59803793/2441026 – user136036 Jan 18 '20 at 18:54
  • 6
    @user136036, "better" does not always mean fastest. Sometimes readability and maintainability are also important. – John La Rooy Jan 21 '20 at 00:55
  • How would you do this to match `jpg|jpeg|JPG|JPEG|png|PNG`? – Austin Apr 12 '20 at 07:24
  • AFAIK `Path` defaults to current directory so `'.'` is not necessary: `Path().rglob("foo")` – Granitosaurus Jun 03 '20 at 08:49
234

Changed in Python 3.5: Support for recursive globs using “**”.

glob.glob() got a new recursive parameter.

If you want to get every .txt file under my_path (recursively including subdirs):

import glob

files = glob.glob(my_path + '/**/*.txt', recursive=True)

# my_path/     the dir
# **/       every file and dir under my_path
# *.txt     every file that ends with '.txt'

If you need an iterator you can use iglob as an alternative:

for file in glob.iglob(my_path + '/**/*.txt', recursive=True):
    # ...
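One advantage of the iterator form is laziness; a small sketch of grabbing only the first match without building the whole list (assuming the same my_path as above):

import glob

first_txt = next(glob.iglob(my_path + '/**/*.txt', recursive=True), None)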
Rotareti
56

This seems to be the fastest solution I could come up with, and is faster than os.walk and a lot faster than any glob solution.

  • It will also give you a list of all nested subfolders at basically no cost.
  • You can search for several different extensions.
  • You can also choose to return either full paths or just the names for the files by changing f.path to f.name (do not change it for subfolders!).

Args: dir: str, ext: list.
Function returns two lists: subfolders, files.

See below for a detailed speed analysis.

import os

def run_fast_scandir(dir, ext):    # dir: str, ext: list
    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files


subfolders, files = run_fast_scandir(folder, [".jpg"])
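Since ext is a list and extensions are lowercased before the membership test, passing several extensions should work as well, e.g.:

subfolders, files = run_fast_scandir(folder, [".jpg", ".jpeg", ".png"])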

In case you need the file size, you can also create a sizes list and add f.stat().st_size like this for a display of MiB:

sizes.append(f"{f.stat().st_size/1024/1024:.0f} MiB")
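A sketch of where those lines would sit, as a hypothetical variant of the scandir loop above (the sizes list is an assumption, initialized alongside files):

subfolders, files, sizes = [], [], []

for f in os.scandir(dir):
    if f.is_dir():
        subfolders.append(f.path)
    if f.is_file():
        if os.path.splitext(f.name)[1].lower() in ext:
            files.append(f.path)
            sizes.append(f"{f.stat().st_size/1024/1024:.0f} MiB")  # file size in MiB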

Speed analysis

A comparison of various methods to get all files with a specific file extension inside all subfolders and the main folder.

tl;dr:

  • fast_scandir clearly wins and is twice as fast as all other solutions, except os.walk.
  • os.walk comes in second place, slightly slower.
  • Using glob will greatly slow down the process.
  • None of the results use natural sorting. This means results will be sorted like this: 1, 10, 2. To get natural sorting (1, 2, 10), please have a look at the sketch below.
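A minimal natural-sort sketch using only the standard library (an assumption standing in for the answer originally linked here):

import re

def natural_sort_key(s):
    # Split into digit and non-digit runs so "file10" sorts after "file2"
    return [int(t) if t.isdigit() else t.lower() for t in re.split(r'(\d+)', s)]

files.sort(key=natural_sort_key)  # assumes files holds path strings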

Results:

fast_scandir    took  499 ms. Found files: 16596. Found subfolders: 439
os.walk         took  589 ms. Found files: 16596
find_files      took  919 ms. Found files: 16596
glob.iglob      took  998 ms. Found files: 16596
glob.glob       took 1002 ms. Found files: 16596
pathlib.rglob   took 1041 ms. Found files: 16596
os.walk-glob    took 1043 ms. Found files: 16596

Updated: 2022-07-20 (Py 3.10.1 looking for *.pdf)

glob.iglob      took 132 ms. Found files: 9999
glob.glob       took 134 ms. Found files: 9999
fast_scandir    took 331 ms. Found files: 9999. Found subfolders: 9330
os.walk         took 695 ms. Found files: 9999
pathlib.rglob   took 828 ms. Found files: 9999
find_files      took 949 ms. Found files: 9999
os.walk-glob    took 1242 ms. Found files: 9999

Tests were done with W7x64, Python 3.8.1, 20 runs. 16596 files in 439 (partially nested) subfolders.
find_files is from https://stackoverflow.com/a/45646357/2441026 and lets you search for several extensions.
fast_scandir was written by myself and will also return a list of subfolders. You can give it a list of extensions to search for (I tested a list with one entry against a simple if ... == ".jpg" and there was no significant difference).


# -*- coding: utf-8 -*-
# Python 3


import time
import os
from glob import glob, iglob
from pathlib import Path


directory = r"<folder>"
RUNS = 20


def run_os_walk():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
                  os.path.splitext(f)[1].lower() == '.jpg']
    print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_os_walk_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
    print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_iglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
    print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_pathlib_rglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(Path(directory).rglob("*.jpg"))
    print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def find_files(files, dirs=[], extensions=[]):
    # https://stackoverflow.com/a/45646357/2441026

    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1].lower() in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    # https://stackoverflow.com/a/59803793/2441026

    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files



if __name__ == '__main__':
    run_os_walk()
    run_os_walk_glob()
    run_glob()
    run_iglob()
    run_pathlib_rglob()


    a = time.time_ns()
    for i in range(RUNS):
        files = []
        find_files(files, dirs=[directory], extensions=[".jpg"])
    print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")


    a = time.time_ns()
    for i in range(RUNS):
        subf, files = run_fast_scandir(directory, [".jpg"])
    print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")
not2qubit
user136036
  • Great solution, but I had one issue with it that took me a bit to figure out. Using your `fast_scandir` code, when it hits a file path beginning with a `'.'` and without an extension, such as `.DS_Store` or `.gitignore`, the `if os.path.splitext(f.name)[1].lower() in ext` will always return true, which is literally asking `if '' in '.jpg'` in your example. I recommend adding a length check (i.e. `if len(os.path.splitext(f.name)[1]) > 0 and os.path.splitext(f.name)[1].lower() in ext`). – Brandon Hunter Oct 30 '20 at 17:03
  • 1
    @BrandonHunter, it does not return True. `print( os.path.splitext(".DS_Store")[1].lower() in [".jpg"] )` -> `False`. Keep in mind `ext` is a list and not a string. – user136036 Oct 31 '20 at 18:17
  • 2
    You can eliminate the recursive nature of this function by appending `dir` to `subfolders` at the beginning of the function and then adding an outer loop that iterates over `subfolders`. This should give a very small speed improvement, especially for very deep directory structures. It also frees up the function's output in case you need to return something other than `subfolders` and `files`. Note that depending on the way you add and access elements of `subfolders`, the ordering of the output could be different. – Leland Hepworth Jun 09 '21 at 19:10
  • 1
    Looks like this is not true. Benchmarking your code snippet on a larger dataset, it takes more time than the code that uses glob. However, the code works as expected. – Dhanashri P Sep 14 '21 at 13:26
  • 3
    `glob` is now **3x** faster than *fast_scandir* when using `Py 3.10.1`. – not2qubit Jul 20 '22 at 13:46
  • 1
    Furthermore, fast_scandir actually does not run on all types of network shares, since the recursion kills the drive's capacity. Don't use it if you do serious stuff. – Simitri69 Jun 24 '23 at 08:45
32

I will translate John La Rooy's list comprehension to nested for loops, just in case anyone else has trouble understanding it.

result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

Should be equivalent to:

import glob
import os

result = []

for x in os.walk(PATH):
    for y in glob.glob(os.path.join(x[0], '*.txt')):
        result.append(y)

Here's the documentation for list comprehension and the functions os.walk and glob.glob.

Jefferson Lima
  • 1
    This answer worked for me in Python 3.7.3. `glob.glob(..., recursive=True)` and `list(Path(dir).glob(...))` did not. – miguelmorin Jun 04 '19 at 21:13
22

The new pathlib library simplifies this to one line:

from pathlib import Path
result = list(Path(PATH).glob('**/*.txt'))

You can also use the generator version:

from pathlib import Path
for file in Path(PATH).glob('**/*.txt'):
    pass

This returns Path objects, which you can use for pretty much anything, or get the file name as a string via file.name.
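For instance, a few Path attributes you might reach for (a short sketch, reusing result from above):

for p in result:
    print(p.name)    # file name, e.g. 'notes.txt'
    print(p.suffix)  # extension, e.g. '.txt'
    print(str(p))    # plain string path, for APIs that require str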

Emre
13

Your original solution was very nearly correct, but the variable "root" is dynamically updated as os.walk recursively descends the tree. os.walk() is a recursive generator: each (root, subFolder, files) tuple corresponds to one specific root, the way you have it set up.

i.e.

root = 'C:\\'
subFolder = ['Users', 'ProgramFiles', 'ProgramFiles (x86)', 'Windows', ...]
files = ['foo1.txt', 'foo2.txt', 'foo3.txt', ...]

root = 'C:\\Users\\'
subFolder = ['UserAccount1', 'UserAccount2', ...]
files = ['bar1.txt', 'bar2.txt', 'bar3.txt', ...]

...

I made a slight tweak to your code to print a full list.

import os
for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,item))
            print(fileNamePath)

Hope this helps!

EDIT (based on feedback):

OP misunderstood/mislabeled the subFolder variable, as it is actually all the subfolders in "root". Because of this, OP, you're trying to do os.path.join(str, list, str), which probably doesn't work out like you expected.

To help add clarity, you could try this labeling scheme:

import os
for current_dir_path, current_subdirs, current_files in os.walk(RECURSIVE_ROOT):
    for aFile in current_files:
        if aFile.endswith(".txt") :
            txt_file_path = str(os.path.join(current_dir_path, aFile))
            print(txt_file_path)
LastTigerEyes
9

It's not the most Pythonic answer, but I'll put it here for fun because it's a neat lesson in recursion.

import os

def find_files(files, dirs=[], extensions=[]):
    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1] in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return

On my machine I have two folders, root and root2

mender@multivax ]ls -R root root2
root:
temp1 temp2

root/temp1:
temp1.1 temp1.2

root/temp1/temp1.1:
f1.mid

root/temp1/temp1.2:
f.mi  f.mid

root/temp2:
tmp.mid

root2:
dummie.txt temp3

root2/temp3:
song.mid

Let's say I want to find all .txt and all .mid files in either of these directories; then I can just do

files = []
find_files( files, dirs=['root','root2'], extensions=['.mid','.txt'] )
print(files)

#['root2/dummie.txt',
# 'root/temp2/tmp.mid',
# 'root2/temp3/song.mid',
# 'root/temp1/temp1.1/f1.mid',
# 'root/temp1/temp1.2/f.mid']
dermen
9

You can do it this way to get back a list of files with their absolute paths.

import os


def list_files_recursive(path):
    """
    Receives a directory path as a parameter.
    :return: a list of files with their absolute paths
    """
    files = []

    # r = root, d = directories, f = files
    for r, d, f in os.walk(path):
        for file in f:
            files.append(os.path.join(r, file))

    return files


if __name__ == '__main__':

    result = list_files_recursive('/tmp')
    print(result)

WilliamCanin
4

The recursive parameter is new in Python 3.5, so it won't work on Python 2.7. Here is an example that uses raw (r'...') strings, so you just need to provide the path as-is on either Windows or Linux:

import glob

mypath=r"C:\Users\dj\Desktop\nba"

files = glob.glob(mypath + r'\**\*.py', recursive=True)
# print(files) # as list
for f in files:
    print(f) # nice looking single line per file

Note: It will list all matching files, no matter how deeply nested they are.
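If you'd rather not hard-code the Windows separator, building the pattern with os.path.join is a portable alternative (a sketch reusing the same mypath):

import glob
import os

files = glob.glob(os.path.join(mypath, '**', '*.py'), recursive=True)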

prosti
4

This function will recursively put only files into a list.

import os


def ls_files(dir):
    files = list()
    for item in os.listdir(dir):
        abspath = os.path.join(dir, item)
        try:
            if os.path.isdir(abspath):
                files = files + ls_files(abspath)
            else:
                files.append(abspath)
        except FileNotFoundError as err:
            print('invalid directory\n', 'Error: ', err)
    return files
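A minimal usage sketch (the path is just an illustration):

all_files = ls_files('/tmp')
print(all_files)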
Abhishek Boorugu
Yossarian42
4

If you don't mind installing an additional lightweight library, you can do this:

pip install plazy

Usage:

import plazy

txt_filter = lambda x: x.endswith('.txt')
files = plazy.list_files(root='data', filter_func=txt_filter, is_include_root=True)

The result should look something like this:

['data/a.txt', 'data/b.txt', 'data/sub_dir/c.txt']

It works on both Python 2.7 and Python 3.

Github: https://github.com/kyzas/plazy#list-files

Disclaimer: I'm the author of plazy.

Minh Nguyen
2

You can use the "recursive" setting within the glob module to search through subdirectories.

For example:

import glob
glob.glob('//Mypath/folder/**/*', recursive=True)

The second line returns all files within subdirectories of that folder location. (Note: you need the '**/*' string at the end of your folder string to do this.)

If you specifically wanted to find text files deep within your subdirectories, you can use

glob.glob('//Mypath/folder/**/*.txt', recursive=True)
user3376851
1

The simplest and most basic method:

import os
for parent_path, _, filenames in os.walk('.'):
    for f in filenames:
        print(os.path.join(parent_path, f))
Devymex