0

How to get Absolute file path within a specified directory and ignore dot(.) directories and dot(.)files

I have below solution, which will provide a full path within the directory recursively,

Help me with the fastest way of list files with full path and ignore .directories/ and .files to list

(Directory may contain 100 to 500 millions files )

import os

def absoluteFilePath(directory):
    for dirpath,_,filenames in os.walk(directory):
        for f in filenames:
            yield os.path.abspath(os.path.join(dirpath, f))


for files in absoluteFilePath("/my-huge-files"):
    #use some start with dot logic ? or any better solution

Example:

/my-huge-files/project1/file{1..100} # Consider all files from file1 to 100
/my-huge-files/.project1/file{1..100} # ignore .project1 directory and its files (Do not need any files under .(dot) directories)
/my-huge-files/project1/.file1000 # ignore .file1000, it is starts with dot 
Pejman Kheyri
  • 4,044
  • 9
  • 32
  • 39
Vidya
  • 547
  • 1
  • 10
  • 26
  • see this https://stackoverflow.com/questions/13454164/os-walk-without-hidden-folders/25246828. It may not 100% fit your need since walk goes to sub folders of hidden folders, but implement your own recursive version is simple, just go back to os.listdir – Bing Wang Mar 01 '21 at 05:29

2 Answers2

0

os.walk by definition visits every file in a hierarchy, but you can select which ones you actually print with a simple textual filter.

for file in absoluteFilePath("/my-huge-files"):
    if '/.' not in file:
        print(file)

When your starting path is already absolute, calling os.path.abspath on it is redundant, but I guess in the great scheme of things, you can just leave it in.

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • Privately I'm wondering whether you really need absolute paths; many people are convinced they need them because they simply don't understand the difference between absolute and relative paths. Maybe see https://stackoverflow.com/questions/31435921/difference-between-and – tripleee Mar 01 '21 at 05:47
0

Don't use os.walk() as it will visit every file
Instead, fall back to .scandir() or .listdir() and write your own implementation

You can use pathlib.Path(test_path).expanduser().resolve() to fully expand a path

import os
from pathlib import Path

def walk_ignore(search_root, ignore_prefixes=(".",)):
    """ recursively walk directories, ignoring files with some prefix
        pass search_root as an absolute directory to get absolute results
    """
    for dir_entry in os.scandir(Path(search_root)):
        if dir_entry.name.startswith(ignore_prefixes):
            continue
        if dir_entry.is_dir():
            yield from walk_ignore(dir_entry, ignore_prefixes=ignore_prefixes)
        else:
            yield Path(dir_entry)

You may be able to save some overhead with a closure, coercing to Path once, yielding only .name, etc., but that's really up to your needs

Also not to your question, but related to it; if the files are very small, you'll likely find that packing them together (several files in one) or tuning the filesystem block size will see tremendously better performance

Finally, some filesystems come with bizarre caveats specific to them and you can likely break this with oddities like symlink loops

ti7
  • 16,375
  • 6
  • 40
  • 68