dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]
# one of subdirs contain huge number of files
files = [os.path.join(file, f) for file in subdirs for f in os.listdir(file)]

The code ran smoothly the first few times, finishing in under 30 seconds, but across repeated runs of the same code the time increased to 11 minutes, and now it does not finish even in 11 minutes. The problem is in the final line (the files comprehension), and I suspect os.listdir.

EDIT: I just want to read the file paths so they can be passed as arguments to a multiprocessing function. RAM is not an issue either: there is ample RAM, and the program uses less than a tenth of it.
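To make the intent concrete, here is a minimal sketch of how the list is meant to be used; process_file is a hypothetical stand-in for the real worker function:

import os
from multiprocessing import Pool

def process_file(path):
    # hypothetical per-file work; replace with the real processing
    return os.path.getsize(path)

if __name__ == "__main__":
    dir_ = "/path/to/folder/with/huge/number/of/files"
    subdirs = [os.path.join(dir_, name) for name in os.listdir(dir_)]
    files = [os.path.join(sub, f) for sub in subdirs for f in os.listdir(sub)]
    with Pool() as pool:
        results = pool.map(process_file, files)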

Mann
  • If the number of files is **huge** you may be constrained by RAM. Use whatever tools you have available to you to monitor memory usage. Also explain what you're trying to do here. os.walk() may be more appropriate – DarkKnight Jan 05 '23 at 11:54
  • Updated. RAM is not the issue, and os.listdir was working perfectly in the first few runs – Mann Jan 05 '23 at 11:58
  • If the directory is not growing over time, my suspicion would be that the multiprocessing whatchamacallit is the actual problem and grows to eat up system resources, RAM or otherwise. – tripleee Jan 05 '23 at 12:14
  • First, repeatedly parsing the same directory tree seems highly inefficient. You should probably consider moving the files once read to a staging area, so the next search of the tree has far fewer files to trawl. Looping through a list of directories (even using comprehensions) is always going to be slow (30s is slow, though why it degrades I do not know), but a single call to glob (with `*/*/*`) is likely to be *significantly* faster; see the sketch after these comments. https://stackoverflow.com/questions/7159607/list-directories-with-a-specified-depth-in-python – MatBailie Jan 05 '23 at 12:18
    *"The code ran smoothly first few times"* - is this ***all*** your code? What happens in the rest of it? Are you by any chance creating files? Maybe in a loop? That gets exponentially bigger and hence future runs take longer... – Tomerikoo Jan 05 '23 at 12:19
  • @Tomerikoo yes, this was the only code, just retrieval. It happened again today: it ran in 30s, and it took 10 mins when I posted this issue – Mann Jan 06 '23 at 18:50
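A minimal sketch of the single-call glob approach MatBailie suggests, assuming the files sit exactly one subdirectory below dir_ (the pattern depth would need adjusting for other layouts):

import glob
import os

dir_ = "/path/to/folder/with/huge/number/of/files"

# one glob call matches every entry two levels below dir_;
# the isfile() filter drops any nested directories
files = [p for p in glob.glob(os.path.join(dir_, "*", "*"))
         if os.path.isfile(p)]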

1 Answer


The likely cause is that os.listdir(dir_) returns a list containing the names of all entries in the directory, and it builds that entire list before returning. This can take a long time if a directory contains a huge number of files or if the system is under heavy load.

Instead, use the method below, or use os.walk().

dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]

# Create an empty list to store the file paths
files = []

for subdir in subdirs:
    # Use os.scandir() to iterate over the files and directories in the subdirectory
    with os.scandir(subdir) as entries:
        for entry in entries:
            # Check if the entry is a regular file
            if entry.is_file():
                # Add the file path to the list
                files.append(entry.path)
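os.scandir() returns a lazy iterator of os.DirEntry objects, and entry.is_file() can usually answer from information cached during the directory scan, avoiding an extra stat() call per file. The os.walk() alternative mentioned above could look like this minimal sketch, which collects every file at any depth below dir_ (since Python 3.5, os.walk() is itself built on os.scandir()):

import os

dir_ = "/path/to/folder/with/huge/number/of/files"
files = []

# os.walk() descends the tree lazily, yielding one directory's
# contents per iteration instead of building one giant list up front
for root, dirnames, filenames in os.walk(dir_):
    files.extend(os.path.join(root, name) for name in filenames)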

Myth
  • [`os.listdir`](https://docs.python.org/3/library/os.html#os.listdir) - *"Return a list containing the names of the entries in the directory given by path"* – Tomerikoo Jan 05 '23 at 12:16
  • Loops like this are going to be ssssssllllllooooowwwwww – MatBailie Jan 05 '23 at 12:23