0

I have multiple directories, each containing thousands of files (10k+). Take one directory, A, with 10k files, and another directory, B, that also contains thousands of files. I'm trying to find all files that appear in both A and B and also have a particular file extension (let's say .docx). I can easily write a nested for loop, but with this many files it takes a long time. Is there a faster way to do this in Python? Can you suggest a specific algorithm or a code snippet?

Note: I know how to search for and list files in several ways. I am asking for the fastest approach; there are millions of files in total, and iterating over them again and again is expensive.

glibdud
steveJ
  • Possible duplicate of [Find all files in a directory with extension .txt in Python](https://stackoverflow.com/questions/3964681/find-all-files-in-a-directory-with-extension-txt-in-python) – jolindbe Aug 30 '18 at 13:59
  • You can use glob or simply os.listdir – Suraj Kothari Aug 30 '18 at 14:00
  • I know how to search; I am looking for the fastest approach. There are almost half a million files spread across the directories. – steveJ Aug 30 '18 at 14:00
  • @jolindbe I don't think the link you mentioned is what I am looking for. Please check my question again. – steveJ Aug 30 '18 at 14:04
  • It might help if you gave a more concrete example. Show an example of a few files in `A`, a few in `B`, and exactly what results you would expect and why. – glibdud Aug 30 '18 at 14:05
  • @glibdud I am not looking for code. I could do the same as in the link you provided, but I am trying to find the optimal way, as the number of files is in the millions. – steveJ Aug 30 '18 at 14:13
  • @steveJ I didn't provide a link. What I'm saying is that I don't know what "I have to iterate through all the files in `B` and check which all files available in `A`" means, so I don't know what exactly you're trying to do. An example might help that. Or maybe just a more thorough explanation. – glibdud Aug 30 '18 at 14:17
  • @glibdud It's somewhat like this: `for i in B: for k in A: if k.endswith('.docs'): if i==k: ` Please ignore any syntax issues; I typed this quickly. Also, `A` and `B` are directories here. I'm not going into how I traverse them with listdir or os.walk; this is just example code. – steveJ Aug 30 '18 at 14:23
  • @steveJ So you're looking for files that are in both `A` and `B` and also have a particular extension? – glibdud Aug 30 '18 at 14:26
  • Yes. Please keep in mind that both directories contain millions of files, and all of them have to be searched for matches. – steveJ Aug 30 '18 at 14:27
  • @steveJ I edited the question to get this clarification in there. Feel free to revert or modify if I missed the mark. – glibdud Aug 30 '18 at 14:33
  • @glibdud Thank You ! – steveJ Aug 30 '18 at 14:35

3 Answers

1

The canonical way to compare directories in Python appears to be filecmp.dircmp().

import filecmp
# common_files lists names that exist as files in both directories (top level only)
cmp = filecmp.dircmp('/path/to/A', '/path/to/B')
matchingfiles = [filename for filename in cmp.common_files if filename.endswith('.docx')]

I can't speak specifically to its performance, but I would assume it's implemented in a way that will be more efficient than nested for loops.
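
As a point of comparison, a single pass over each directory followed by a set intersection avoids the nested loop entirely. The following is a minimal sketch rather than a benchmarked solution. The helper name `common_files_with_ext` is made up for illustration, and it assumes that only the top level of each directory matters and that files are matched by name alone (contents are not compared).

```python
import os


def common_files_with_ext(dir_a, dir_b, ext):
    """Hypothetical helper: sorted names of files ending in `ext` found in both directories."""

    def names(path):
        # os.scandir yields entries lazily; is_file() usually needs no extra stat call
        with os.scandir(path) as entries:
            return {e.name for e in entries if e.is_file() and e.name.endswith(ext)}

    # Each directory is listed once; the intersection costs roughly
    # len(A) + len(B) operations instead of len(A) * len(B) comparisons
    return sorted(names(dir_a) & names(dir_b))
```

It would be called as `common_files_with_ext('/path/to/A', '/path/to/B', '.docx')`.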

glibdud
0

You can do something like this:

import os
[x for x in os.listdir('A') if x.endswith('.docx')]

This will select the '.docx' files in the 'A' folder.

kevh
0

Try the glob module:

import glob
glob.glob('/*')

Output (Ubuntu 18.04):

['/bin', '/boot', '/cache', '/data', '/dev', '/etc', '/home', '/init', '/lib', '/lib64', '/media', '/mnt', '/opt', '/proc', '/root', '/run', '/sbin', '/snap', '/srv', '/sys', '/tmp', '/usr', '/var']

Of course, you can glob something else:

glob.glob("*.docx")
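
If the files are nested in subdirectories, glob can also be applied to the original problem by collecting the matches under each root and intersecting them. This is only a sketch under some assumptions: the function names `docx_relpaths` and `common_relpaths` are invented here, and a file is treated as common only when it sits at the same relative path in both trees.

```python
import glob
import os


def docx_relpaths(root, pattern="*.docx"):
    """Hypothetical helper: paths relative to `root` of files matching `pattern` at any depth."""
    # '**' with recursive=True matches zero or more nested directories;
    # glob.escape protects special characters that may occur in `root`
    search = os.path.join(glob.escape(root), "**", pattern)
    matches = glob.glob(search, recursive=True)
    return {os.path.relpath(m, root) for m in matches if os.path.isfile(m)}


def common_relpaths(root_a, root_b, pattern="*.docx"):
    # Assumption: "common" means the same relative path exists in both trees
    return sorted(docx_relpaths(root_a, pattern) & docx_relpaths(root_b, pattern))
```

Each tree is walked once, so the cost grows with the total number of files rather than with the product of the two counts.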
iBug
  • Is it significantly faster? The problem is that dir `B` has, say, 10K files and folder `A` has 20K. For each file in dir `B`, I have to iterate to find whether it is available in folder `A`, and this takes a lot of computation. I mean, the files already iterated over in folder `A` get iterated again and again; with millions of files, you can imagine the number of iterations. – steveJ Aug 30 '18 at 14:11
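
The repeated iteration described in this comment is what a set lookup removes. With 10K names in `B` and 20K in `A`, the nested loop performs up to 10,000 × 20,000 = 200,000,000 comparisons, while building a set from `A` once and testing each name from `B` against it takes on the order of 30,000 operations. The sketch below works on plain lists of file names so that the two approaches can be compared directly; both function names are illustrative and not taken from any answer in this thread.

```python
def common_nested(names_a, names_b, ext):
    # Nested loops: every name in B is compared against every name in A
    return sorted(i for i in names_b for k in names_a if k.endswith(ext) and i == k)


def common_with_set(names_a, names_b, ext):
    # Build the lookup set once; each membership test is then O(1) on average
    in_a = {k for k in names_a if k.endswith(ext)}
    return sorted(i for i in names_b if i in in_a)
```

Both functions return the same names as long as each list contains no duplicates, which holds for the entries of a single directory; only the amount of work differs.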