Find files with same initial part of the name in two folders

Question

I used listdir to read the files in two folders:

from os import listdir 
list_1 = [file for file in listdir("./folder1/") if file.endswith(".csv")]
list_2 = [file for file in listdir("./folder2/") if file.endswith(".json")]

and now I have two lists:

list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']

I want to find the corresponding two sublists containing those files whose initial part of the name is the same. In other words:

list_1b = ['12_a1_pp.csv', '32_a3_pp.csv']
list_2b = ['12_a1.json', '32_a3.json']

How can I do that?

PS: note that the listdir part may not matter to answer the question. I only included it because, if the result of listdir is guaranteed to be in alphabetical order, that might help in traversing the two lists. Of course in this simple example the lists are short, but in the real use case they contain hundreds of files.

I would personally look into using something like the glob pattern matching here: https://docs.python.org/2/library/glob.html — Taku_, Apr 09 '18 at 12:38
Why don't you just remove the special substrings "_pp.csv" from list 1 and ".json" from list two and make an equality test with two nested loops? — YesThatIsMyName, Apr 09 '18 at 12:40

jpp · Answer 1 · 2018-04-09T12:53:41.387

2

This is one way using dictionary comprehensions and set.intersection.

list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']

start_1 = {k: '_'.join(k.split('_')[:-1]) for k in list_1}
start_2 = {k: k.split('.')[0] for k in list_2}

start_intersect = set(start_1.values()) & set(start_2.values())

list_1b = [k for k, v in start_1.items() if v in start_intersect]
list_2b = [k for k, v in start_2.items() if v in start_intersect]

This method works equally well if you have filenames ending in "_XY.csv" for any "XY". It relies on the format of the filename rather than the individual letters.
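
A minimal sketch of the same idea wrapped in a reusable function, assuming the shared prefix is everything before the last underscore in the CSV names and everything before the last dot in the JSON names. The names `csv_key`, `json_key` and `match_by_prefix` are hypothetical, introduced only for illustration.

```python
def csv_key(name):
    # '12_a1_pp.csv' -> '12_a1': keep what precedes the last underscore
    return name.rpartition('_')[0]

def json_key(name):
    # '12_a1.json' -> '12_a1': keep what precedes the last dot
    return name.rpartition('.')[0]

def match_by_prefix(names_1, names_2, key_1=csv_key, key_2=json_key):
    # Compute each prefix once and store it next to its filename
    keys_1 = {name: key_1(name) for name in names_1}
    keys_2 = {name: key_2(name) for name in names_2}
    common = set(keys_1.values()) & set(keys_2.values())
    # Filter the original sequences so their order is preserved
    matched_1 = [name for name in names_1 if keys_1[name] in common]
    matched_2 = [name for name in names_2 if keys_2[name] in common]
    return matched_1, matched_2

list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']
list_1b, list_2b = match_by_prefix(list_1, list_2)
print(list_1b)
print(list_2b)
```

Passing the key functions as parameters keeps the matching logic separate from the naming convention, so a different convention only needs a different key function.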

edited Apr 09 '18 at 12:53

answered Apr 09 '18 at 12:40


performance is indeed important. Could you show how to eliminate the repeated splitting? – DeltaIV Apr 09 '18 at 12:49
`'_'.join(string_list)` means to join the elements of the list of strings `string_list` using the delimiter `'_'`, correct? – DeltaIV Apr 09 '18 at 12:52
@DeltaIV, see update. And your understanding of `str.join` is correct. – jpp Apr 09 '18 at 12:53
great! Would +2 if I could :-) but shouldn't the last two lines be `list_1b = [k for k, v in list_1.items() if v in start_intersect]` and `list_2b = [k for k, v in list_2.items() if v in start_intersect]` ? – DeltaIV Apr 09 '18 at 12:57
@DeltaIV, no, that's not necessary. The dictionaries already contain all the items from your lists by construction. Also, lists have no `items` method. – jpp Apr 09 '18 at 12:57
ah, got it! `k` is the key and `v` is the item. Now I understand. – DeltaIV Apr 09 '18 at 12:59

sciroccorics · Accepted Answer · 2018-04-09T13:46:44.930

2

A more pythonic approach would use the & (intersection) operator for sets:

common = set(x[:-7] for x in list_1) & set(x[:5] for x in list_2)
list_1b = [x + '_pp.csv' for x in common]
list_2b = [x + '.json' for x in common]

EDIT: If you need to split on a specific character (see comment) for each list, here is an updated version (search for the last '_' in list_1 and search for the last '.' in list_2):

common = set(x[:x.rindex('_')] for x in list_1) & set(x[:x.rindex('.')] for x in list_2)
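
The updated line only computes `common`. One way to finish it, sketched here on the assumption that the suffixes are not fixed, is to filter the original lists against `common` instead of gluing the suffixes back on. This also keeps the original order, which a set does not guarantee. Note that `str.rindex` raises `ValueError` for a name that lacks the separator.

```python
list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']

# Prefixes present in both lists
common = set(x[:x.rindex('_')] for x in list_1) & set(x[:x.rindex('.')] for x in list_2)

# Keep the original filenames whose prefix is shared
list_1b = [x for x in list_1 if x[:x.rindex('_')] in common]
list_2b = [x for x in list_2 if x[:x.rindex('.')] in common]

print(list_1b)
print(list_2b)
```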

edited Apr 09 '18 at 13:46

answered Apr 09 '18 at 12:52


this is nice, but I don't like the fact that it relies on counting the number of characters. A `split` or similar approach would be cleaner. – DeltaIV Apr 09 '18 at 13:03
@DeltaIV: you are right, a generic solution is better than a specific one. But your OP doesn't give us any info on the variation in the filenames you may encounter. Even the split approach is undefined: should we split on the last underscore? On the second-to-last underscore? Or something else? As long as you give no details on your data, people can only provide solutions that work on your example... – sciroccorics Apr 09 '18 at 13:12
@DeltaIV: last remark, you said that performance is important for your application, so it is not very logical to favor the dictionary approach proposed by jpp, which is almost twice as slow as the approach based on pure sets (I've obtained 7.59s vs. 4.08s for 10M execs). – sciroccorics Apr 09 '18 at 13:29
concerning timings: how do you benchmark in Python? Any references? In R there is a very nice package `microbenchmark`, I don't know a corresponding module in Python. – DeltaIV Apr 09 '18 at 13:31
The easiest solution is to use IPython and its magic command %timeit. See [this link](https://stackoverflow.com/questions/29280470/what-is-timeit-in-python?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa) for additional info – sciroccorics Apr 09 '18 at 13:36
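
Outside IPython, the standard library's `timeit` module does the same job. A minimal sketch that times the pure-set approach from this answer; the function name `pure_sets` and the repetition count are arbitrary choices for illustration, and the timings will differ from machine to machine.

```python
import timeit

list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']

def pure_sets():
    # The rindex-based intersection shown in the answer above
    return set(x[:x.rindex('_')] for x in list_1) & set(x[:x.rindex('.')] for x in list_2)

# timeit.timeit accepts a callable and returns the total time in seconds
elapsed = timeit.timeit(pure_sets, number=100000)
print("%.3f s for 100000 runs" % elapsed)
```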

score 1 · Answer 3 · answered Apr 09 '18 at 12:42

list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']

list_1_C = [i.split(".")[0].replace("_pp", "") for i in list_1]     #Check List
list_2_C = [i.split(".")[0] for i in list_2]                        #Check List

print([list_1[i] for i, v in enumerate(list_1_C) if v in list_2_C])
print([list_2[i] for i, v in enumerate(list_2_C) if v in list_1_C])

Output:

['12_a1_pp.csv', '32_a3_pp.csv']
['12_a1.json', '32_a3.json']
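
With hundreds of files, `v in list_2_C` scans a list on every test. A small variation, offered as a sketch rather than a measured improvement, converts the check lists to sets first. The names `set_1_C` and `set_2_C` are introduced here and are not part of the original answer.

```python
list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']

list_1_C = [i.split(".")[0].replace("_pp", "") for i in list_1]  # check list
list_2_C = [i.split(".")[0] for i in list_2]                     # check list

# Sets give constant-time membership tests on average
set_1_C = set(list_1_C)
set_2_C = set(list_2_C)

# Pair each filename with its check value and keep the shared ones
list_1b = [f for f, v in zip(list_1, list_1_C) if v in set_2_C]
list_2b = [f for f, v in zip(list_2, list_2_C) if v in set_1_C]

print(list_1b)
print(list_2b)
```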

Kenstars · Answer 4 · 2018-04-09T12:54:17.103

This is simple when you think about it so here goes:

list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']
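list_1b, list_2b = [], []  # result lists must exist before appending
starters = [eachfile.partition(".")[0] for eachfile in list_2]
for eachelement in starters:
    for eachfile in list_1:
        # note: a plain prefix test, so "12_a1" would also match "12_a12_pp.csv"
        if eachfile.startswith(eachelement):
            list_1b.append(eachfile)
            list_2b.append(eachelement + ".json")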

Furthermore if you want specific to this case:

collective_set_1 = { each.replace("_pp.csv","") for each in list_1}
collective_set_2 = { each.replace(".json","") for each in list_2}
intersection = collective_set_1.intersection(collective_set_2)
list_1b = [ each+"_pp.csv" for each in intersection ]
list_2b = [ each+".json" for each in intersection ]
