I have code that generates a Python list, and I sort that list with these lines:

mylist = list_files(MAIN_DIR)
print(sorted(mylist, key=lambda x: str(x.split('\\')[-1][:-4].replace('_',''))))

Now the list is sorted as desired. How can I split this main list into sublists based on similar PDF names? Have a look at the snapshot to see exactly what I mean.

Now I can loop through the keys and values of the grouped dictionary

for el in mylist:
    file_name = el.split('\\')[-1].split('.')[0]
    if file_name not in grouped_files.keys():
        grouped_files[file_name] = []
    grouped_files[file_name].append(el)

for key, value in grouped_files.items():
    pdfs = value
    merger = PdfFileMerger()
    for pdf in pdfs:
        merger.append(pdf)

    merger.write(OUTPUT_DIR / f'{key}.pdf')
    merger.close()

But I got an error: AttributeError: 'WindowsPath' object has no attribute 'write'

YasserKhalil
  • When is the criterion "similar" fulfilled? Can you create a minimal working example? – Natan Mar 19 '22 at 21:32
  • In the snapshot, I marked three sublists to show what I mean. The main folder contains subfolders, and each subfolder has PDF files. The same PDF names often appear in several subfolders, so I thought of sorting the list and then creating sublists from the main list, e.g. [['...\\1.pdf', '...\\1.pdf'], ['...\\2.pdf', '...\\2.pdf', '...\\2.pdf']] – YasserKhalil Mar 19 '22 at 21:34
  • So you want to group by last char if said char is a number? – Natan Mar 19 '22 at 21:41
  • Not exactly, by the stem file name as solved by both Richard Dodson & paul-shuvo – YasserKhalil Mar 19 '22 at 21:44

2 Answers

Here's a small example using groupby to create a dict of lists.

from itertools import groupby

mylist = ['..\\1\\1.pdf', '..\\2\\2.pdf', '..\\1\\1.pdf', '..\\2\\2.pdf']

key=lambda x: str(x.split('\\')[-1][:-4].replace('_',''))

results = {}
for filename, grouped in groupby(sorted(mylist, key=key), key=key):
    results[filename] = list(grouped)

results is a dictionary whose keys are the same values that were used to sort. Depending on what you want to take into consideration, you can use a different key function to derive different groupings. One thing to note when using groupby is that it only groups consecutive items, so to get a single group per key you need to sort by that key first. The other thing to note is that grouped is a lazy iterator, not a list. This is more efficient, but if you want a list, you can call list(grouped) to convert it, as shown.
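To illustrate why the sort matters, here is a minimal sketch that runs groupby on the same data with and without sorting. The sample paths and the helper name `stem` are made up for the illustration and are not from the question.

```python
from itertools import groupby

# Hypothetical sample paths, deliberately interleaved so equal names are not adjacent.
paths = ['..\\a\\1.pdf', '..\\a\\2.pdf', '..\\b\\1.pdf', '..\\b\\2.pdf']

def stem(path):
    """Return the file name without its directory or 4-character extension."""
    return path.split('\\')[-1][:-4]

# Without sorting, groupby starts a new group every time the key changes,
# so '1' and '2' each show up twice.
unsorted_groups = [(k, list(g)) for k, g in groupby(paths, key=stem)]

# Sorting by the same key first puts equal keys next to each other.
sorted_groups = [(k, list(g)) for k, g in groupby(sorted(paths, key=stem), key=stem)]

print(len(unsorted_groups))  # 4
print(len(sorted_groups))    # 2
```

If the unsorted result were stored in a dict, the later group for each key would silently overwrite the earlier one, which is easy to miss.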

>>> import pprint
>>> pprint.pprint(results)
{'1': ['..\\1\\1.pdf', '..\\1\\1.pdf'], '2': ['..\\2\\2.pdf', '..\\2\\2.pdf']}
>>>

It can also be solved with a defaultdict. This doesn't require having to sort anything first.

from collections import defaultdict

mylist = ['..\\1\\1.pdf', '..\\2\\2.pdf', '..\\1\\1.pdf', '..\\2\\2.pdf']

key=lambda x: str(x.split('\\')[-1][:-4].replace('_',''))

results = defaultdict(list)

for filename in mylist:
   results[key(filename)].append(filename)


Which results in

>>> import pprint
>>> pprint.pprint(results)
defaultdict(<class 'list'>,
            {'1': ['..\\1\\1.pdf', '..\\1\\1.pdf'],
             '2': ['..\\2\\2.pdf', '..\\2\\2.pdf']})

Using defaultdict means you can look up a key that is not in the dictionary yet, in this case the same sort key, and get a default value back instead of a KeyError. Here it is set to create an empty list whenever it sees a new key, so we can append to the value every time.
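As a small side-by-side sketch of that difference, the snippet below tries the same append on a plain dict and on a defaultdict. The variable names `plain`, `auto` and `error_name` are only for illustration.

```python
from collections import defaultdict

plain = {}
try:
    plain['1'].append('..\\1\\1.pdf')   # key '1' does not exist yet
    error_name = None
except KeyError as exc:
    error_name = type(exc).__name__

auto = defaultdict(list)
auto['1'].append('..\\1\\1.pdf')        # list() is called to create the missing value

print(error_name)   # KeyError
print(dict(auto))   # {'1': ['..\\1\\1.pdf']}
```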

  • Thanks again. I didn't realize you had updated the answer. As for using defaultdict, how can I get the second item? I tried `pprint.pprint(results[1])` but it returns an empty list – YasserKhalil Mar 20 '22 at 04:39
  • I could loop through it `for item in results:` then `print(results[item])`. Thanks a lot. – YasserKhalil Mar 20 '22 at 04:43
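A likely reason `results[1]` came back empty is that the keys are strings such as `'1'`, not positions or integers. The sketch below uses made-up sample paths to show this, and also that indexing a defaultdict with a missing key quietly adds that key.

```python
from collections import defaultdict

results = defaultdict(list)
for path in ['..\\1\\1.pdf', '..\\2\\2.pdf', '..\\1\\1.pdf']:
    results[path.split('\\')[-1][:-4]].append(path)

by_string = results['1']   # existing key: both paths named 1.pdf
by_int = results[1]        # missing key: defaultdict inserts it and returns []

print(len(by_string))      # 2
print(by_int)              # []
print(1 in results)        # True, the lookup itself added the key

# .get() looks a key up without inserting anything.
print(results.get(2))      # None
```

To take the second group by position rather than by name, something like `list(results.values())[1]` works, since dictionaries keep insertion order in Python 3.7+.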

I'm guessing you want directory paths grouped by file names. If that's the case, my approach will be something like the following:

mylist = list_files(MAIN_DIR)

grouped_files = {}
for el in mylist:
    file_name = el.split('\\')[-1].split('.')[0]
    if file_name not in grouped_files:
        grouped_files[file_name] = []
    grouped_files[file_name].append(el)
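A possible variant of the same loop, sketched with `pathlib` instead of manual string splitting. The sample list is hypothetical, and `PureWindowsPath` is used only so that backslash paths are parsed the same way on any operating system.

```python
from pathlib import PureWindowsPath

# Hypothetical sample paths standing in for list_files(MAIN_DIR).
mylist = ['..\\A\\1.pdf', '..\\B\\1.pdf', '..\\A\\my.report.pdf']

grouped_files = {}
for el in mylist:
    # .stem drops only the final suffix, so 'my.report.pdf' becomes 'my.report',
    # whereas split('.')[0] would cut it down to 'my'.
    file_name = PureWindowsPath(el).stem
    # setdefault creates the empty list on first sight of a name.
    grouped_files.setdefault(file_name, []).append(el)

print(sorted(grouped_files))  # ['1', 'my.report']
```

Whether the full stem or only the part before the first dot is the right grouping key depends on how the files are named.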
paul-shuvo
  • You could also avoid the if statement by using a defaultdict – Ben Grossmann Mar 19 '22 at 21:39
  • Amazing. That's exactly what I am trying to do. But can you show how to loop through the dictionary items? I intend to take the output filename from the key, and the value will be the list of files that I have to merge into one PDF file. – YasserKhalil Mar 19 '22 at 21:40
  • You may want to use something like `for key, values in grouped_files.items():`. This [link](https://stackoverflow.com/questions/3294889/iterating-over-dictionaries-using-for-loops) might help. – paul-shuvo Mar 19 '22 at 21:46
  • @paul-shuvo I have edited the question and added my code. Please have a look. I am not sure of the error that appeared. It's supposed to merge similar pdf names into one pdf and store it in `OUTPUT` folder `OUTPUT_DIR = BASE_DIR / 'OUTPUT'`. And the main folder `MAIN_DIR = BASE_DIR / 'MAIN'` – YasserKhalil Mar 19 '22 at 22:32
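On the `AttributeError`: one plausible cause, assuming an older PyPDF2 release, is that `merger.write()` only recognises a plain `str` as a file name and treats anything else, including a `WindowsPath`, as an already open file object, then calls `.write()` on it. A minimal sketch of a workaround is to open the file yourself and hand the handle over. The helper name `write_merged` is made up for this example; `merger` can be any object with a `write(fileobj)` method.

```python
from pathlib import Path

def write_merged(merger, out_path):
    """Write a merger's output through an open binary file handle.

    `merger` is any object exposing write(fileobj), e.g. a PdfFileMerger.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
    with open(out_path, 'wb') as fh:
        merger.write(fh)  # pass a real file object instead of a Path
    return out_path
```

In the loop from the question this would be called as `write_merged(merger, OUTPUT_DIR / f'{key}.pdf')`. Under the same assumption, converting the path with `merger.write(str(OUTPUT_DIR / f'{key}.pdf'))` should also work.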