
I'm trying to print a list of all the files in the specified directory that end in .pdf.

Once that is running I want to expand it to print out the number of files that are named "unnamed document" or end in .pdf.pdf.pdf, which is a problem among the 1,200 or so books I've collected in iBooks.

After it prints out the .pdf files, I'm trying to get it to trim off the excess .pdf extensions and somehow prompt me to edit each file, which I'll have to do manually after reviewing the first few pages of each "unnamed document".

While I would love all the code spelled out for me, I would more appreciate hints or tips on how to go about learning how to do this.

The directory was found from this page. https://www.idownloadblog.com/2018/05/24/ibooks-library-location-mac/

The script I started from is here: Get list of pdf files in folder

When I run this I currently get EOF errors and type errors, so I'm asking for help on how to structure or revise this script as the start of a larger data-cleaning project.

While this can be done with regex, I'd prefer it done in plain Python (see: Remove duplicate filename extensions).

Thanks!

First version

#!/usr/bin/env python3

import os

all_files = []
# os.walk does not expand "~", so the path needs os.path.expanduser
for dirpath, dirnames, filenames in os.walk(os.path.expanduser("~/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents")):
    for filename in [f for f in filenames if f.endswith(".pdf")]:
        all_files.append(os.path.join(dirpath, filename))  # the missing ")" here was the cause of the EOF error

# print (files ending in .pdf.pdf.etc)

# trim file names with duplicate .pdf names

# print(files named "unnamed document")

End of the First version
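The trim step flagged in the TODO comments of the first version can be sketched in pure Python, with no regex, as the question prefers (`trim_pdf_extensions` is a hypothetical helper name):

```python
def trim_pdf_extensions(name):
    # Strip one ".pdf" at a time while the name still ends in a doubled
    # ".pdf.pdf", so "book.pdf.pdf.pdf" collapses down to "book.pdf".
    while name.endswith(".pdf.pdf"):
        name = name[:-4]
    return name

print(trim_pdf_extensions("unnamed document.pdf.pdf.pdf"))  # -> unnamed document.pdf
```

Looping (rather than a single `endswith` check) matters here, because names like `.pdf.pdf.pdf.pdf` carry more than one extra copy of the extension.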

Start of the Second version

For the second version, I've realized after reading a few other blogs that this is a relatively well-known and solved problem. The script I've switched to was found online, dates from 2013, and uses hashes to compare the files (quite quickly, it claims). As shown below, it just needs the name of a subdirectory passed on the command line in the terminal (where you will need to run it), then press enter.

testmachine at testmachine-MacPro in ~sandbox/test 
$ python3 dupFinder.py venv $testdirectory

results in

Duplicates Found:
The following files are identical. The name could differ, but the content is identical
___________________
        venv/bin/easy_install
        venv/bin/easy_install-3.6
___________________
        venv/bin/pip
        venv/bin/pip3
        venv/bin/pip3.6
___________________
        venv/bin/python
        venv/bin/python3
___________________
        venv/lib/python3.6/site-packages/six.py
        venv/lib/python3.6/site-packages/pip/_vendor/six.py
___________________
        venv/lib/python3.6/site-packages/wq-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.app-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.core-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.db-1.1.2-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.io-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.start-1.1.1-py3.6-nspkg.pth

I really like the simplicity of this script, so I'm posting it here with credit to [pythoncentral][1] for writing this baller code back in 2013. Six years on, it still runs flawlessly.

# dupFinder.py
import os, sys
import hashlib


def findDup(parentFolder):
    # Dups in format {hash:[names]}
    dups = {}
    for dirName, subdirs, fileList in os.walk(parentFolder):
        print('Scanning %s...' % dirName)
        for filename in fileList:
            # Get the path to the file
            path = os.path.join(dirName, filename)
            # Calculate hash
            file_hash = hashfile(path)
            # Add or append the file path
            if file_hash in dups:
                dups[file_hash].append(path)
            else:
                dups[file_hash] = [path]
    return dups


# Joins two dictionaries
def joinDicts(dict1, dict2):
    for key in dict2.keys():
        if key in dict1:
            dict1[key] = dict1[key] + dict2[key]
        else:
            dict1[key] = dict2[key]


def hashfile(path, blocksize=65536):
    afile = open(path, 'rb')
    hasher = hashlib.md5()
    buf = afile.read(blocksize)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(blocksize)
    afile.close()
    return hasher.hexdigest()


def printResults(dict1):
    results = list(filter(lambda x: len(x) > 1, dict1.values()))
    if len(results) > 0:
        print('Duplicates Found:')
        print('The following files are identical. The name could differ, but the content is identical')
        print('___________________')
        for result in results:
            for subresult in result:
                print('\t\t%s' % subresult)
            print('___________________')

    else:
        print('No duplicate files found.')


if __name__ == '__main__':
    if len(sys.argv) > 1:
        dups = {}
        folders = sys.argv[1:]
        for i in folders:
            # Iterate the folders given
            if os.path.exists(i):
                # Find the duplicated files and append them to the dups
                joinDicts(dups, findDup(i))
            else:
                print('%s is not a valid path, please verify' % i)
                sys.exit()
        printResults(dups)
    else:
        print('Usage: python dupFinder.py folder or python dupFinder.py folder1 folder2 folder3')


  [1]: https://www.pythoncentral.io/finding-duplicate-files-with-python/

From the third version forward, a few things need to evolve.

There is room for improvement in the UI (a terminal or headless GUI) and in saving the results to a log or CSV file. Eventually moving this to a Flask or Django app could prove beneficial. Since this is a PDF document scrubber, I could create a queue of files named "unnamed document"; the machine could log each file's hash or build a saved index, so that next time it wouldn't re-scan everything but would just show me the "unnamed documents" that need work. "Work" could mean scrubbing the name, deduping, finding the cover page, adding keywords, or even building a queue file so the reader can actually read each document. Maybe there is an API for Goodreads? Any menu or GUI would need error handling, plus some kind of cron job that saves its results somewhere, as well as the intelligence behind it all so it starts to learn the steps you take over time.

Ideas ?

1 Answer


I would suggest you have a look at the pathlib library.

  • To list out all the files with extension pdf
from pathlib import Path

# glob() does not expand "~", so expanduser() is needed; without it the list comes back empty
pdf_files = list(map(str, Path("~/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents").expanduser().glob("**/*.pdf")))

print(pdf_files)
  • To trim the extra .pdf extensions (looping, since names like .pdf.pdf.pdf carry more than one extra copy)
def trim(name):
    while name.endswith('.pdf.pdf'):
        name = name[:-4]
    return name

pdf_files = [trim(x) for x in pdf_files]
– abhilb
  • Thanks abhilb. If I am reading your second segment of code "to trim extra .pdf" correctly: pdf_files is set to a list where each x has the trailing .pdf removed when the name ends in a doubled .pdf extension; else, it keeps looking through the rest of the pdf files in x for x in the directory? – CoffeeBaconAddict Jan 02 '20 at 00:44
  • I put your code into pycharm, sublime and the terminal line by line and it only outputs a " [ ] ". Do I need to somehow escape the binary writing of the object or would it write easier to a csv file? – CoffeeBaconAddict Jan 02 '20 at 00:48
  • No change in the file. Here is a reprint from the terminal of the files. ADMISSION INTAKE FORM.pdf.pdf ADVANCED ARTIFICIAL INTELLIGENCE.pdf.pdf.pdf.pdf ADVANCED_MACHINE_LEARNING_WITH_PYTHON.pdf.pdf.pdf.pdf – CoffeeBaconAddict Jan 02 '20 at 08:58
  • Is the first part of the answer correct? Are you getting the names of all the pdf files? – abhilb Jan 02 '20 at 09:00
  • $ python3 pdf_scrubber.py returns a pair of list brackets '[ ]'. I even went into the terminal, pulled pwd and added the whole PATH to the string for no change in output. If it is generating [ ] this is kinda positive. Shouldn't we be adding something to it like how to format the output as a string or a dictionary something? The "list(map(str,Path(" part should do this but we are obviously missing something in the formatting or accessing the file. We could test this idea by changing the script to output the list to a CSV file? – CoffeeBaconAddict Jan 02 '20 at 22:26
  • I went a little further and found an old CSV writer and put that in, the same result is an empty csv. The directory ( i just rechecked ) is filled with hundreds of files that look like these files below. unnamed document.pdf.pdf.pdf unnamed document.pdf.pdf.pdf.pdf.pdf.pdf – CoffeeBaconAddict Jan 02 '20 at 23:57
  • I've been working on this again, and once I removed the ~ from the path it is now showing the files, but there are over 605 files in the directory. I want it to count all the .pdf files, trim excess .pdf.pdf down to one, and change .pdf.icloud to .pdf. – CoffeeBaconAddict Feb 17 '20 at 06:34
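Pulling the last comment's three goals together (count the .pdf files, trim excess .pdf.pdf, change .pdf.icloud to .pdf), a sketch under those assumptions: `cleaned_name` is a hypothetical helper, and the actual rename is left commented out so the first run is a dry run you can review.

```python
from pathlib import Path

# iBooks library path from the question; glob/rglob do not expand "~",
# so expanduser() is required (this is why the list came back empty).
BOOKS = Path("~/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents").expanduser()

def cleaned_name(name):
    # Per the comment above: ".pdf.icloud" placeholders become ".pdf",
    # then repeated ".pdf.pdf..." runs collapse to a single ".pdf".
    if name.endswith(".pdf.icloud"):
        name = name[:-len(".icloud")]
    while name.endswith(".pdf.pdf"):
        name = name[:-4]
    return name

if BOOKS.exists():
    pdfs = [p for p in BOOKS.rglob("*") if p.name.endswith((".pdf", ".pdf.icloud"))]
    print("Found %d files" % len(pdfs))
    for p in pdfs:
        new_name = cleaned_name(p.name)
        if new_name != p.name:
            print("%s -> %s" % (p.name, new_name))
            # p.rename(p.with_name(new_name))  # uncomment only after checking the dry run
```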