I'm trying to print a list of all the files in a specified directory that end in .pdf. Once that is running, I want to expand it to print the number of files that are named "unnamed document" or end in .pdf.pdf.pdf, which is a problem among the 1,200 or so books I've collected in iBooks.
After it prints the .pdf files, I'd like it to trim off the excess .pdf extensions and somehow prompt me to edit each file, which I'll have to do manually after reviewing the first few pages of each "unnamed document".
While I would love all the code spelled out for me, I would appreciate hints or tips on how to go about learning to do this even more.
The directory was found on this page: https://www.idownloadblog.com/2018/05/24/ibooks-library-location-mac/
The script I started from is "Get list of pdf files in folder".
When I run it I currently get EOF errors and type errors, so I'm asking for help on how to structure or revise this script as the start of a larger data-cleaning project. While this could be done with regex, I'd prefer it done in Python, as in "Remove duplicate filename extensions".
Thanks!
First version
    #!/usr/bin/env python3
    import os

    # os.walk() does not expand "~" itself; without expanduser() the walk
    # silently visits nothing.
    library = os.path.expanduser(
        "~/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents")

    all_files = []
    for dirpath, dirnames, filenames in os.walk(library):
        for filename in [f for f in filenames if f.endswith(".pdf")]:
            # The closing parenthesis was missing here, which caused the EOF error.
            all_files.append(os.path.join(dirpath, filename))

    # print(files ending in .pdf.pdf.etc)
    # trim file names with duplicate .pdf extensions
    # print(files named "unnamed document")
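As a hint toward those TODO comments, here is a minimal sketch. It assumes the stray extensions are always literal repeats of ".pdf" and that a case-insensitive substring match is enough to spot the "unnamed document" files (both assumptions, not tested against the real library); the input() prompt stands in for the manual review step:

    # Count the "unnamed document" files (case-insensitive match is an assumption).
    unnamed = [p for p in all_files
               if "unnamed document" in os.path.basename(p).lower()]
    print("%d files named 'unnamed document'" % len(unnamed))

    # Collapse repeated ".pdf" extensions, asking before each rename.
    for path in all_files:
        new_path = path
        while new_path.endswith(".pdf.pdf"):
            new_path = new_path[:-len(".pdf")]  # drop one trailing ".pdf"
        if new_path != path:
            answer = input("Rename %s -> %s? [y/N] " % (path, new_path))
            if answer.lower() == "y":
                os.rename(path, new_path)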
End of the First version
Start of the Second version
For the second version: after reading a few other blogs, I've realized this is a relatively well-known and solved problem. The script I've switched to was found online, dates from 2013, and compares files by hash, reportedly quite quickly. The script shown below just needs the name of one or more subdirectories passed as arguments in the terminal (where you will need to run it).
    testmachine at testmachine-MacPro in ~sandbox/test
    $ python3 dupFinder.py venv $testdirectory
results in
    Duplicates Found:
    The following files are identical. The name could differ, but the content is identical
    ___________________
    venv/bin/easy_install
    venv/bin/easy_install-3.6
    ___________________
    venv/bin/pip
    venv/bin/pip3
    venv/bin/pip3.6
    ___________________
    venv/bin/python
    venv/bin/python3
    ___________________
    venv/lib/python3.6/site-packages/six.py
    venv/lib/python3.6/site-packages/pip/_vendor/six.py
    ___________________
    venv/lib/python3.6/site-packages/wq-1.1.0-py3.6-nspkg.pth
    venv/lib/python3.6/site-packages/wq.app-1.1.0-py3.6-nspkg.pth
    venv/lib/python3.6/site-packages/wq.core-1.1.0-py3.6-nspkg.pth
    venv/lib/python3.6/site-packages/wq.db-1.1.2-py3.6-nspkg.pth
    venv/lib/python3.6/site-packages/wq.io-1.1.0-py3.6-nspkg.pth
    venv/lib/python3.6/site-packages/wq.start-1.1.1-py3.6-nspkg.pth
I really like the simplicity of this script, so I'm posting it here while giving credit to [pythoncentral][1] for writing this baller code back in 2013. Six years on, it still runs flawlessly.
    # dupFinder.py
    import os, sys
    import hashlib


    def findDup(parentFolder):
        # Dups in format {hash: [names]}
        dups = {}
        for dirName, subdirs, fileList in os.walk(parentFolder):
            print('Scanning %s...' % dirName)
            for filename in fileList:
                # Get the path to the file
                path = os.path.join(dirName, filename)
                # Calculate hash
                file_hash = hashfile(path)
                # Add or append the file path
                if file_hash in dups:
                    dups[file_hash].append(path)
                else:
                    dups[file_hash] = [path]
        return dups


    # Joins two dictionaries
    def joinDicts(dict1, dict2):
        for key in dict2.keys():
            if key in dict1:
                dict1[key] = dict1[key] + dict2[key]
            else:
                dict1[key] = dict2[key]


    def hashfile(path, blocksize=65536):
        afile = open(path, 'rb')
        hasher = hashlib.md5()
        buf = afile.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(blocksize)
        afile.close()
        return hasher.hexdigest()


    def printResults(dict1):
        results = list(filter(lambda x: len(x) > 1, dict1.values()))
        if len(results) > 0:
            print('Duplicates Found:')
            print('The following files are identical. The name could differ, but the content is identical')
            print('___________________')
            for result in results:
                for subresult in result:
                    print('\t\t%s' % subresult)
                print('___________________')
        else:
            print('No duplicate files found.')


    if __name__ == '__main__':
        if len(sys.argv) > 1:
            dups = {}
            folders = sys.argv[1:]
            for i in folders:
                # Iterate the folders given
                if os.path.exists(i):
                    # Find the duplicated files and append them to dups
                    joinDicts(dups, findDup(i))
                else:
                    print('%s is not a valid path, please verify' % i)
                    sys.exit()
            printResults(dups)
        else:
            print('Usage: python dupFinder.py folder or python dupFinder.py folder1 folder2 folder3')
[1]: https://www.pythoncentral.io/finding-duplicate-files-with-python/
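To aim it at the iBooks folder from the first version instead of venv, something like this should work (the quotes matter because the path contains spaces, and the shell expands $HOME before Python ever sees it):

    python3 dupFinder.py "$HOME/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents"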
The third version and onward needs to evolve a few things.
There is room for improvement in the UI (a terminal or headless GUI) and in saving the results to a log or CSV file. Eventually, moving this to a Flask or Django app could prove beneficial. Of course, since this is a PDF document scrubber, I could create a queue of files named "unnamed document": the machine could log each file's hash, or create and save an index, so that next time it wouldn't have to scan the whole library again and could just show me the "unnamed documents" that need work (a sketch of that saved-index idea follows below). Work could be defined as scrubbing the name, deduping, finding the cover page, adding keywords, or even creating a queue file so the reader can actually read each document. Maybe there is an API for Goodreads? Any menu or GUI would need to include error handling, as well as some kind of cron job that saves results somewhere, plus the intelligence behind all this so it starts to learn the steps you are taking over time.
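A minimal sketch of that saved-index idea, assuming a JSON cache keyed by path with each file's size and mtime, so unchanged files skip re-hashing on the next run (index.json is a made-up name, and this reuses hashfile() from the script above; untested against the real library):

    import json
    import os

    INDEX_FILE = 'index.json'  # hypothetical cache file name

    def load_index():
        # Previous run's {path: {"size": ..., "mtime": ..., "hash": ...}}
        if os.path.exists(INDEX_FILE):
            with open(INDEX_FILE) as f:
                return json.load(f)
        return {}

    def cached_hash(path, index):
        # Re-hash only if the file is new or has changed since the last run;
        # comparing size and mtime first is the usual cheap shortcut.
        stat = os.stat(path)
        entry = index.get(path)
        if entry and entry['size'] == stat.st_size and entry['mtime'] == stat.st_mtime:
            return entry['hash']
        file_hash = hashfile(path)  # from dupFinder.py above
        index[path] = {'size': stat.st_size, 'mtime': stat.st_mtime,
                       'hash': file_hash}
        return file_hash

    def save_index(index):
        with open(INDEX_FILE, 'w') as f:
            json.dump(index, f)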
Ideas?