0

I'm trying to consider the best way of running through a directory tree and identifying specific excel files, then moving on to manipulating them in pandas.

I tried to identify the files I want by scanning for the names of the files, (data) but I realised it would be far more effective if I was able to identify files by their authors. How would I be able to redo the example below from searching 'data' to searching for the author of the file?

I added file.lower() in my example as some files might contain Data or DATA in the file name. If there is a better way of doing this, and if there is a good resource for learning more about manipulating files as described in my post, I would be grateful to hear about it.

import os
import shutil
for folderName, subfolders, filenames in os.walk(r'dir\Documents'):

        for file in filenames:
                file.lower()
                if 'data' in file:
                        try: shutil.copy(os.path.join(folderName, file), 'C:\\dir\ALL DATA')
                        except: 
                                print(folderName, file)
Iwan
  • 309
  • 1
  • 6
  • 17
  • lower() doesn't change a string in place, but returns a string. So you should do something like lower_file = file.lower() Also, do all of the excel files have an author specified in the file itself? – LeopoldVonBuschLight Aug 15 '17 at 19:46
  • In the example above, you probably wanted `if "data" if file.lower():` – Alexander Aug 15 '17 at 19:47

2 Answers2

4

This will search the directory for excel files that have the creator_name in the excel file metadata:

import os
import zipfile, lxml.etree
import shutil

def find_excel_files(directory, creator_name):
    for folderName, subfolders, filenames in os.walk(directory):
        for f in filenames:
            if f.endswith('.xlsx'):
                f_path = os.path.join(folderName, f)
                f_creators = get_xlsx_creators(f_path)
                if creator_name in f_creators:
                    # One of the creators of the excel file matches creator_name
                    # Do something like copy the file somewhere...
                    print('Found a match: {}'.format(f_path))



def get_xlsx_creators(xlsx_path):
    # open excel file (xlsx files are just zipfiles)
    zf = zipfile.ZipFile(xlsx_path)
    # use lxml to parse the xml file we are interested in
    doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))
    # retrieve creator
    ns={'dc': 'http://purl.org/dc/elements/1.1/'}
    creators = doc.xpath('//dc:creator', namespaces=ns)
    return [c.text for c in creators]



find_excel_files(r'C:\Users\ayb\Desktop\tt', 'Tim')

The code for the get_xlsx_creators function was taken from this SO answer: How to retrieve author of a office file in python?

1
from pwd import getpwuid

for file in filenames:
    author = getpwuid(os.stat(file).st_uid).pw_name
    if author == '...':
        ...
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • 1
    Unless the OP meant the `Author` in the document metadata? – AChampion Aug 15 '17 at 19:53
  • True, but that field is generally not very reliable from my experience. It gets set when the file is created. If others modify the original and save, that metadata does not get updated. – Alexander Aug 15 '17 at 19:56
  • I think the OP is on Windows (C:\\dir\ALL DATA). Is pwd available on Windows? – LeopoldVonBuschLight Aug 15 '17 at 20:06
  • 1
    @LeopoldVonBuschLight, pwd is not available in Windows. Also, Python's custom `stat` implementation on Windows doesn't define `st_uid`. The equivalent code on Windows would call `GetNamedSecurityInfo` and `LookupAccountSid` via either PyWin32's win32security module or low-level ctypes. But I think the OP wants the metadata author, anyway. – Eryk Sun Aug 15 '17 at 20:15