
This is my first question. I am still new to Python, so it may be that I just didn't know how to phrase the question correctly and missed an existing answer on Stack Overflow!

What I want: to automate checking a website for changes. I want it to send me a notification every time there is a change and tell me what that change is.

So far I have two separate pieces of code that work:

  • An API call that returns a list of results in JSON format (the list always contains 30 results).
  • A diff tool that checks whether two JSON files are the same, and prints the difference if they are not.

If I run the API call by itself, it works beautifully and saves the JSON results to a file.

If I diff each pair of files one at a time, the diff code works and prints the change.

I want to make them work together - the end result being that I can set up a cron job + notification and go about my life, saving time by not checking these sites unless I know there has been a change.

My idea is that I am constantly checking the most recent pull against the last pull, and so I am storing the results in a folder.

In trying to get the different parts to work, I separated the old results from the new results into folders, then realized I'm not sure how to tell the code to differentiate between the old and the new.

I want to iterate through the folders, find the matching old file and new file pair, make each a json object, and then diff the two.

Parts of what I've tried work, but I am stuck on how to pair each old file with its new counterpart.

Here's what I'm working with:

import json
import os

new_files = []
old_files = []
docs = for_docs[0]

# save the latest results, one file per uid
# (added the .json extension so the endswith('.json') filter below matches)
for uid in uid_list:
    with open('%s_my_results.json' % uid, 'w+') as outfile:
        json.dump(docs, outfile)

# collect the filenames in each folder
# (os.walk yields (dirpath, dirnames, filenames) tuples; index 2 is the filenames)
for entry in os.walk('FILEPATH/new_files'):
    new_files.extend(entry[2])
unpack_new_files = sorted(new_files)

for entry in os.walk('FILEPATH/old_files'):
    old_files.extend(entry[2])
unpack_old_files = sorted(old_files)

os.chdir('FILEPATH/old_files')

for fname in unpack_old_files:
    if fname.endswith('.json'):
        with open(fname, mode='rb') as old_file:
            try:
                old_docs = json.load(old_file)
            except json.decoder.JSONDecodeError:
                continue


This works - but the unpacked JSON object is still an unsorted list of JSON objects, I think. So I'm definitely confused here, and trying to extricate myself from the knot.

The reason I used sorted was the hope that I could just force the files to match up in order, because they always download in the same order. I think I found that sorted was not the right tool, but I have definitely confused myself out of a solution.
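An alternative to relying on sort order is to pair files by the uid embedded in each name. A minimal stdlib-only sketch, where `extract_uid` is a hypothetical helper that assumes filenames look like `<uid>_my_results.json`:

```python
import os


def extract_uid(filename):
    # Hypothetical helper: assumes names like '<uid>_my_results.json'
    return os.path.basename(filename).split('_')[0]


def pair_by_uid(old_names, new_names):
    # Build a uid -> filename map for the old files, then look up
    # each new file's uid in that map to form (old, new) pairs.
    old_by_uid = {extract_uid(name): name for name in old_names}
    pairs = []
    for name in new_names:
        uid = extract_uid(name)
        if uid in old_by_uid:
            pairs.append((old_by_uid[uid], name))
    return pairs


print(pair_by_uid(['abc_my_results.json'],
                  ['abc_my_results.json', 'xyz_my_results.json']))
# → [('abc_my_results.json', 'abc_my_results.json')]
```

Because the lookup is keyed on the uid rather than on list position, a file that exists in only one folder is simply skipped instead of being mis-paired.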

This is the code that works to diff my JSON files:

    with open('FILEPATH/old_file.json') as f:
        old_docs = json.load(f)
    
    with open('FILEPATH/new_file.json') as fc:
        new_docs = json.load(fc)
    
    # compare the two objects
    thing = (old_docs == new_docs)
    
    # log time and result (print to the file handle instead of reassigning sys.stdout)
    with open('logfile.txt', 'a+') as logfile:
        if not thing:
            print(f'{date} this item was added:  ', file=logfile)
            print(DeepDiff(old_docs, new_docs), file=logfile)
        else:
            print(f'{date} No Change', file=logfile)

I know what I want, which is:

#for file in list: 
    # if uid in file name matches:
        # decode each file to json 
        # diff the two files 
        # spit out the result 

To that end, I started writing variations of the below, and I am definitely missing something. I found fnmatch but I am not sure how to use it.
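For reference, `fnmatch` does shell-style wildcard matching on filenames; a minimal illustration (the filenames here are made up):

```python
import fnmatch

filenames = ['abc123_my_results.json',
             'abc123_my_results.txt',
             'zzz999_my_results.json']

# fnmatch.filter keeps only the names matching the wildcard pattern
json_files = fnmatch.filter(filenames, '*.json')
print(json_files)
# → ['abc123_my_results.json', 'zzz999_my_results.json']

# fnmatch.fnmatch tests a single name against a pattern, e.g. a known uid prefix
print(fnmatch.fnmatch('abc123_my_results.json', 'abc123_*'))
# → True
```

So it can filter one folder's listing by extension or by a `<uid>_*` pattern, but it does not by itself pair files across two folders.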

for fname in folder1, folder2:
    if UID_in_filename_matches:  # I do not know how to set this up
        thing = (oldfile == newfile)
        with open('logfile.txt', 'a+') as logfile:
            if not thing:
                print(f'{date} {UID} this item was added:', file=logfile)
                print(DeepDiff(oldfile, newfile), file=logfile)
            else:
                print(f'{date} {UID} no change', file=logfile)

I hope I have done justice for my first ask. Thanks to all!

Iris D
  • Maybe try `for fname1, fname2 in file_list1, file_list2: if fname1 == fname2: {your code}`? This is assuming each uid is the file name like 'uid_num.json' from what I gathered from your code.. – Jesse_mw Jul 10 '20 at 03:40
  • *Edit* To get the file name of each file you can follow this [question](https://stackoverflow.com/questions/678236/how-to-get-the-filename-without-the-extension-from-a-path-in-python?rq=1) from Stack Overflow, and change each fname to its base filename and then do the code above. I can provide a full answer if you want a further example – Jesse_mw Jul 10 '20 at 03:46
  • Would that match the files to one another, or just pull the name? I'm not sure how I would implement. Would love to learn more if you're willing! thanks much. – Iris D Jul 10 '20 at 19:55

1 Answer


So if I understand you correctly, you have a directory structure something like this:

data_files/
├── new_data
│   ├── data_file_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json
│   ├── data_file_1970_01_03_827a591b-8d10-4f8e-b55d-5a36bdaa96d7.json
│   └── data_file_1971_01_02_18bfab97-aeb9-476e-9332-94f4bb30157b.json
└── old_data
    ├── old_name_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json
    ├── old_name_1970_01_03_827a591b-8d10-4f8e-b55d-5a36bdaa96d7.json
    └── old_name_1971_01_02_18bfab97-aeb9-476e-9332-94f4bb30157b.json

where you have folders full of JSON files that don't share an exact name but do share a UUID somewhere in the name, and you need to read in both files that have the same UUID and then run your diff program on them. I would do it something like this:

import json
import os
import re

from pprint import pprint


uuid_regex = re.compile(r'[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}')


def parse_directory(uuid_dict, filelist, key):
    for file in filelist:

        uuid_matcher = uuid_regex.findall(file)

        # check that a uuid was found in the input filename
        if (uuid_matcher):

            uuid = uuid_matcher[0]

            # dict.get returns existing sub dictionary if found, or defaults to a new dictionary
            per_uuid_subdict = uuid_dict.get(uuid, dict())
            per_uuid_subdict[key] = file

            uuid_dict[uuid] = per_uuid_subdict


old_files = [os.path.abspath(os.path.join('data_files/old_data', i)) for i in os.listdir('data_files/old_data') if i.endswith('.json')]
new_files = [os.path.abspath(os.path.join('data_files/new_data', i)) for i in os.listdir('data_files/new_data') if i.endswith('.json')]

uuid_dict = dict()
parse_directory(uuid_dict, new_files, 'new')
parse_directory(uuid_dict, old_files, 'old')

This uses a regex to pull each UUID out of the filenames and build a dictionary mapping every UUID to the files that contain it. An example of what that looks like:

pprint(uuid_dict)
# prints:
# {'18bfab97-aeb9-476e-9332-94f4bb30157b': {'new': '/path/to/file/data_files/new_data/data_file_1971_01_02_18bfab97-aeb9-476e-9332-94f4bb30157b.json',
#                                           'old': '/path/to/file/data_files/old_data/old_name_1971_01_02_18bfab97-aeb9-476e-9332-94f4bb30157b.json'},
#  '827a591b-8d10-4f8e-b55d-5a36bdaa96d7': {'new': '/path/to/file/data_files/new_data/data_file_1970_01_03_827a591b-8d10-4f8e-b55d-5a36bdaa96d7.json',
#                                           'old': '/path/to/file/data_files/old_data/old_name_1970_01_03_827a591b-8d10-4f8e-b55d-5a36bdaa96d7.json'},
#  'e24520c7-94ef-41c6-94b3-a16049b0d882': {'new': '/path/to/file/data_files/new_data/data_file_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json',
#                                           'old': '/path/to/file/data_files/old_data/old_name_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json'}}
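The regex extraction itself can be sanity-checked in isolation; the filename below is one of the example names from the listing above:

```python
import re

# Matches the canonical 8-4-4-4-12 hex UUID layout
uuid_regex = re.compile(
    r'[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-'
    r'[a-fA-F0-9]{4}-[a-fA-F0-9]{12}'
)

name = 'data_file_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json'
print(uuid_regex.findall(name))
# → ['e24520c7-94ef-41c6-94b3-a16049b0d882']
```

`findall` returns a list of every match in the string, which is why the code above takes element `[0]` after checking the list is non-empty.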

And from there it's just a matter of iterating over the results.

for uuid, filelist in uuid_dict.items():

    if len(filelist) != 2:
        # a uuid may appear in only one folder, e.g. a brand-new file
        print('Did not find exactly one old and one new file for uuid: {}'.format(uuid))
        continue

    try:
        with open(filelist['new'], 'r') as file_handler:
            new_file = json.load(file_handler)
    except json.decoder.JSONDecodeError:
        continue

    try:
        with open(filelist['old'], 'r') as file_handler:
            old_file = json.load(file_handler)
    except json.decoder.JSONDecodeError:
        continue

    # DeepDiff handling logic around hereish
    DeepDiff(old_file, new_file)
  • whoa! opening my eyes to regex! I am reading a bit more to understand it, and will report back on how it works out. Thank you so much for your response. – Iris D Jul 10 '20 at 20:38
  • this worked exceptionally well. Thank you so much. Your way is much more pythonic than mine so I am still wrapping my head around it, but I can say it definitely works. Thank you so much. – Iris D Jul 13 '20 at 22:47