I have a pretty peculiar problem on my hands. I'm not too experienced with Python (my language of choice being Swift for mobile development), but for this project I have to pull some CSV files from a database, download them locally and upload them to Amazon's DynamoDB.

I have managed to get everything working: the program downloads the CSV file as a zip, extracts it using zipfile, converts the CSV file to a JSON file, and then begins uploading the JSON to DynamoDB.

However, these CSV files contain around 100,000 rows each, and re-uploading every item each time makes no sense when only 5-10 items change in the CSV file daily. So what I've decided to do is, before uploading the new JSON to DynamoDB, get the program to compare the new JSON to the old JSON, pick out only the new items, and upload those.

Now, on to the actual problem. What I've been attempting is this:

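import json

# Raw strings keep the backslashes literal ("\n" in "\newfile.json" would otherwise be a newline).
with open(r"C:\Users\Me\Desktop\staff\oldfile.json") as json1:
    list1 = json.load(json1)
with open(r"C:\Users\Me\Desktop\staff\newfile.json") as json2:
    list2 = json.load(json2)

# Convert each dictionary to its string form so it can be stored in a set.
set_1 = set(repr(x) for x in list1)
set_2 = set(repr(x) for x in list2)

# Items present in the new file but not in the old one.
differences = set_2 - set_1
print(differences)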

This actually works pretty well: the result is set() if the two sets are identical, and otherwise contains only the new items.

However, I have noticed that when I convert the CSV file to JSON, the order of the keys within an object can differ between the two files. For example, in the first JSON file an object might be:

[{"name": "jack", "id": "3100", "photo": "http://imagesdatabase.com/is/image/jack/I_063017263_50_20141112", "category": "male employees", "commissions": "4500", "department": "Beauty > Skincare", "department_id": "709010788", "store_id": "", "additional duties": "5", "spreadsheet": "http://spreadsheetdatabase.com/previpew/01/32100/88/07/709310788.csv", "description": "Jack is a talented young man, has worked with us for over three years and, although initially starting slowly, has worked his way up to becoming the top earner of the month several times.", "join_date": "12/5/2008", "mornings": "YES", "staff_link": "http://staffdatabase.com/244234/654", "show": "NO", "retailers_id": "6017263", "head_id": "2909", "products_sold": "Skincare", "commissions_report": "http://commissionsdatabase.com/jck1/2453"}]

This same object in the new json file might be:

[{"id": "3100", "name": "jack", "photo": "http://imagesdatabase.com/is/image/jack/I_063017263_50_20141112", "category": "male employees", "commissions": "4500", "department": "Beauty > Skincare", "department_id": "709010788", "store_id": "", "additional duties": "5", "spreadsheet": "http://spreadsheetdatabase.com/previpew/01/32100/88/07/709310788.csv", "description": "Jack is a talented young man, has worked with us for over three years and, although initially starting slowly, has worked his way up to becoming the top earner of the month several times.", "join_date": "12/5/2008", "mornings": "YES", "staff_link": "http://staffdatabase.com/244234/654", "show": "NO", "retailers_id": "6017263", "head_id": "2909", "products_sold": "Skincare", "commissions_report": "http://commissionsdatabase.com/jck1/2453"}]

These are both still the same object, no?

But when I try to compare these two using python sometimes I get set(), and sometimes it tries to tell me that it's a new object - what's happening?

Honestly, I've been troubleshooting this for almost a whole day now and I'm pretty much at my wit's end - I really can't understand why it would work when I run it once, and then not the next time with the same exact json objects. Any help would be greatly appreciated!

dan martin
  • You are converting *dictionaries* to strings. Dictionary order depends on the order of insertions and deletions and is subject to randomisation of hashes. – Martijn Pieters Aug 14 '16 at 15:35
  • I've named those variables pretty badly - string1 and string 2 are actually lists. How would I go about loading the json file as a dictionary? – dan martin Aug 14 '16 at 15:36
  • `string1` and `string2` may be lists, but their **contents** are dictionaries. – Martijn Pieters Aug 14 '16 at 15:37
  • Why are you converting those dictionaries to strings in the first place? If you want to find unique dictionaries, you'll have to use `repr(sorted(x.items()))` to eliminate ordering issues. Or just store `tuple(sorted(x.items()))`. – Martijn Pieters Aug 14 '16 at 15:38
  • I see, so I should be attempting to load the json as a dictionary, and then convert those to sets and compare them? Does that sound plausible in any way? - Thanks – dan martin Aug 14 '16 at 15:38
  • @danmartin Try this: https://paste.fedoraproject.org/408222/47118921/ – Nehal J Wani Aug 14 '16 at 15:40
  • @NehalJWani: that requires the JSON files to have been written with a specific ordering too. You can't rely on that either, unless OrderedDict was used to generate them. – Martijn Pieters Aug 14 '16 at 15:41
  • @MartijnPieters Ah, yes. You are right. – Nehal J Wani Aug 14 '16 at 15:42
  • I tried the snippet of code you provided, but unfortunately it's still exhibiting the same behavior. Thanks, though. – dan martin Aug 14 '16 at 15:45
  • @danmartin: then provide some sample input with which to reproduce your issue. – Martijn Pieters Aug 14 '16 at 15:49
  • There are a few things that strike me as odd here. 1. Why are your JSON objects stored in a list? Is there more than one JSON object in your files? 2. Comparing dicts works out of the box just fine, so why are you converting them to strings using repr()? This causes your bug, because dictionary comparison doesn't depend on the dictionaries' order, but string comparison does. – Luca Fülbier Aug 14 '16 at 15:50
  • @LucaFülbier: Why is storing multiple mappings in a list odd? JSON top-level objects can be lists. – Martijn Pieters Aug 14 '16 at 15:51
  • @LucaFülbier: you can't store a dictionary in a set, because dictionaries are not immutable. – Martijn Pieters Aug 14 '16 at 15:51
  • @MartijnPieters I think I am trying to understand what the OP is trying to achieve here. I was confused, because his sample data was just a single dictionary put in a list, which seemed unnecessary. My second question doesn't really make sense in this context anymore, as hashing is a lot faster than dict comparison when applied to a lot of dictionaries. – Luca Fülbier Aug 14 '16 at 16:01

1 Answer

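Your code relies on the ordering of dictionaries, because repr() lists a dictionary's keys in iteration order. In Python versions before 3.6, that order depends on insertion and deletion history and varies between interpreter runs thanks to hash randomisation; newer versions preserve insertion order, but that can still differ between your two files. Either way, it should not be relied upon.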
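
As a minimal illustration of the difference, here is a sketch using two made-up dictionaries (hypothetical sample values, not your real data). It compares them directly, through repr(), and through repr() of their sorted items:

```python
# Two dictionaries with the same key-value pairs, inserted in a different order.
a = {"name": "jack", "id": "3100"}
b = {"id": "3100", "name": "jack"}

# Dictionary equality ignores ordering.
same_dict = (a == b)

# repr() follows iteration order, so equal dictionaries can give different strings.
same_repr = (repr(a) == repr(b))

# Sorting the items first gives one canonical string per dictionary.
same_sorted_repr = (repr(sorted(a.items())) == repr(sorted(b.items())))

print(same_dict, same_repr, same_sorted_repr)
```

The first and third comparisons are true regardless of key order; the second is the one your set difference depends on, which is why equal objects can show up as "new".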

If your dictionaries are not nested, you can store them in sets as tuples of their key-value pairs, sorted:

set_1 = set(tuple(sorted(x.items())) for x in list1)
set_2 = set(tuple(sorted(x.items())) for x in list2)

This creates an immutable representation that retains the original key-value pairing but avoids any issues with ordering. These tuples can trivially be fed back into the dict() type to re-create the dictionary.
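
Putting the pieces together, a rough sketch of the whole comparison could look like this. The helper names `canonical` and `new_items` and the sample records are made up for illustration, and it assumes flat dictionaries whose values are all hashable (strings, as they would be when they come from a CSV file):

```python
def canonical(record):
    """Return an order-independent, hashable form of a flat dictionary."""
    return tuple(sorted(record.items()))


def new_items(old_records, new_records):
    """Return the dictionaries present in new_records but not in old_records."""
    old_set = {canonical(r) for r in old_records}
    new_set = {canonical(r) for r in new_records}
    # dict() accepts the sorted (key, value) tuples and rebuilds each dictionary.
    return [dict(items) for items in new_set - old_set]


# Hypothetical sample data standing in for the two loaded JSON lists.
list1 = [{"name": "jack", "id": "3100"}, {"name": "jill", "id": "3101"}]
list2 = [{"id": "3100", "name": "jack"}, {"id": "3102", "name": "joe"}]

changed = new_items(list1, list2)
print(changed)
```

Here only the "joe" record is reported, even though "jack" appears with a different key order in the second list. If the dictionaries were nested, the tuples would no longer be hashable; in that case `json.dumps(x, sort_keys=True)` is one way to get an order-independent string instead.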

Martijn Pieters