3

TL;DR version: I want to avoid adding duplicate Javascript objects to an array of similar objects, some of which might be really big. What's the best approach?

I have an application where I'm loading large amounts of JSON data into a Javascript data structure. While it's a bit more complex than this, assume that I'm loading JSON into an array of Javascript objects from a server through a series of AJAX requests, something like:

var myObjects = [];

function processObject(o) {
    myObjects.push(o);
}

for (var x=0; x<1000; x++) {
    $.getJSON('/new_object.json', processObject);
}

To complicate matters, the JSON:

  • is in an unknown schema
  • is of arbitrary length (probably not enormous, but could be in the 100-200 kb range)
  • might contain duplicates across different requests

My initial thought is to have an additional object to store a hash of each object (via JSON.stringify?) and check against it on each load, like this:

var myHashMap = {};

function processObject(o) {
    var hash = JSON.stringify(o);
    // is it in the hashmap?
    if (!(myHashMap[hash])) {
        myObjects.push(o);
        // set the hashmap key for future checks
        myHashMap[hash] = true;
    }
    // else ignore this object
}

but I'm worried about having property names in myHashMap that might be 200 kb in length. So my questions are:

  • Is there a better approach for this problem than the hashmap idea?
  • If not, is there a better way to make a hash function for a JSON object of arbitrary length and schema than JSON.stringify?
  • What are the possible issues with super-long property names in an object?
nrabinowitz
  • 55,314
  • 10
  • 149
  • 165
  • do you control the server? any way to add a unique ID to your objects that you can key off of? – NG. Jul 06 '11 at 21:36
  • I agree with SB. Some sort of unique key per object would make this trivial. Can the problem be rethought at the origin of the data to create such a key? If that can't be easily done, can you identify a small subject of properties of the object that uniquely identify it such if they are the same, then you can consider the object the same and make your hash out of just that subset of properties? – jfriend00 Jul 06 '11 at 21:38
  • @SB, @jfriend00 - a unique ID would make this much easier, but for various reasons it's not feasible. Assume I don't control the server and that the schema of the object is entirely black-boxed (again, it's a little more complicated, but this is essentially the case). – nrabinowitz Jul 06 '11 at 21:49

2 Answers2

4

I'd suggest you create an MD5 hash of the JSON.stringify(o) and store that in your hashmap with a reference to your stored object as the data for the hash. And to make sure that there are no object key order differences in the JSON.stringify(), you have to create a copy of the object that orders the keys.

Then, when each new object comes in, you check it against the hash map. If you find a match in the hash map, then you compare the incoming object with the actual object that you've stored to see if they are truly duplicates (since there can be MD5 hash collisions). That way, you have a manageable hash table (with only MD5 hashes in it).

Here's code to create a canonical string representation of an object (including nested objects or objects within arrays) that handles object keys that might be in a different order if you just called JSON.stringify().

// Code to do a canonical JSON.stringify() that puts object properties 
// in a consistent order
// Does not allow circular references (child containing reference to parent)
JSON.stringifyCanonical = function(obj) {
    // compatible with either browser or node.js
    var Set = typeof window === "object" ? window.Set : global.Set;

    // poor man's Set polyfill
    if (typeof Set !== "function") {
        Set = function(s) {
            if (s) {
                this.data = s.data.slice();
            } else {
                this.data = [];
            }
        };
        Set.prototype = {
            add: function(item) {
                this.data.push(item);
            },
            has: function(item) {
                return this.data.indexOf(item) !== -1;
            }
        };
    }

    function orderKeys(obj, parents) {
        if (typeof obj !== "object") {
            throw new Error("orderKeys() expects object type");
        }
        var set = new Set(parents);
        if (set.has(obj)) {
            throw new Error("circular object in stringifyCanonical()");
        }
        set.add(obj);
        var tempObj, item, i;
        if (Array.isArray(obj)) {
            // no need to re-order an array
            // but need to check it for embedded objects that need to be ordered
            tempObj = [];
            for (i = 0; i < obj.length; i++) {
                item = obj[i];
                if (typeof item === "object") {
                    tempObj[i] = orderKeys(item, set);
                } else {
                    tempObj[i] = item;
                }
            }
        } else {
            tempObj = {};
            // get keys, sort them and build new object
            Object.keys(obj).sort().forEach(function(item) {
                if (typeof obj[item] === "object") {
                    tempObj[item] = orderKeys(obj[item], set);
                } else {
                    tempObj[item] = obj[item];
                }
            });
        }
        return tempObj;
    }

    return JSON.stringify(orderKeys(obj));
}

And, the algorithm

var myHashMap = {};

function processObject(o) {
    var stringifiedCandidate = JSON.stringifyCanonical(o);
    var hash = CreateMD5(stringifiedCandidate);
    var list = [], found = false;
    // is it in the hashmap?
    if (!myHashMap[hash] {
        // not in the hash table, so it's a unique object
        myObjects.push(o);
        list.push(myObjects.length - 1);    // put a reference to the object with this hash value in the list
        myHashMap[hash] = list;             // store the list in the hash table for future comparisons
    } else {
        // the hash does exist in the hash table, check for an exact object match to see if it's really a duplicate
        list = myHashMap[hash];             // get the list of other object indexes with this hash value
        // loop through the list
        for (var i = 0; i < list.length; i++) {
            if (stringifiedCandidate === JSON.stringifyCanonical(myObjects[list[i]])) {
                found = true;       // found an exact object match
                break;
            }
        }
        // if not found, it's not an exact duplicate, even though there was a hash match
        if (!found) {
            myObjects.push(o);
            myHashMap[hash].push(myObjects.length - 1);
        }
    }
}

Test case for jsonStringifyCanonical() is here: https://jsfiddle.net/jfriend00/zfrtpqcL/

jfriend00
  • 683,504
  • 96
  • 985
  • 979
  • is there a good, fast Javascript MD5 implementation that you know of? – nrabinowitz Jul 06 '11 at 21:50
  • There's a jQuery plug-in: http://plugins.jquery.com/project/md5. Google shows many other options. I don't know which are better. – jfriend00 Jul 06 '11 at 21:57
  • Sorry, that was a bit of a "please Google it for me" question. This looks like a good option - I like the two-layer dupe check. – nrabinowitz Jul 06 '11 at 22:00
  • For future reference: http://stackoverflow.com/questions/1655769/fastest-md5-implementation-in-javascript – nrabinowitz Jul 06 '11 at 22:03
  • I just realized this routine has a flaw in it when there's a hash collision. I'll update it. – jfriend00 Jul 07 '11 at 00:36
  • Good call. I think this could be simplified a little bit - I think it's fine memory-wise to store a reference to the actual object in `list`, instead of its position in the object array, as long as I'm not trying to free up the memory before the end of the script; and you could just `return` if we find an exact match. But otherwise, this looks very solid - thanks! – nrabinowitz Jul 07 '11 at 01:08
  • If you assume `{a: 5, b: 6}` and `{b: 6, a: 5}` are identical, then **JSON.stringify** is not reliable for checking their identicalness. Because `JSON.stringify({a: 5, b: 6}) === JSON.stringify({b: 6, a: 5})` evaluates to false. – hswner Dec 23 '15 at 15:56
  • @hswner - You are correct. I fixed that issue with a `jsonStringifyCanonical()` function that creates a copy of the object in a predictable key order. – jfriend00 Dec 23 '15 at 21:17
  • @jfriend00 thumbs up! nice fix. but indeed, there is still one more thing to improve. Take those two objects: `var x = {x: 9, y: 10, z: 11, a: [{m: 1, n: 2}, {k: 5}]};` `var y = {y: 10, z: 11, a: [{k: 5}, {n: 2, m: 1}], x: 9};` if a value associated with a key is an array then we need to sort the values inside of that array. But, if any of the those values is object (not an array), then we should sort it's keys, then check the values whether they are arrays or object, and so on recursively. I tried to hack ur fiddle but couldn't come up to a solution. I'll check back later when I have time. – hswner Dec 24 '15 at 13:35
  • @hswner - Arrays themselves should not be sorted. Two objects that contain an array whose items are in a different order are NOT the same. Order is relevant in arrays. I see that I do need to traverse arrays and find any objects in them to apply the transformation to them. I will work on that later today. – jfriend00 Dec 24 '15 at 16:39
  • @hswner - I updated my answer to handle objects embedded in arrays. And, I also added circular reference detection to prevent infinite loops. – jfriend00 Dec 24 '15 at 20:59
  • @jfriend00 yep, you're right that order of elements in an array matters. in some cases I use arrays to hold objects where ordering does not make sense. That's why I thought it would be good to have an option for such cases. Thanks for your kind efforts! – hswner Dec 25 '15 at 11:08
2
  1. Maybe. For example if You know what kind object goes by You could write better indexing and searching system than JS objects' keys. But You could only do that with JavaScript and object keys are written in C...
  2. Must Your hashing be lossless or not? If can than try to lose compression (MD5). I guessing You will lose some speed and gain some memory. By the way, do JSON.stringify(o) guarantees same key ordering. Because {foo: 1, bar: 2} and {bar: 2, foo: 1} is equal as objects, but not as strings.
  3. Cost memory

One possible optimization:

Instead of using getJSON use $.get and pass "text" as dataType param. Than You can use result as Your hash and convert to object afterwards.

Actually by writing last sentence I though about another solution:

  • Collect all results with $.get into array
  • Sort it with buildin (c speed) Array.sort
  • Now You can easily spot and remove duplicates with one for

Again different JSON strings can make same JavaScript object.

petraszd
  • 4,249
  • 1
  • 20
  • 12
  • Good point about the key order.`JSON.stringify` doesn't guarantee the key order, but I'm pretty sure key order will be the same for identical JSON strings - so as long as the server provides identical strings for duplicate data, which I think is the case, I should be okay. – nrabinowitz Jul 06 '11 at 22:22