12

I have a JSON file with 28 million strings; the file is roughly 15 GB. Some of the strings are duplicates, so I need to create a new JSON file with just the unique ones. I'm guessing 240 million of them are unique, but I need to find out the exact number. The strings are all less than 100 characters. Here is an example of the data:

[
    "4zWMS2IHAKcsrVtrUBFXIFjkwvbiCyiK",
    "btFqRsglI1Dh81jpgmnRhKPGIBbe2cU7",
    "8Us6mE6zWfyOpjhXsJssE65LrOFc7yr6",
    ...
]

My first approach was to create a JavaScript object and set all of the keys of the object to the strings. Then I would check the number of keys, and that would be my unique count. Unfortunately, I ran into a limit: JavaScript objects can only have ~8M keys.

My next approach was to create a new array in JavaScript, iterate through my strings, and use the .indexOf method to check whether each string had already been added to the array. Unfortunately, this is way too slow.
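
For reference, the two attempts described above look roughly like this (a reconstruction for illustration, not the asker's actual code; `strings` stands for the parsed array):

// Attempt 1: use object keys for uniqueness, then count them.
// The asker reports this hits an engine limit at around 8 million keys.
const seen = {};
for (const s of strings) {
    seen[s] = true;
}
console.log(Object.keys(seen).length);

// Attempt 2: linear scan with .indexOf; O(n^2) overall and far too slow at this scale.
const unique = [];
for (const s of strings) {
    if (unique.indexOf(s) === -1) {
        unique.push(s);
    }
}
console.log(unique.length);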

Can anyone think of a way I can do this in JavaScript? I am also OK with switching to a different language if this is not the right tool for the job.

OneCricketeer
Kirk Ouimet
  • 2
    Does this file contain nested values? Could you include a small example of how it's structured? – Konrad Nov 05 '22 at 17:11
  • 3
    The biggest problem is that JSON isn't the right tool for the job. You might want to consider storing the data in a different way. – Ivar Nov 05 '22 at 17:13
  • 1
    @KonradLinkowski updated my question with the example data – Kirk Ouimet Nov 05 '22 at 17:14
  • 5
    I would just ignore the first and last lines and use `uniq` in bash https://stackoverflow.com/questions/23740545/how-to-print-only-the-unique-lines-in-bash, although `sort` will be slow – Konrad Nov 05 '22 at 17:16
  • I ***highly*** doubt sort will work on this. – luk2302 Nov 05 '22 at 17:18
  • 4
    I would suggest partitioning the data into smaller chunks (e.g. by bucketing based on the string or a hash of it, basically taking the hash mod 25 or even more partitions), storing them on disk (either in a single file each or in multiple files) and then deduplicating the smaller chunks individually. – luk2302 Nov 05 '22 at 17:19
  • You can also go line by line and store unique ones to a new file and then check if the next one is already in the file – Konrad Nov 05 '22 at 17:20
  • I'm questioning why you would want to dedupe 250 million strings with an expected 240 million strings being unique. That's like.. 4% - still an interesting problem to solve though. – Marco Nov 05 '22 at 17:20
  • Looks like someone made a BigMap class. Maybe that will work. It looks like it creates another Map once the key limit is reached. Also checks all the maps before inserting a new value. https://gist.github.com/josephrocca/44e4c0b63828cfc6d6155097b2efc113 – sissonb Nov 05 '22 at 17:24
  • 2
    @sissonb Ahh.. it's great to see `throw new Error("haven't implemented this yet");` in code that is 2+ years old. – Marco Nov 05 '22 at 17:25
  • Do you have a system that can hold this amount of data in memory or double the data if we are conservative? Is this primarily coding constrained, e.g. the size of primitive maps. Or is this actually resource constrained because you want an algorithm that can also handle 2.5 billion strings or 25 billion, ... or only have a system with 10GB of RAM currently!? – luk2302 Nov 05 '22 at 17:27
  • @marco-a I don't think those methods are getting implemented. But it has the methods needed to solve this problem. – sissonb Nov 05 '22 at 17:27
  • 7
    If you can import this into a rdbms, most have direct support for json and could run a count(distinct) on a rowset. – Stu Nov 05 '22 at 20:07
  • 2
    Does it have to be JS? I'd recommend using [`jq`](https://stedolan.github.io/jq/), a simple `jq 'unique' input.json` should do the job. Assuming the data does not fit into your RAM, use a RDBMS, which are tailored to automatically use appropriate on-disk algorithms if the data is too large for in-memory processing. – Bergi Nov 06 '22 at 15:05
  • If you really want to use JavaScript, the right approach would be to construct a `Set` from the input – Bergi Nov 06 '22 at 15:07
  • 2
    Title and initial description says "28 million" strings, but then you say you expect that to have about "240 million" uniques. Did you mean "28 billion" for the input size? – Matthew Herbst Nov 11 '22 at 19:40
  • 1
    `sort | uniq` in a shell script absolutely positively works on this dataset. And it would have worked on a computer with 1 megabyte of RAM. (And lots of disk space). See Volume 3 of Donald Knuth's "The Art of Computer Programming." The output, of course, won't be clean JSON. – O. Jones Nov 12 '22 at 11:33

6 Answers

5

Computers are insane. I was inspired by the comments to shard out the keys to many objects so I would not hit the 8M key limit in JavaScript. I then used GitHub Copilot to write this code in about 30 seconds, just using comments and having Copilot write the rest:

// NodeFileSystem is assumed to be Node's built-in fs module
const NodeFileSystem = require('fs');

// Find the files we want to count
let files = NodeFileSystem.readdirSync('data/imported');
let keyArray = [];
let keyArrayLength = 0;

// Add all of the keys to an array
for(let file of files) {
    if(file.endsWith('.json')) {
        console.log('Parsing file', file);
        let data = JSON.parse(NodeFileSystem.readFileSync('data/imported/'+file));

        // Add the data item.key to the array
        for(let item of data) {
            keyArray.push(item.key);
            keyArrayLength++;
        }
    }
}
console.log('Total array length:', keyArrayLength);

// JavaScript will only allow us to have 8 million keys in an object, so we need to shard the array
// into several objects, using the first characters of each key
let characterCountToUseForSharding = 2;

// An object to store the sharded objects
let shardedObjects = {};

// Loop through the key array
let processedCount = 0;
for(let key of keyArray) {
    processedCount++;
    if(processedCount % 1000000 === 0) {
        let processCountWithCommas = processedCount.toLocaleString();
        console.log('Processed', processCountWithCommas, 'of', keyArrayLength.toLocaleString());
    }

    // Get the first characterCountToUseForSharding characters of the key
    let shardingKey = key.substring(0, characterCountToUseForSharding);

    // If the sharded object doesn't exist, create it
    if(!shardedObjects[shardingKey]) {
        shardedObjects[shardingKey] = {};
        // console.log('Created sharded object', shardingKey);
    }

    // Add the key to the sharded object
    shardedObjects[shardingKey][key] = true;
}

// Count the keys in each sharded object
let total = 0;
for(let shardingKey in shardedObjects) {
    let keyCount = Object.keys(shardedObjects[shardingKey]).length;
    console.log('Sharding key', shardingKey, 'has', keyCount, 'keys');
    total += keyCount;
}

// Log the total
console.log('Total keys:', keyArrayLength);
console.log('Total unique keys:', total);

// Percentage
let percentage = (total / keyArrayLength) * 100;
console.log('Unique percentage:', percentage);

// Duplicate keys
let duplicateKeys = keyArrayLength - total;
console.log('Duplicate keys:', duplicateKeys);

Here you can see the output and speed:

https://www.youtube.com/watch?v=RurzxMjbazg

I am running it on an M1 MacBook Pro and am just amazed at the speed. The fan did not even turn on.

Kirk Ouimet
  • 3
    Your video shows an array length of `28335100` which is 28 million, not even close to 250 million? – Marco Nov 05 '22 at 17:46
  • 1
    I thought that node would run out of memory, but I guess you have a lot of RAM – Konrad Nov 05 '22 at 17:48
  • 1
    @marco-a that was against a partial dataset, running it against the full dataset now and it is taking much longer LOL – Kirk Ouimet Nov 05 '22 at 17:57
  • You might want to look at SparkSQL if you're going to split your dataset into multiple files anyway. Javascript is not commonly used for such problems – OneCricketeer Nov 06 '22 at 13:32
  • 2
    `keyArrayLength`, really? Seems like Copilot is not appropriate when it takes you longer to verify the solution it generated than to write the code yourself. – Bergi Nov 06 '22 at 15:08
4

I don't have a ready-to-use solution, but I can offer a bunch of advice that could lead you to an efficient solution:

  1. If you are using NodeJS, consider writing a C++ addon. With C++ you can write low-level code that could lead to faster and more efficient software. Writing C++ code will allow you to use the right data structures for this task, manage memory efficiently and work around many of the NodeJS limitations

  2. Work with files on disk. I think the file system is the key to your problem. In other words, apply an on-disk sort to your file (e.g. with the Linux sort command), then split your 15GB file into, let's say, 200MB chunks. You can now load each chunk with NodeJS and remove duplicates, paying attention that the last row of the nth file can still find a duplicate in the (n+1)-th file (a streaming sketch of this final pass follows the list).

  3. (bonus) Exploit all the threads. Even if this alone won't solve the memory problem, combined with the multi-chunk approach it will lead to faster code
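
As a minimal sketch of point 2's final pass (assuming the strings have already been extracted one per line and sorted on disk, e.g. with sort; the file name is illustrative), streaming keeps memory flat because duplicates are adjacent in sorted data:

// count-unique.mjs: count unique lines in an already-sorted file, one line at a time
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

const rl = createInterface({ input: createReadStream('sorted-strings.txt') });

let previous = null;
let uniqueCount = 0;

for await (const line of rl) {
    // In a sorted file duplicates are adjacent, so comparing with the previous line is enough
    if (line !== previous) {
        uniqueCount++;
        previous = line;
    }
}

console.log('Unique strings:', uniqueCount);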

3

I would suggest importing the data into a database; MongoDB or PostgreSQL will do just fine.

But if you really need a JS solution, here is an implementation of ShardSet(), which stores strings in a configurable number of buckets (256 by default).

Tested on 5M strings in a 250MB data.json: it took ~15s with deno and ~86s with sort | uniq.

% ls -lah data.json 
-rw-r--r--  1 me  staff   250M 11 16 16:50 data.json

% time ./scripts/distinct.sh
5242880
./scripts/distinct.sh  14.49s user 1.68s system 117% cpu 13.717 total

% time ./scripts/sort-uniq.sh 
 5242882
./scripts/sort-uniq.sh  86.34s user 0.99s system 106% cpu 1:22.36 total

Check out this repo to see if it suits your needs: deno-demo-large-json

Note: the string hash method taken from v8 is quite good. A random dataset hashes into the buckets evenly.
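
For illustration, the bucketing idea looks roughly like this (a minimal sketch, not the repo's actual code; the bucket count is arbitrary and the hash below is a simple FNV-1a variant rather than the v8 hash the repo uses):

// A tiny sharded set: route each string to one of N buckets by hash,
// so no single Set has to hold all of the entries by itself.
class ShardSet {
    constructor(bucketCount = 256) {
        this.buckets = Array.from({ length: bucketCount }, () => new Set());
    }

    // FNV-1a-style string hash; illustrative only
    hash(str) {
        let h = 0x811c9dc5;
        for (let i = 0; i < str.length; i++) {
            h ^= str.charCodeAt(i);
            h = Math.imul(h, 0x01000193);
        }
        return h >>> 0;
    }

    add(str) {
        this.buckets[this.hash(str) % this.buckets.length].add(str);
    }

    get size() {
        return this.buckets.reduce((sum, set) => sum + set.size, 0);
    }
}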

OpenGG
  • can you describe what exactly you mean by a specific *function, class or **hash method*** taken from **`v8`**? I have found some resources but did not find anything related that points to it. – user11717481 Dec 03 '22 at 17:15
  • https://262.ecma-international.org/6.0/#sec-uint8array – user11717481 Dec 03 '22 at 17:17
  • @user11717481 `shardSet.hash(string)` returns an int, which is the hash code of the input string. The hash algorithm was taken from v8. – OpenGG Dec 07 '22 at 10:01
  • 1
    @user11717481 But why `shardSet.hash(buff)` instead of `shardSet.hash(string)`? Because calculating hash code with `u8Arr[i]` was way faster than `str.charCodeAt(i)`. – OpenGG Dec 07 '22 at 10:04
1

Since you said you've already loaded your data and created an array, you could try this approach.

You can play with the TOTAL_CHUNK variable to see if it helps.

N.B. I've never worked with such a huge amount of JSON data.

const data = [
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
    15, 16, 17, 18, 19, 20,
];
const [store, TOTAL_CHUNK] = [[], 18];
let [count, index, uniqueCount] = [0, 0, 0];

while (count < data.length) {
    const currentIndex = index % TOTAL_CHUNK;
    const item = data[index];

    if (store[currentIndex] === undefined) store[currentIndex] = new Set();

    let isUnique = true;
    for (let i = 0; i < TOTAL_CHUNK; i++) {
        if (store[i]?.has(item)) {
            isUnique = false;
            break;
        }
    }

    if (isUnique) {
        store[currentIndex].add(item);
        uniqueCount++;
    }

    count++;
    index++;
}

console.log(uniqueCount);
Sifat Haque
0

Others have handled the "JavaScript isn't really the best thing for this" aspect, but I'll try to answer the question directly, without opining on the context:

Have you tried constructing a Set from the array? It's likely this will take a Good Long While, but I doubt there's any JS solution that won't take a Good Long While.
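
A minimal sketch, assuming the parsed array (called strings here) fits in memory; note that a single Set in V8 reportedly tops out at around 2^24 (~16.7 million) entries, so at this scale it may still hit a wall:

// Deduplicate by constructing a Set directly from the array
const unique = new Set(strings);
console.log('Unique count:', unique.size);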

borbulon
0

You can do this with a custom hash function and a custom hash map. Basically, to get around memory size limits in the runtime, you want to limit the hash map size, let's say to 64K buckets (2^16). That means your hash needs to be 16-bit. So:

  1. Create a string -> 16-bit integer hash function
  2. Create an array of string arrays with 2^16 entries
  3. For each string in the input:
     3.1 Compute the hash; this becomes your array index
     3.2 Look up the bucket in your array at the computed index
     3.3 If it's empty, simply add the string
     3.4 If it's not empty, compare the string to all others in that bucket (normal linear search) and only add it if not present

You may need to adjust the hash table size until you find the performance sweet spot. With 64K buckets you get roughly 500 strings in each bucket, which is a lot of collisions, but it's still orders of magnitude faster than doing 28 million comparisons for every string.
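
A minimal sketch of those steps (the hash below is illustrative; any reasonable string-to-16-bit hash will do):

// 1. String -> 16-bit hash
function hash16(str) {
    let h = 0;
    for (let i = 0; i < str.length; i++) {
        h = (Math.imul(h, 31) + str.charCodeAt(i)) & 0xffff; // keep it within 16 bits
    }
    return h;
}

// 2. One bucket (a plain array of strings) per possible hash value
const buckets = Array.from({ length: 1 << 16 }, () => []);

// 3. Insert with a linear search limited to the matching bucket
function addIfUnique(str) {
    const bucket = buckets[hash16(str)];
    if (bucket.includes(str)) return false; // 3.4: already present
    bucket.push(str);                       // 3.3: new string
    return true;
}

// Counting uniques over the full input (allStrings is assumed to be the parsed array):
// let uniqueCount = 0;
// for (const s of allStrings) if (addIfUnique(s)) uniqueCount++;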

Asik