I have a huge JSON file to process which holds around 1,500,000 JSON objects. I am performing a search operation where I use two for loops to compare object values.

Below is an example:

const data = [
  {
    "slug": "vertical-lift-module-market",
    "id": 68055,
    "related_reports_updated": {
      "sub_categories": [
        {
          "slug": "audience-analytics-market",
          "id": 66684,
          "short_title": "Audience Analytics Market"
        },
        {
          "slug": "mobile-wallet-market",
          "id": 68830,
          "short_title": "Mobile Wallet Market"
        }
      ]
    }
  },
  {
    "slug": "united-states-real-estate-services---growth-trends-and-forecast-2022---2027",
    "id": 68056,
    "related_reports_updated": {
      "sub_categories": [
        {
          "slug": "canada-real-estate-services-market---growth-trends-and-forecast-2020---2025",
          "id": 68051,
          "short_title": "Canada Real Estate Services Market"
        },
        {
          "slug": "germany-real-estate-services-market--growth-trends-and-forecast-2020---2025",
          "id": 68054,
          "short_title": "Germany Real Estate Services Market"
        }
      ]
    }
  },
  {
    ...
  }
];
// This data holds around 1,500,000 objects

What I am trying to do is compare the slug of one object with the slugs in the sub_categories arrays of the other objects. If a match is present, create one object, push it into the result array, and return that result array.

const result = [];

for (let i = 0; i < data.length; i++) {
  for (let j = 0; j < data.length; j++) {
    // Comparing operation
  }
}

console.log(result);

But after running for some time, it gives me this error:

[41955:0x523ce90]   162238 ms: Mark-sweep (reduce) 4096.9 (4102.7) -> 4096.9 (4104.7) 
MB, 3481.7 / 0.4 ms  (average mu = 0.092, current mu = 0.000) allocation failure scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
1: 0xa3ac10 node::Abort() [node]
2: 0x970199 node::FatalError(char const*, char const*) [node]
3: 0xbba58e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) 
[node]
4: 0xbba907 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char 
const*, bool) [node]
5: 0xd76b25  [node]
6: 0xd776af  [node]
7: 0xd854eb v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, 
v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
8: 0xd890ac v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, 
v8::internal::AllocationType, v8::internal::AllocationOrigin, 
v8::internal::AllocationAlignment) [node]
9: 0xd5778b v8::internal::Factory::NewFillerObject(int, bool, 
v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
10: 0x109fd4f v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, 
v8::internal::Isolate*) [node]
11: 0x1448f59  [node]

Aborted (core dumped) 

To get rid of this error I even tried `node --max-old-space-size=4096 index.js` to increase the memory available to the Node process.

But I am still getting the same issue. Is there any other way to resolve this and get the desired result?

  • Are you trying to extract the `names` from the objects in the array by excluding duplicated `names`? – Dilshan Jan 07 '23 at 15:02
  • `result` will always have all names from `data` because both loops iterate over all elements in `data`. That's a really wasteful way to get a copy of `data` + a bunch of empty objects... -> What are you trying to accomplish with that script ([XY Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/))? – Andreas Jan 07 '23 at 15:03
  • I have updated my post with actual problem statement. – Digvijay Jan 07 '23 at 15:17
  • Iterate over the elements and add all sub-slugs into a `Set` and in a second run check the slugs against that `Set`. Exit the inner loop as soon as you've found a match (there's no need to iterate over the whole array). Try to split the work into smaller chunks. Use worker threads. ... – Andreas Jan 07 '23 at 15:31
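A minimal sketch of the Set-based lookup suggested in the last comment, assuming each object has the slug and related_reports_updated.sub_categories shape shown in the question (note that it does not exclude an object matching its own sub-categories):

    // Build a Set of all sub-category slugs in one pass.
    const subSlugs = new Set();
    for (const report of data) {
      for (const sub of report.related_reports_updated.sub_categories) {
        subSlugs.add(sub.slug);
      }
    }

    // Check each parent slug against the Set; each lookup is O(1).
    const result = [];
    for (const report of data) {
      if (subSlugs.has(report.slug)) {
        result.push({ slug: report.slug, id: report.id });
      }
    }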

2 Answers


Let's assume that each element holds a word and a number.

  • a double-precision 64-bit number in JS takes 8 bytes*
  • a word with 5 letters is around 10 bytes**

If a single element takes 18 bytes, then

1.5e6 elements take 2.7e7 bytes = 27 MB

Independently of that, in code like this:

    const obj = {};

    if (data[i].name == data[j].name) {
      obj.name = data[i].name;
    }

you are pushing to the result array inside a double loop. Each of these loops goes through 1.5e6 elements, so in total you can add up to 1.5e6 * 1.5e6 = 2.25e12 elements to the result.

Now let's compute the size of the result:

2.25e12 * (8 bytes + 10 bytes) = 4.05e13 bytes

We know that 1 gigabyte = 1,000,000,000 bytes = 1e9 bytes and 1 terabyte = 1e12 bytes.

So in your case you would need about 40 terabytes to store your result.
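The same back-of-the-envelope estimate as a runnable snippet (numbers taken from above):

    // Estimated size of the result array produced by the double loop.
    const elements = 1.5e6 * 1.5e6;              // 2.25e12 pushed objects
    const bytesPerElement = 8 + 10;              // one number + one 5-letter word
    const totalBytes = elements * bytesPerElement;
    console.log(totalBytes / 1e12, "terabytes"); // ≈ 40.5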

You definitely have to rethink the algorithm you want to implement.


* each number actually takes up an average of 9.7 bytes (source)

** we can assume 2 bytes per character, but it can be 5 or more bytes (source)

– Daniel

If you use a less complex algorithm that does not add duplicate values, it will push a much lower number of elements into the result array.

const result = [];
const dictionary = {};
const n = data.length;

// O(n) to build a dictionary of all parent slugs
for (let i = 0; i < n; i++) {
  dictionary[data[i].slug] = 1; // hashing
}

// O(n*k) to check the dictionary instead of O(n*n*k) for brute force
for (let i = 0; i < n; i++) {
  const subCategories = data[i].related_reports_updated.sub_categories;
  for (let j = 0; j < subCategories.length; j++) {
    // O(1) lookup
    if (subCategories[j].slug in dictionary) {
      dictionary[subCategories[j].slug]++; // >= 2 => will go to result
    }
  }
}

// add to result, O(n)
for (const [key, value] of Object.entries(dictionary)) {
  if (value >= 2) {
    result.push(key);
  }
}

Since an element can also match itself (like element 5 vs element 5, when a slug appears in its own sub_categories), it may still be counted as a match, although it is added to the result only once. So you may need to add an extra comparison for that, depending on the required algorithm; a sketch of that check follows below.
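A minimal sketch of that extra check, assuming a parent slug listed in its own sub_categories should not count as a match (this replaces the second loop above):

    // Same as the O(n*k) loop above, but skipping self-matches.
    for (let i = 0; i < n; i++) {
      const subCategories = data[i].related_reports_updated.sub_categories;
      for (let j = 0; j < subCategories.length; j++) {
        if (subCategories[j].slug === data[i].slug) continue; // self-match
        if (subCategories[j].slug in dictionary) {
          dictionary[subCategories[j].slug]++;
        }
      }
    }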

– huseyin tugrul buyukisik