I have several datasets based in JSON. This is because the datasets are pulled in through an API call. I can change this, if it makes sense, but it might slow down the intended function a lot.
Now, each dataset could easily 10,000 and even 100,000 records within i.e. dataset.length = 10,000.
Now in each of these datasets, they have an id column that allows me to connect them. I essentially want to select a primary dataset, and then cycle through the secondary datasets, find the first record (or even all records) where that share the same identifier. let's say primary.id = secondary.id
This is essentially like an INNER JOIN but running through with JSON and using Google Scripts.
I want to match the rows in the secondary datasets and just add them to the relevant places in the primary and output just one dataset to the file.
I wrote the following script
function mergeRecords(primary,identifiers,columns) {
const args = Array.prototype.slice.call(arguments);
args.splice(0,3);
const argsFlat = args.flat();
if (args.length<1) {
throw 'Insufficient Number of datasets, please add at least 1 secondary dataset';
}
try {
for (var i in primary) {
var parent = primary[i];
var a = 0;
var child = argsFlat[a];
while (parent[identifiers] != child[identifiers] && a<argsFlat.length) {
a++;
}
columns.forEach(function(x) {
parent['Child-'+x]=child[x];
}
);
}
return primary;
} catch(err) {
Logger.log(err);
Logger.log(primary);
Logger.log(arguments);
Logger.log(argsFlat);
Logger.log(columns);
Logger.log(identifiers);
Logger.log(a);
Logger.log(parent.id+' '+child.id);
}
}
Essentially the primary is the main dataset to compare all the others against. The identifier is the name of the key that should be the same in target secondary datasets. columns are the data properties I want to retrieve from the secondary sets.
As you can see, I've made it that arguments need to be included after the columns argument otherwise it errors out.
I am using the while method/function (honestly I'm not sure the difference, I'm still new to this) because I have found it to be quite a bit faster than using
for (var a in argsFlat) {
var child = argsFlat[a];
if (parent[identifiers] == child[identifiers]) {
columns.forEach(function(x) {
parent['Child-'+x]=child[x];
}
);
else {continue;}
even though using this would allow to append the data from the different datasets if multiple datasets have the same identifier as the parent.
This is the test source I am using:
function test() {
var primary = {
'records': [{
'id':1234445,
'name':'Yiz Segall1',
'email':'123@123.com'},
{
'id':567568,
'name':'viva1',
'email':'old@new.com1'
},
{
'id':123442,
'name':'shompie1',
'email':'shompie@gmail.com1'
}]
}
var secondary = {
'records': [{
'id':1234445,
'name':'2Yiz Segall',
'email':'2123@123.com',
'phone':12435435},
{
'id':567568,
'name':'2viva',
'email':'2old@new.com',
'phone':4445555
},
{
'id':123442,
'name':'2shompie',
'email':'2shompie@gmail.com',
'phone':5556666
}]
}
var tertiary = {
'records': [{
'id':1234445,
'name':'3Yiz Segall',
'email':'3123@123.com',
'address':'32 wisconsin avenue'
},
{
'id':567568,
'name':'3viva',
'email':'o3ld@new.com',
'address':'14 pine street'
},
{
'id':123442,
'name':'s3hompie',
'email':'3shompie@gmail.com',
'address':'15 ny street'
}]
}
var records = mergeRecords(primary.records,'id',['name','email'],secondary.records,tertiary.records);
Logger.log(records);
}
My question is 2-fold:
- While the first record in the test comes back correct, the others just pull in the first data, so how do I fix?
- this is the response I get from the test
[{email=123@123.com, Child-name=2Yiz Segall, name=Yiz Segall1, Child-email=2123@123.com, id=1234445.0}, {email=old@new.com1, name=viva1, Child-name=2Yiz Segall, id=567568.0, Child-email=2123@123.com}, {name=shompie1, email=shompie@gmail.com1, Child-name=2Yiz Segall, Child-email=2123@123.com, id=123442.0}]
- this is the response I get from the test
- If I wanted to make it that it would attach all data rows that had the same identifier as the primary row, I have a problem with naming. Right now, it uses
parent['Child-'+x]=child[x];
which only works for the first instance of this. How do I make it child1 and then child 2 and then child 3, so it's flexible enough to adapt to however many datasets I need to include.- I thought of using the dataset names, but the data is converted into an array and then flattened, so it's impossible to tell where it came from, or is there an eays way to do that?
Remember that this process plays a part of preparing the data properly to output into a file, by this point, I have already run script to make 3-4 API calls.
Any help is welcome, thanks.