
This is the first time I'm writing on this site.

So I need to generate a set of random data with a function that returns an object. This object picks some of its properties (at deeply nested levels) randomly from other arrays of objects. So the function always returns an object with the same structure, but with different values in its properties.

Is there a way to calculate a uniqueness ratio or something like that? For example, if one generated object is exactly equal to another in the set, it would get a uniqueness of 0; if it shares no property values with any other, it would get 100; and if some values are shared and others are not, some percentage in between?

My goal with this is to generate a set of 100, for example, and pick the top 20 most unique generated objects.

Thanks in advance for your ideas.

EDIT:

Let's assume I already generated the set of data. All objects have the same structure but different values. Something like this:

{
  name: 'Some Name',
  propA: (picked randomly from set A),
  propB: (picked randomly from a different set B),
  sections: [
    {
      propC: (another random from another set C)
    },
    {...},
    ...
  ]
}

I generated an array of these objects with some utilities I wrote with Ramda, like picking a random element from a list, and R.times to repeat the generation.
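Something like this, simplified (setA, setB, setC just stand in for my real source sets, and the helper names are made up for this post):

const pickRandom = xs => xs[Math.floor(Math.random() * xs.length)];

const generateObject = () => ({
  name: 'Some Name',
  propA: pickRandom(setA),
  propB: pickRandom(setB),
  sections: R.times(() => ({ propC: pickRandom(setC) }), 3)
});

// R.times calls the generator n times and collects the results
const generated = R.times(generateObject, 100);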

The main issue is that I need this:

{
  ...generatedObject,
  uniqueness: 79
}

On each object, the uniqueness is a percentage.

So far I used deep-diff to get a difference between two objects, and wrote a function to extract a percentage based on the number of props that changed in the object.

This is that function:

const R = require('ramda');
const { diff } = require('deep-diff');

// changes is a Number: the total count of randomly generated props.
// Note: deep-diff's `diff` returns undefined when the two objects are
// identical, so default to an empty array before taking the length.
const measureUniquenessBetweenTwoChildObjects = R.curry((changes, objA, objB) =>
  R.compose(
    R.multiply(100),
    R.divide(R.__, changes),
    R.length,
    R.defaultTo([]),
    diff
  )(objA, objB)
);

What this does is: if there are as many changes as there are generated props, then the difference is 100%.

Then I picked every object in the list, mapped this function against every other object except itself, reduced that array of differences to an average, and took that as the final number. Then I attached that number to the object with R.assoc.
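Roughly like this (a simplified sketch of what I did, not my exact code; totalProps stands for the number of generated props):

const measure = measureUniquenessBetweenTwoChildObjects(totalProps);

const withUniqueness = objects.map(obj =>
  R.assoc(
    'uniqueness',
    // average difference against every other object in the set
    R.mean(objects.filter(o => o !== obj).map(other => measure(obj, other))),
    obj
  )
);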

Inspecting the array of percentage differences gives me something like this:

[
  73.02, 73.02, 72.79, 72.56,
  72.56, 72.34, 72.34, 72.11,
  71.66, 71.66,  71.2, 70.98,
  70.98, 70.98, 70.75, 70.52,
  70.29, 70.29, 70.07, 69.84
]

Each of these is the uniqueness ratio I attach to the corresponding object.

However, I think my solution is flawed; I sense something is odd here. This was my logic for solving the problem.

What I am asking is: how would you solve this? In the end, the task is to write an algorithm that calculates a uniqueness value for each object within a set of objects of the same structure but different values.

I'm not asking for code, just some ideas to make this work in a proper way. I'm not a data scientist or a mathematician, so I went with my naive approach.

Hope this makes it clearer.

Thanks.

user3297291
  • Please read [ask] and create a [mcve] of what you tried – Alon Eitan Mar 09 '20 at 18:51
  • If the data is truly randomly generated.. – user2864740 Mar 09 '20 at 18:54
  • I think there's a very interesting question lurking in here. Please edit it to make it clear what you're looking to be done and what you've tried so far. For instance, are you looking for help generating these random objects? Or do you want a formula for uniqueness that you could code yourself? Or do you want an algorithm to calculate such a uniqueness value given a set of objects? Something else? And of course, we like to see the work you've put into it yourself. – Scott Sauyet Mar 09 '20 at 19:13
  • I voted to reopen. A few more people need to do that as well. It will go onto a queue of reopen requests for people to look at. But if you'd rather create a new question, feel free. Just add a comment here pointing to it. And then close this one. – Scott Sauyet Mar 09 '20 at 20:16
  • Thanks, I'll wait a bit. Then after the edit is it more clear? – TwireonEnix Mar 09 '20 at 20:18
  • This is much more clear. Some folks may still think it's out of bounds for StackOverflow, as many really just want to be able to write code for an answer. But I disagree. I like the question. That matrix of results looks odd to me, as I would expect it to be symmetric around the main diagonal. – Scott Sauyet Mar 09 '20 at 20:20
  • `deep-diff` might be overkill here, as you already know that your objects have the same structure, only varying data. You might use something like `findLeafPaths` from [another answer](https://stackoverflow.com/a/58329659) and `R.path` to collect the data. I'm not sure exactly what set you're looking for, but a naive way to get a more representative set would be find the one with the max average distance from its peers, choose it, and then do the same thing for the next one, ignoring the distance from the already-chosen ones. I'm guessing there are some better, well-known algorithms for this. – Scott Sauyet Mar 09 '20 at 20:34
  • It was; each value of that matrix is the average of the diagonal matrix values in the xprod. I think that's where the flaw is: I really don't know why I averaged, it just made sense to me at the time. I think that's wrong now. That's why I asked this here. – TwireonEnix Mar 09 '20 at 20:35
  • Personally I'd use something like TestCheck.js to generate random data from a set of constraints and let that tool worry about having true randomised data. – customcommander Mar 09 '20 at 21:08

1 Answer


Several other questions demonstrate that if you're looking for an optimal solution, this problem is NP-hard. I don't know of any algorithm better than brute force for that, and brute force is out of the question, as (100 choose 20) is rather large (535983370403809682970) -- unless you have some hardware I'd really like to know about!

But I think you can find a local optimum that likely wouldn't be a bad guess. That would involve (a sketch follows the list):

  1. Calculate the difference matrix.
  2. Sum the rows.
  3. Choose the index with the maximum sum.
  4. Add the object at that index to your list.
  5. If you still need more items, remove that index's row and column and go back to step 2.
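
In code, that greedy selection might look something like this (a rough sketch, untested; `dist` is assumed to be your precomputed symmetric difference matrix and `objects` your generated array):

// Greedy selection: repeatedly take the remaining object whose total
// difference from the other remaining candidates is largest.
const pickMostDistinct = (dist, objects, count) => {
  const remaining = new Set (objects .map ((_, i) => i))
  const chosen = []
  while (chosen .length < count && remaining .size > 0) {
    let best = -1, bestSum = -Infinity
    for (const i of remaining) {
      // sum distances to the other *remaining* candidates only
      let sum = 0
      for (const j of remaining) { if (j !== i) { sum += dist [i] [j] } }
      if (sum > bestSum) { bestSum = sum; best = i }
    }
    chosen .push (objects [best])
    remaining .delete (best)
  }
  return chosen
}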

Or of course you could use some simulated annealing technique to find an even better local maximum.


As to the differences, as I suggested in a comment, deep-diff might be more than you need. You might be able to use a function like this:

const findLeafPaths = (o, path = [[]]) =>
  o !== null && typeof o == 'object'  // typeof null is also 'object', so guard against null leaves
    ? Object .entries (o) .flatMap (
        ([k, v]) => findLeafPaths (v, path) .map (p => [k, ...p])
      )
    : path

to find all the paths in a sample object, and then reduce each object to an array of values by mapping R.path over those paths. Finding a numerical difference between two such value arrays should be fairly simple (I'd start with R.zipWith (R.equals) or some such). But if deep-diff is working well for you, there's no reason to change; it's simply testing for structural differences that, as I understand your requirements, are never going to be there.
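
For instance, a quick sketch of that comparison (untested, just to show the shape of it; `objects` is your generated array):

// Collect the leaf paths once, from any sample object
const paths = findLeafPaths (objects [0])

// Reduce an object to its array of leaf values
const leafValues = (obj) => paths .map (p => R.path (p, obj))

// Percentage of leaf values that differ between two objects
const difference = (a, b) => {
  const same = R.zipWith (R.equals, leafValues (a), leafValues (b))
  return 100 * same .filter (x => !x) .length / same .length
}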

Scott Sauyet
  • Can you please elaborate why this is a problem of permutations? I see it as O(n^2), as one can measure the uniqueness of each of the 100 objects against each other, resulting in (100 * 100 - 100) / 2 or 4950 measures of uniqueness. (Am presuming that the measure of uniqueness is commutative, ie A vs B is the same as B vs A, hence the halving, and that there is no reason to measure an object's uniqueness against itself, hence subtracting the 100.) With this matrix of results in hand, the measure of uniqueness for each object can be averaged and the top 20 selected... Am I missing something? – Trentium Mar 12 '20 at 23:18
  • Ah, I misunderstood. I thought you wanted the set most mutually distinct. If you want the 20 most distinct from the collection, then it should be a matter of measuring and choosing the top 20. – Scott Sauyet Mar 13 '20 at 14:43
  • Ah, and now I understand where the permutation discussion comes in, if the problem is one of the *set* of 20 most mutually distinct... Thanks for clarifying. – Trentium Mar 13 '20 at 21:06