
I wish to share a dataset (largely time-series data) with a group of data scientists so that they can explore the statistical relationships within it (e.g. between variables). However, for confidentiality reasons I am unable to share the original dataset, so I was wondering whether I could apply some random transformation that I know but that the recipients won't. Is this a common practice? Is there an associated R package?

I have been exploring the use of synthetic datasets and have looked at 'synthpop', but my challenge seems slightly different. For example, I don't necessarily want the data to include fictional individuals that resemble those in the original file. Rather, I'd prefer the values of a given variable to be uninterpretable to a human viewer (e.g. still numerical but nonsensical) while still enabling statistical analysis (e.g. although the actual values are obscured, the relationship between variables 'x' and 'y' remains the same).

I have a feeling that this is probably quite a simple process (e.g. change names of variables, apply the same transformation across all variables), but I'm not a mathematician/statistician and so I don't want to violate underlying relationships through an inappropriate transformation.

Thanks!

rob99985
  • Sounds like what you want to do is to generate random data which has the same statistical profile. The danger of course is that the generated data might differ in some important way from the original. For the same reason, it is doubtful that any transformation of the data really preserves the relationships. At best, it will preserve the relationships that you deem important, but perhaps you are mistaken about what is really important. In any event, this is more of a methodology question. It would be better to ask it on [statistics.se] – John Coleman Aug 14 '20 at 19:41
  • 1
    Not knowing exactly what the confidentiality issue is makes this difficult, so maybe I am oversimplifying, but could you just convert the dates to a standardized offset? For example, you could subtract 1995-03-03 from all the dates. The time data they would analyze would then be the number of days since the date you subtracted. Relationships would remain the same, but they would have no temporal context for where the data came from. You would just add 1995-03-03 back to recover your original dataset. You could also change the variable names easily. – Tanner33 Aug 14 '20 at 19:59
  • 1
    @Tanner33 that is a clever idea, but if security is really an issue then that would be trivial to brute-force. More generally, any linear transformation would just add obscurity rather than true security to the data, which might be enough for OP, but maybe not. – John Coleman Aug 15 '20 at 14:41
  • Thanks folks, I will ask the question on Cross Validated! Apologies for using the wrong forum! I completely agree that it wouldn't be the tightest of security, but I would probably be seeking enough masking that only the very dedicated/skilled could figure out what the true values were. @Tanner, this is time-series data in the sense that it's multiple physiological variables collected over time, not dates per se. It is the actual values at each time point that are important and that I am seeking to conceal/transform, and likewise the rate at which those variables change. Thanks for the interest/help! :-) – rob99985 Aug 17 '20 at 16:56
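
The linear-transformation idea raised in the comments can be sketched as follows. This is only an illustration in Python (standard library only), not a recommended or secure method: the variable names `heart_rate` and `resp_rate`, the helper functions `make_key`, `mask` and `unmask`, the aliases `v1`, `v2` and the key ranges are all hypothetical. Each variable gets a secret positive scale and an offset, plus a meaningless alias. Such an affine map leaves Pearson correlations unchanged, but it rescales variances, regression slopes and rates of change, and the offsets alter ratios between variables. As noted in the comments, it adds obscurity rather than true security.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def make_key(names, rng):
    """Draw a secret (alias, scale, offset) per variable (hypothetical helper).

    The scale is kept positive so the sign of every correlation is preserved.
    """
    key = {}
    for i, name in enumerate(names, start=1):
        scale = rng.uniform(0.5, 5.0)
        offset = rng.uniform(-100.0, 100.0)
        key[name] = (f"v{i}", scale, offset)
    return key


def mask(data, key):
    """Apply value -> scale * value + offset and rename each variable."""
    masked = {}
    for name, values in data.items():
        alias, scale, offset = key[name]
        masked[alias] = [scale * v + offset for v in values]
    return masked


def unmask(masked, key):
    """Invert the masking; only possible for whoever holds the key."""
    restored = {}
    for name, (alias, scale, offset) in key.items():
        restored[name] = [(v - offset) / scale for v in masked[alias]]
    return restored


# Made-up example data standing in for two physiological time series.
data_rng = random.Random(0)
heart_rate = [70 + 10 * math.sin(t / 5) + data_rng.gauss(0, 1) for t in range(50)]
resp_rate = [0.2 * h + data_rng.gauss(0, 0.5) for h in heart_rate]
data = {"heart_rate": heart_rate, "resp_rate": resp_rate}

# Fixed seed only so the sketch is reproducible; a real key should be kept private.
key = make_key(data, random.Random(12345))
masked = mask(data, key)
restored = unmask(masked, key)

r_original = pearson(data["heart_rate"], data["resp_rate"])
r_masked = pearson(masked["v1"], masked["v2"])
print(f"correlation before: {r_original:.6f}, after: {r_masked:.6f}")
```

Only `masked` would be shared; `key` stays with the data owner, who can map results back to the original units with `unmask`. Using one scale and offset per variable, rather than the same transformation for all variables, makes the values harder to guess, but anyone who knows the plausible range of a physiological variable could still estimate the key.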

0 Answers