I'm writing some code to anonymize an R dataset so that it strips any useful information out of the data while preserving the structure that matters for running regressions, etc. on it. I want to be sure I've covered every place where telling information about the data could be hiding. My process so far is:

  1. Replace variable names of data frame with uninformative names (x1, x2, ...)
  2. Turn all categorical variables into factors with simple numerical levels
  3. Scale and center all numerical variables (except logical or 0/1)
  4. Use `attributes(x) <- NULL` to strip things like variable labels added through haven, etc.
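
A minimal base-R sketch of the four steps above, not a vetted anonymizer. The helper name `anonymize_df` and the toy data are hypothetical. One assumption worth flagging: `attributes(x) <- NULL` applied to a factor would also remove its levels and class, so the sketch rebuilds factors instead of stripping them. It also drops the `scaled:center` and `scaled:scale` attributes that `scale()` attaches, since those record the original mean and standard deviation.

```r
# Hypothetical helper: applies steps 1-4 column by column.
anonymize_df <- function(df) {
  cols <- lapply(df, function(col) {
    if (is.factor(col) || is.character(col)) {
      # Step 2: integer codes 1..k replace the original category labels.
      return(factor(as.integer(factor(col))))
    }
    # Step 4: drop variable labels and any other attributes.
    attributes(col) <- NULL
    is_binary <- is.logical(col) || all(col %in% c(0, 1, NA))
    if (is.numeric(col) && !is_binary) {
      # Step 3: center and scale; as.numeric() discards the attributes
      # in which scale() stores the original mean and sd.
      col <- as.numeric(scale(col))
    }
    col
  })
  # Step 1: uninformative names; a fresh data frame also gets default row names.
  names(cols) <- paste0("x", seq_along(cols))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Toy data, invented for illustration only.
raw <- data.frame(
  income = c(10, 20, 30, 40),
  city   = c("Oslo", "Rome", "Oslo", "Lima"),
  flag   = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)
attr(raw$income, "label") <- "Household income"  # mimics an imported label

anon <- anonymize_df(raw)
str(anon)
```

Note that the integer codes in step 2 follow the sorted order of the original labels, so the ordering itself is still carried over.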

I'm trying to keep my tin foil hat on when speccing out this procedure. Have I covered all my bases, or is there some other way information about the data contents could be hiding in my dataset?

NB: I'm specifically asking whether I have removed all the information explicitly contained in the R objects. For example, a novice R user who didn't know about attributes might think that steps 1-3 on their own were sufficient to strip an object of readable information. I would like to ascertain whether there are other features I might need to strip out. The question of whether there's any telling information in the structure of the data itself is pertinent to my broader task but out of scope for this site, and I imagine there could be reams written on it.
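
To illustrate the kind of hiding place meant here, a small sketch with made-up data: a column-level `label` attribute (similar in spirit to what an import package may attach) and a data-frame-level `comment()` do not show up when the data frame is printed, but `attributes()` still returns them. The helper `list_hidden` is hypothetical.

```r
df <- data.frame(x1 = c(1.5, 2.5, 3.5))
attr(df$x1, "label") <- "Annual income (USD)"  # column-level metadata
comment(df) <- "Survey wave 3"                 # frame-level note

print(df)  # prints only the values

# Hypothetical helper: list attribute names beyond the ones every
# data frame has (names, class, row.names), for the frame and each column.
list_hidden <- function(d) {
  list(
    frame   = setdiff(names(attributes(d)), c("names", "class", "row.names")),
    columns = lapply(d, function(col) names(attributes(col)))
  )
}

hidden <- list_hidden(df)
str(hidden)
```

Checking both levels matters because renaming columns or stripping attributes column by column leaves anything attached to the data frame itself untouched.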

Empiromancer
  • What you describe is probably not sufficient. After all that, I'd take summary stats of the real data and then simulate fake data from it (taking into account hierarchical structure and whatnot). Anyway, this is not a concrete programming question and so isn't a great fit for this site. You might want to try https://community.rstudio.com/ for such open-ended discussions. – Frank Sep 21 '17 at 17:05
  • How are you sharing the data, and with whom, and what aspects of the data should be kept secret from them? It's really hard to give advice because what you need to do will depend on who you're trying to hide something from, and why. – Mike Stanley Sep 21 '17 at 17:06
  • what will the recipients know ? – moodymudskipper Sep 21 '17 at 17:14
  • I agree with Frank, but I'd fudge the data with some random noise. – Axeman Sep 21 '17 at 17:47
  • Frank, Mike, thanks for pointing out that my question didn't clearly explain the scope of what I'm looking for. You're right, if I was looking for general advice on how to safely anonymize data this wouldn't be the right forum for it. My intent was to ask whether I'd stripped out all of the metadata that could be hiding in the R objects, i.e., a question about the structure of R objects, which I believe *is* suited for this site. I'll edit my question to better explicate this. – Empiromancer Sep 21 '17 at 18:44
  • Using `str()` and/or `dput()` you should be able to see anything "hidden" in your data frame, but all that's stored there are class/attribute data. By the time you write out to `.csv` or something what you see is what you have in the case of a data.frame. If you were using a different data class that's storing multiple lists or something, you'd need to be more careful. – Mako212 Sep 21 '17 at 18:49
  • Heck, when done sanitizing, save it as a CSV, JSON, JSON-NL, or similarly-transparent format. You can see and verify that nothing else is there. The only way you could leak R-based meta-data is if you give them the R object as an `.rda` or `.rds`, I think. (Unless you are providing a function that they call from their own R instance, in which case you are only keeping the honest away.) – r2evans Sep 21 '17 at 22:31
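
A rough sketch of the round-trip check suggested in the last two comments, using base R only and invented data. A label is deliberately left on one column to show that it does not reach the file. `row.names = FALSE` is an assumption on my part: row names can themselves carry identifiers, so the sketch leaves them out.

```r
anon <- data.frame(x1 = c(-1, 0, 1), x2 = factor(c(1, 2, 1)))
attr(anon$x1, "label") <- "leftover label"  # deliberately left behind

path <- tempfile(fileext = ".csv")
write.csv(anon, path, row.names = FALSE)

# The file text is everything a recipient would get.
csv_lines <- readLines(path)
print(csv_lines)

# Reading it back yields plain columns with no extra attributes.
back <- read.csv(path)
str(back)
```

By contrast, `saveRDS()` and `save()` serialize the R object together with its attributes, so an `.rds` or `.rda` file would still contain the label.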

0 Answers