mapreduce - how to anonymize a column's values

Question

input

1 - -  GET hm_brdr.gif 
2 - -  GET s102382.gif                ( "1", {"- -  GET hm_brdr.gif"})
3 - -  GET bg_stars.gif  map-reduce-> ( "2", {"- -  GET s102382.gif"}) 
3 - -  GET phrase.gif                 ( "3", {"- -  GET bg_stars.gif", "- -  GET phrase.gif"})

I want to anonymize the first column's values (1, 2, 3, ...) by replacing them with random integers. The mapping has to be consistent, though: 1 must not become x on one line and t on another. So my solution is to replace the keys with random integers (rand(1)=x, rand(2)=y, ...) in the reduce step, then ungroup the values under their new keys and write them back to files, as shown below.

output file

x - -  GET hm_brdr.gif 
y - -  GET s102382.gif       
z - -  GET bg_stars.gif    
z - -  GET phrase.gif

My question is: is there a better way of doing this in terms of running time?

score 0 · Answer 1 · edited May 23 '17 at 10:29

There is no way this is a bottleneck to your MapReduce job. More precisely, the runtime of your job is dominated by other concerns (network and disk I/O, etc.). A quick little key function? Meh.

But that's not even the biggest issue with your proposal. The biggest issue with your proposal is that it's doomed to fail. What is a key fact about keys? They serve as unique identifiers for records. Do random number generators guarantee uniqueness? No.

In fact, pretend for just a minute that your random key space has 365 possible values. It turns out that if you generate a mere 23 random keys, you are more likely than not to have a key collision; welcome to the birthday paradox. And all of a sudden, you've lost the whole point of the keys in the first place, because you've started smashing records together by giving the same key to two records that shouldn't share one!

And you might be thinking, well, my key space isn't as small as 365 possible keys, it's more like 2^32 possible keys, so I'm, like, totally in the clear. No. After approximately 77,000 keys you're more likely than not to have a collision.
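If you want to check those two numbers yourself, here is a minimal sketch that computes the birthday bound directly. The class and method names (BirthdayBound, collisionProbability, keysForEvenOdds) are made up for illustration, and it assumes keys are drawn uniformly and independently from a space of d values.

```java
/** Hypothetical helper illustrating the birthday bound; not from any library. */
class BirthdayBound {

    /** Probability that n uniformly random keys from a space of d values contain a collision. */
    static double collisionProbability(long n, double d) {
        if (n > d) {
            return 1.0; // pigeonhole: more keys than values
        }
        // Sum logs of (1 - i/d) instead of multiplying, to avoid underflow for large n.
        double logNoCollision = 0.0;
        for (long i = 0; i < n; i++) {
            logNoCollision += Math.log1p(-i / d);
        }
        return 1.0 - Math.exp(logNoCollision);
    }

    /** Smallest number of keys for which a collision is at least as likely as not. */
    static long keysForEvenOdds(double d) {
        double logNoCollision = 0.0;
        long n = 0;
        while (logNoCollision > Math.log(0.5)) {
            logNoCollision += Math.log1p(-n / d);
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(23, 365));
        System.out.println(keysForEvenOdds(365));
        System.out.println(keysForEvenOdds(Math.pow(2, 32)));
    }
}
```

The threshold grows roughly with the square root of the key space, which is why going from 365 values to 2^32 values only buys you tens of thousands of keys rather than billions.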

Your idea is just completely untenable because it's the wrong tool for the job. You need unique identifiers. Random doesn't guarantee uniqueness. Get a different tool.

In your case, you need a function that is injective on your input key space (that is, it guarantees that f(x) != f(y) if x != y). You haven't given me enough details to propose anything concrete, but that's what you're looking for.
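To make "injective" concrete, here is a toy sketch of a mapping on 32-bit integer keys that cannot collide. Everything in it (the InjectiveKeyMap class, the xorKey and oddMultiplier parameters) is hypothetical and assumes your keys fit in an int. XOR with a constant is a bijection, and so is multiplication by an odd number modulo 2^32, so the composition is too. It only demonstrates the no-collision property: anyone who learns the two constants can invert it, so it says nothing about how strong the anonymization is.

```java
/** Hypothetical sketch: a bijection on 32-bit integers, so distinct keys never collide. */
class InjectiveKeyMap {
    private final int xorKey;        // assumed to be chosen once per job and kept private
    private final int oddMultiplier; // must be odd to be invertible modulo 2^32

    InjectiveKeyMap(int xorKey, int oddMultiplier) {
        if ((oddMultiplier & 1) == 0) {
            throw new IllegalArgumentException("multiplier must be odd");
        }
        this.xorKey = xorKey;
        this.oddMultiplier = oddMultiplier;
    }

    /** Same input always gives the same output; different inputs give different outputs. */
    int map(int key) {
        // Java int multiplication wraps around, i.e. it is arithmetic modulo 2^32.
        return (key ^ xorKey) * oddMultiplier;
    }

    public static void main(String[] args) {
        InjectiveKeyMap f = new InjectiveKeyMap(0x5bd1e995, 0x9e3779b1);
        for (int key = 1; key <= 3; key++) {
            System.out.println(key + " -> " + f.map(key));
        }
    }
}
```

Because the function is deterministic, it can run in the mapper on each line independently; no grouping by key is needed to keep the mapping consistent.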

And seriously, there is no way that performance of this function will be an issue. Your job's runtime really will be completely dominated by other concerns.

Edit:

To respond to your comment:

here i am actually trying to make the ip numbers anonymous in the log files, so if you think there is a better way i ll be happy to know.

First off, we have a serious XY problem here. You should have asked that question, or searched for answers to it, in the first place. Anonymizing IP addresses, or anything for that matter, is hard. You haven't even told us the criteria for a "solution" (e.g., who are the attackers?). I recommend taking a look at this answer on the IT Security Stack Exchange site.

i understand, i do this in reduce step so keys are already unique but random values doesn't guarantee it as you pointed out. thanks. — likeaprogrammer, Jul 20 '13 at 20:04

score 0 · Answer 2 · edited Jul 20 '13 at 19:10


If you want to assign a random integer to a key value then you'll have to do that in a reducer where all key/value pairs for that key are gathered in one place. As @jason pointed out, you don't want to assign a random number since there's no guarantee that a particular random number won't be chosen for two different keys. What you can do is just increment a counter held as an instance variable on the reducer to get the next available number to associate with a key. If you have a small amount of data then a single reducer can be used and the numbers will be unique. If you're forced to use multiple reducers then you'll need a slightly more complicated technique. Use

context.getTaskAttemptID().getTaskID().getId()

to get a unique reducer number with which to calculate an overall unique number for each key.
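One way to combine the per-reducer counter with the reducer number is sketched below. This is plain Java with no Hadoop classes so it can run on its own; ReducerIdAssigner, reducerId and numReducers are hypothetical names. In a real job the reducer number would come from the call above, and the total number of reducers is assumed to be known (for example from the job configuration). Each reducer hands out the sequence reducerId, reducerId + numReducers, reducerId + 2 * numReducers, and so on, so no two reducers can produce the same number.

```java
/** Hypothetical stand-in for the state a reducer would keep as instance variables. */
class ReducerIdAssigner {
    private final int reducerId;   // in a real job: the reducer number from the task ID
    private final int numReducers; // assumed known up front
    private long counter = 0;      // how many keys this reducer has seen so far

    ReducerIdAssigner(int reducerId, int numReducers) {
        if (reducerId < 0 || reducerId >= numReducers) {
            throw new IllegalArgumentException("reducerId must be in [0, numReducers)");
        }
        this.reducerId = reducerId;
        this.numReducers = numReducers;
    }

    /** Call once per reduce() invocation, i.e. once per distinct key. */
    long nextId() {
        long id = counter * numReducers + reducerId;
        counter++;
        return id;
    }

    public static void main(String[] args) {
        ReducerIdAssigner r0 = new ReducerIdAssigner(0, 2);
        ReducerIdAssigner r1 = new ReducerIdAssigner(1, 2);
        System.out.println(r0.nextId() + " " + r0.nextId()); // ids handed out by reducer 0
        System.out.println(r1.nextId() + " " + r1.nextId()); // ids handed out by reducer 1
    }
}
```

Since each key reaches exactly one reduce() call, no lookup table is needed: the reducer takes one new ID per key and writes every value for that key out under it.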

answered Jul 20 '13 at 17:47 by Chris Gerken · edited Jul 20 '13 at 19:10 by Trikaldarshiii

thank you, i was already doing this in the reduce step. now i know i can't use random numbers. here i am actually trying to make the ip numbers anonymous in the log files, so if you think there is a better way i ll be happy to know. – likeaprogrammer Jul 20 '13 at 20:07
@proofmoore: Please see the edit I made to my [answer](http://stackoverflow.com/a/17764505/45914). – jason Jul 20 '13 at 20:55
