11

As required by law in several countries, we anonymize the IP addresses of our users in our log files. For IPv4 we usually just anonymize the last two bytes, e.g. instead of 255.255.255.255 we log 255.255.\*.\*

What algorithm would you recommend to anonymize IPv6 addresses?

Yves M.
tec

2 Answers

14

At the very least you want to strip off the interface identifier (often derived from the EUI-64), i.e. the last 64 bits of the address. More realistically, you want to strip quite a lot more to really be private, since the remaining part still identifies a single subnet (i.e. possibly one house).

IPv6 global addressing is very hierarchical, from RFC2374:

 | 3|  13 | 8 |   24   |   16   |          64 bits               |
 +--+-----+---+--------+--------+--------------------------------+
 |FP| TLA |RES|  NLA   |  SLA   |         Interface ID           |
 |  | ID  |   |  ID    |  ID    |                                |
 +--+-----+---+--------+--------+--------------------------------+
 <--Public Topology--->   Site
                       <-------->
                        Topology
                                 <------Interface Identifier----->

The question becomes: how private is private enough? Strip 64 bits and you've identified a LAN subnet, not a user. Strip another 16 on top of that and you've identified a small organisation, i.e. a customer of an ISP, e.g. a company or branch office with several subnets. Strip the next 24 off and you've basically identified only an ISP or a really big organisation.

You can implement this with a bitmask, exactly as you would for an IPv4 address. At that point, though, the question is a legal one ("how much do I need to strip to comply with the specific legislation?"), not a technical one.
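As a minimal sketch of the bitmask approach, the following Python snippet zeroes every bit after a chosen prefix length using the standard `ipaddress` module. The helper name `anonymize_ip` and the default prefix lengths (16 bits for IPv4, 48 bits for IPv6) are assumptions for illustration only; how many bits to keep is the legal question discussed above.

```python
import ipaddress

def anonymize_ip(addr: str, v4_prefix: int = 16, v6_prefix: int = 48) -> str:
    """Zero every bit after the kept prefix (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    # Assumed defaults: keep 16 bits of IPv4, 48 bits of IPv6.
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False allows an address with host bits set; they get masked off.
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(net.network_address)

full = "2001:0db8:85a3:0000:0000:8a2e:0370:7334"
print(anonymize_ip("255.255.255.255"))     # 255.255.0.0
print(anonymize_ip(full))                  # 2001:db8:85a3::
print(anonymize_ip(full, v6_prefix=24))    # 2001:d00::
```

Passing `v6_prefix=64` would strip only the interface identifier, while `v6_prefix=24` keeps just the FP, TLA and RES fields from the diagram.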

Flexo
  • Thanks, @awoodland, that's the answer I have been hoping for. So I guess a safe approach is stripping the NLA, SLA and Interface IDs, i.e. only keep the first 24 bits. One could even strip the Reserved bits as they are zero anyway (thanks for the link to the RFC) so we'd keep two bytes when using IPv4 as well as when using IPv6. – tec May 23 '11 at 14:54
  • 2
    If you only keep 16 bits of a v6 address what you have is almost useless, for example look at the first 16 bits of addresses of production v6 sites listed in this directory: http://sixy.ch/ – Flexo May 23 '11 at 14:58
  • Sounds reasonable. Hm. Maybe a better approach is to keep the first byte of each of the sections. I guess we should discuss internally why we want to keep some of the bits anyway. Thanks for your help! – tec May 23 '11 at 15:11
  • 6
    @tec: Remember, instead of throwing away the data you can always hash it (plus a seed which you throw away after you are done). This prevents being able to find the source but (if done carefully) allows relationships to be preserved (e.g. know that these two addresses came from the same /64, or that these two may have come from the same /48 company, or...). You could hash, for example, the interface id by the public+site+seed bits, and hash the SLA by the public+seed, and hash the NLA by the RES+TLA+FP+seed, etc. Also make sure you cannot deduce the seed with a too-small result space. – Seth Robertson May 24 '11 at 04:06
  • I "hash" IP addresses by setting last few groups of 16 bits to its remainder after division by 16: ip[3] = ip[3] % 16; ... – Spikolynn May 07 '15 at 20:44
  • 1
    WRT "how much to strip", Google Analytics anonymizes IP addresses by zeroing the last octet of an IPv4 address and the **last 80 bits** (SLA ID + Interface ID) of an IPv6 address, per ["IP Anonymization (or IP masking) in Analytics"](https://support.google.com/analytics/answer/2763052?hl=en) (accessed 2020-11-12). – Jeremy W. Sherman Nov 12 '20 at 17:58
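The seeded-hash idea from Seth Robertson's comment above could be sketched roughly as follows. This is a simplified variant, not the exact scheme from the comment: it emits one keyed token per level (/48, /64, full address), each computed over all address bits down to that level, so equal tokens indicate a shared prefix. The names `make_seed` and `pseudonymize_v6`, the use of HMAC-SHA256, and the 12-hex-digit truncation are all assumptions.

```python
import hashlib
import hmac
import ipaddress
import secrets

def make_seed() -> bytes:
    """Random per-batch seed; discard it once the batch is processed."""
    return secrets.token_bytes(32)

def _token(seed: bytes, data: bytes) -> str:
    # Keyed hash, truncated to 48 bits for readability (assumed length).
    return hmac.new(seed, data, hashlib.sha256).hexdigest()[:12]

def pseudonymize_v6(addr: str, seed: bytes) -> str:
    """Return '<site>-<subnet>-<host>' tokens for an IPv6 address."""
    raw = ipaddress.IPv6Address(addr).packed  # 16 bytes
    # Prefix lengths in bytes: 6 = /48 site, 8 = /64 subnet, 16 = full address.
    return "-".join(_token(seed, raw[:n]) for n in (6, 8, 16))

seed = make_seed()
a = pseudonymize_v6("2001:db8:85a3::1", seed)
b = pseudonymize_v6("2001:db8:85a3::2", seed)
# Same /64, so the first two tokens agree while the host token differs.
print(a.split("-")[:2] == b.split("-")[:2])
```

Once the seed is thrown away, the tokens can still be compared with each other within the batch, but they can no longer be recomputed from a candidate address.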
0

To anonymize public IPv6 addresses you could keep the first 2 groups and replace the remaining part with its CRC-16. Some examples (where abc1 and abc2 stand for CRC-16 values):

  • 2001:0db8:85a3:0000:0000:8a2e:0370:7334 -> 2001:0db8-abc1
  • 2a02:200:7::123 -> 2a02:200-abc2

Such shortening allows easy matching via the first 2 groups (with some probability, of course) against non-anonymized IPv6 addresses in full logs that have a shorter retention time, which helps when investigating problems or security incidents.
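A minimal sketch of this shortening in Python might look like the following. The answer does not say which CRC-16 variant is meant, so the use of `binascii.crc_hqx` (the CRC-CCITT polynomial) with an initial value of 0 is an assumption, as is the helper name `shorten_v6`. Unlike the second example above, this sketch zero-pads both groups.

```python
import binascii
import ipaddress

def shorten_v6(addr: str) -> str:
    """Keep the first 2 groups, replace the other 96 bits with a CRC-16."""
    raw = ipaddress.IPv6Address(addr).packed  # 16 bytes, canonical form
    head = f"{raw[0:2].hex()}:{raw[2:4].hex()}"
    # Assumed variant: CRC-CCITT via crc_hqx, initial value 0.
    crc = binascii.crc_hqx(raw[4:], 0)
    return f"{head}-{crc:04x}"

print(shorten_v6("2001:0db8:85a3:0000:0000:8a2e:0370:7334"))
print(shorten_v6("2a02:200:7::123"))
```

Because the CRC is computed over the packed bytes rather than the text, different spellings of the same address (compressed or fully expanded) produce the same shortened value.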

Vlad Rudenko
  • Nice idea, however is that good enough? You could build a rainbow table for that crc16. – rekire Oct 14 '19 at 06:24
  • 1
    The CRC-16 is taken for 96 bits of data. So in the rainbow table, one CRC-16 value will point to 2^80 possible IPv6 addresses. Should be enough for anonymizing ;-) – Vlad Rudenko Oct 19 '19 at 09:27