-1

I'm trying to come up with a data masking technique that involves replacing the actual data with reversible fake data.

For example:

If my data consists of the string 'Hello', I'd like to mask it as 'Hi' and then be able to revert it to the original string 'Hello' using a key or some algorithm.

'Hello' --- Mask---> 'Hi' --- unMask---> 'Hello'

I've done some research and found the Fisher–Yates shuffle algorithm, which would probably work in my case.

I've thought of implementing it by shuffling the string and then reverting the shuffle using something like a key.

'Hello' --- Mask---> 'elloH' --- unMask---> 'Hello'

However, I'm not quite sure how to implement this approach.

Please advise.

The Singularity
  • Does it need to be per word? Otherwise, you can also use rot13 for a start, which works on a per-character basis. – 9769953 Oct 19 '21 at 07:15
  • Maybe use something like a Caesar cipher? https://stackoverflow.com/questions/8886947/caesar-cipher-function-in-python – Jan Willem Oct 19 '21 at 07:16
  • You are basically asking for encryption if I understood correctly – Dani Mesejo Oct 19 '21 at 07:24
  • Depending on your data size, maintaining a dictionary of `fakeValue => realValue` may help and would probably be simpler than a cryptographic approach. I think a bit of context would help to point you in the right direction. – AsTeR Oct 19 '21 at 07:35
  • @9769953 I was thinking of using it per word – The Singularity Oct 19 '21 at 07:37
  • How about using AES in CTR mode which does not change the length of the buffer and is very fast on modern hardware. Encrypt each line separately by joining words of given line (and storing length of each word appended to buffer) then encrypt and split the output to pieces of the same lengths as input words. – Mr. Girgitt Oct 19 '21 at 07:38
  • @DaniMesejo, Yes an encryption-decryption approach where the underlying data is hidden by fake data of a similar kind would be great! – The Singularity Oct 19 '21 at 07:38
  • @AsTeR I've thought about maintaining a dictionary, but I'm dealing with Big Data here, I don't think that it would be feasible – The Singularity Oct 19 '21 at 07:39
  • @Mr.Girgitt I'm very new to cryptography, would you mind elaborating your answer, perhaps with a minimum reproducible example? – The Singularity Oct 19 '21 at 07:41
  • "Big Data" is a meaningless term without numbers. But what *actually* do you want to achieve? Mask some text? For what purpose? Security, obscurity, anonymization? – 9769953 Oct 19 '21 at 07:49
  • I would say millions of records and yes for security, obscurity and anonymization goals. Data would mainly consist of Personally identifiable information (PII) data – The Singularity Oct 19 '21 at 07:53
  • Hmm, obfuscation, security, and anonymization are very different concepts. When data is only obfuscated, someone with full access to the obfuscating code could recover the original data (rot13 is an example of obfuscation). When data is securely encrypted, full knowledge of the encryption algorithm is not enough to get the plaintext; a decryption key is also required. And when data is anonymized, it is impossible to recover the identifying information (it has been fully destroyed). You have to choose only one... – Serge Ballesta Oct 19 '21 at 08:07
  • And except for obfuscation, the rule is *do not roll your own* unless you are a true security expert. Security is highly complex, and professionals only trust well-known implementations of well-known algorithms, because the devil can hide in the implementation details... (BTW, security experts know that they cannot produce secure encryption without extensive peer review...) – Serge Ballesta Oct 19 '21 at 08:13
  • @SergeBallesta according to your description I was looking for a combination of encryption-decryption and anonymization. – The Singularity Oct 19 '21 at 08:16
  • Anonymization is the easy part: just use a non-reversible hash. You can find some in the hashlib module (search the official Python documentation for *cryptography*). Make sure the non-anonymized fields do not allow anyone to guess the anonymized part. An example is anonymizing the names but not the town, where at least one *town* has fewer than 10 inhabitants (it is rather common in France...). ... – Serge Ballesta Oct 19 '21 at 08:28
  • ... Securely encrypting data in a database is much more complex, because you must decide where the decryption key will be stored. If it is stored *in the system hosting the database*, you only have obfuscation. So you must think about what the threats are and whom you can trust. Many professional systems only ensure *security at rest*: a full copy of the database is still securely encrypted, but the running database is only *obfuscated* because the application requires access to the decryption key... – Serge Ballesta Oct 19 '21 at 08:33
  • ... Fully secure encryption of a running database can only be achieved if the encryption and decryption occur client side. I am sorry, but security is not a piece of cake. I cannot give a true answer without full knowledge of your real problem, and that would be impossible on a simple Q&A site like SO. If your database only contains the names of your cousins and friends, you are on your own. If it contains mission-critical or personally identifying health data, you really should consult (or hire) experts. I know that this is definitely not an answer, hence I only commented. – Serge Ballesta Oct 19 '21 at 08:38
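For the anonymization side discussed in the comments above, a minimal sketch might look like the following. It uses a keyed hash (HMAC-SHA256) from the standard library rather than a bare hashlib digest; that choice is an assumption on my part, made because names and similar PII are easy to guess and re-hash when no secret is involved. The function name `anonymize` and the `secret` parameter are purely illustrative.

```python
import hashlib
import hmac


def anonymize(value: str, secret: bytes) -> str:
    """Return an irreversible, repeatable token for a value (hypothetical helper).

    The same value and secret always give the same token, so records can
    still be joined on the masked column, but the original text cannot be
    recovered from the token.
    """
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    token = anonymize("Hello", b"a secret kept outside the database")
    print(token)  # 64 hexadecimal characters
```

Note that this is one-way by design, so it covers anonymization only; it does not provide the unmasking step asked for in the question.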

3 Answers

2

Just following up on my comment: here is (an overkill of) an example using AES encryption in CTR mode. In the comment, I stated that the encryption keeps the plaintext's length unchanged, but that holds only for the binary output. If the text needs to be printable, the example here converts it to hex, which doubles the length. Encrypting line by line probably makes no sense, but this example should give an idea of how to accomplish the goal. The encryption/decryption methods could probably be swapped for something like the Lorenz cipher for simplicity if security is not the goal of this kind of fake-data masking.

EDIT: minimum running example for python 3.6 with cryptography module installed is available here: https://gist.github.com/Girgitt/7cbfe8e6ffdcf7eba333c348cdcd1642

EDIT: examples fixed for py3.6 (initially made for py2.7)

from unittest import TestCase
from FileContainingEncryptionClass import Encryption


class test_Encryption(TestCase):
    def test_plaintext_encryption(self):
        plaintext = 'some words to encrypt'
        words_lengths = [len(item) for item in plaintext.split(" ")]

        plaintext_joined = plaintext.replace(" ", "")
        encryptor = Encryption('some key', 'some nonce')
        encryptor.init_encryption()
        encryptor.update_payload_to_encrypt(bytearray(plaintext_joined, 'utf8'))
        cipher_as_text = ''.join(format(item, '02x') for item in encryptor.encrypted_payload)
        self.assertEqual("c8638dd3ee70e8a7bf9c1c943507fe61b8cb", cipher_as_text)
        split_encrypted_in = []
        for word_len in words_lengths:
            split_encrypted_in.append(cipher_as_text[:2 * word_len])
            cipher_as_text = cipher_as_text[2 * word_len:]
        split_encrypted = " ".join(split_encrypted_in)
        self.assertEqual("c8638dd3 ee70e8a7bf 9c1c 943507fe61b8cb", split_encrypted)

        decryptor = Encryption('some key', 'some nonce')
        decryptor.init_decryption()
        joined_encrypted = split_encrypted.replace(" ", "")
        self.assertEqual("c8638dd3ee70e8a7bf9c1c943507fe61b8cb", joined_encrypted)
        binary_encrypted = bytearray.fromhex(joined_encrypted)

        decryptor.update_payload_to_decrypt(binary_encrypted)
        plaintext_joined = decryptor.decrypted_payload.decode('utf8')
        self.assertEqual("somewordstoencrypt", plaintext_joined)

        plaintext_words = []
        plaintext_words_lengths = [len(item) // 2 for item in split_encrypted.split(" ")]
        self.assertEqual([4, 5, 2, 7], plaintext_words_lengths)

        for word_len in plaintext_words_lengths:
            plaintext_words.append(plaintext_joined[:word_len])
            plaintext_joined = plaintext_joined[word_len:]

        decrypted_plaintext = ' '.join(plaintext_words)

        self.assertEqual("some words to encrypt", decrypted_plaintext)

The example code uses the Encryption class:

from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


class Encryption(object):
    def __init__(self, key='aKeyNobodyWIllEverUse', nonce='PleaseMakeMeRandomEachTime'):
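        # Repeat the key until it has at least 32 characters, then truncate,
        # so AES always receives a 256-bit key. Illustration only: this assumes
        # a non-empty ASCII key and is no substitute for a real key-derivation function.
        key = str(key)
        while len(key) < 32:
            key += key
        key = bytearray(key[:32], 'utf8')

        # CTR mode needs a 16-byte nonce; pad and truncate it the same way.
        nonce = str(nonce)
        while len(nonce) < 16:
            nonce += nonce
        nonce = bytearray(nonce[:16], 'utf8')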

        backend = default_backend()

        self._cipher = Cipher(algorithms.AES(key), modes.CTR(nonce), backend=backend)
        self._encryptor = None
        self._encrypted_payload = None
        self.init_encryption()
        self._decryptor = None
        self._decrypted_payload = None
        self.init_decryption()

    def init_encryption(self):
        self._encryptor = self._cipher.encryptor()
        self._encrypted_payload = None

    def update_payload_to_encrypt(self, payload):
        if self._encryptor:
            self._encrypted_payload = self._encryptor.update(payload)

    @property
    def encrypted_payload(self):

        if self._encrypted_payload:
            return self._encrypted_payload

        return b''

    def init_decryption(self):
        self._decryptor = self._cipher.decryptor()
        self._decrypted_payload = None

    def update_payload_to_decrypt(self, payload):
        if self._decryptor:
            self._decrypted_payload = self._decryptor.update(payload)

    @property
    def decrypted_payload(self):
        if self._decrypted_payload:
            return self._decrypted_payload

        return b''
Mr. Girgitt
  • Would you mind including a working example of the code in your answer as well? – The Singularity Oct 19 '21 at 08:53
  • the test itself is the working example. Ok it is lacking import statement for the Encryption class but the import depends on the name of the file you put the Crypto class to. The "split_encrypted" string is the encrypted output. – Mr. Girgitt Oct 19 '21 at 08:55
  • I am confused on how to execute your code at the moment, would you mind adding the driver code? – The Singularity Oct 19 '21 at 09:07
  • @Luke I added a gist, https://gist.github.com/Girgitt/7cbfe8e6ffdcf7eba333c348cdcd1642, with a simple example. However, this code is extremely inefficient. Real "production"-like code should encrypt whole files and not single lines. If you don't need cryptography, follow the bijective-function-based approach (beautifully) presented by Neil – Mr. Girgitt Oct 19 '21 at 09:42
  • `TypeError: key must be bytes-like` on the gist – The Singularity Oct 19 '21 at 09:46
  • @Luke the gist is updated to work with py3.6 – Mr. Girgitt Oct 19 '21 at 10:06
1

What you want is a bijective function on the set of strings. Shuffling letters is the easiest way to make it bijective because a re-arrangement can always be reversed. So words are all going to be mapped to words of the same length. A re-arrangement can be described by the change in the index of each character.

Here's an approach that shuffles with a key. This isn't any sort of encryption; I see encryption discussed in the comments, but it wasn't in the original question. Don't use this if security matters.

import random

# Build one shuffle key (a permutation of indices) for each possible string length.

keys = {}
# Pneumonoultramicroscopicsilicovolcanoconiosis, the longest word in English, has 45 characters.
for i in range(2, 46):  # lengths 2..45; strings of length 0 or 1 cannot be shuffled
    key = list(range(i))
    # Reshuffle if the result happens to be the identity permutation,
    # which would leave the word unchanged.
    while key == list(range(i)):
        random.shuffle(key)
    keys[i] = key
    
# Mask function
def mask(word: str):
    if len(word) < 2:  # nothing to shuffle, and no key exists for these lengths
        return word
    key = keys[len(word)]
    # Move the character at position j to position key[j].
    new_word_characters = [""] * len(word)
    for i, character in zip(key, word):
        new_word_characters[i] = character
    return "".join(new_word_characters)

# unMask function
def unmask(word: str):
    if len(word) < 2:
        return word
    key = keys[len(word)]
    # Invert the move: position k of the original holds the masked character at key[k].
    return "".join(word[i] for i in key)
  
  
  
  
mask('Hello')  # e.g. 'leHlo'; the exact result depends on the randomly generated key
unmask(mask('Hello'))  # 'Hello'
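The keys above are regenerated at random on every run, so a word masked in one process cannot be unmasked in another unless the `keys` dictionary is saved somewhere. One possible variation, sketched here under my own assumptions, derives each permutation from a secret string by seeding a private `random.Random` instance (whose `shuffle` is a Fisher–Yates shuffle in CPython). The names `permutation_for`, `mask_with_key` and `unmask_with_key` are hypothetical, and the same caveat applies: this is reversible obfuscation, not encryption.

```python
import random


def permutation_for(secret: str, length: int) -> list:
    """Derive a repeatable permutation of range(length) from a secret string."""
    # String seeds are hashed deterministically, so the same secret and
    # length give the same permutation on every run.
    rng = random.Random(f"{secret}:{length}")
    perm = list(range(length))
    rng.shuffle(perm)
    return perm


def mask_with_key(word: str, secret: str) -> str:
    """Move the character at position src to position perm[src]."""
    perm = permutation_for(secret, len(word))
    out = [""] * len(word)
    for src, dst in enumerate(perm):
        out[dst] = word[src]
    return "".join(out)


def unmask_with_key(masked: str, secret: str) -> str:
    """Undo mask_with_key by reading each character back from perm[src]."""
    perm = permutation_for(secret, len(masked))
    return "".join(masked[dst] for dst in perm)


if __name__ == "__main__":
    masked = mask_with_key("Hello", "my secret")
    print(masked, "->", unmask_with_key(masked, "my secret"))
```

Unlike the version above, this sketch does not reject the identity permutation, so a short word can occasionally come out unchanged. The round trip also relies on `shuffle` behaving identically on the masking and unmasking side, which I would only count on within the same Python version.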
The Singularity
Neil
0

The first thing that comes to mind is to encode and then decode the string.
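One reading of "encode and then decode" is a standard text encoding such as Base64. The sketch below assumes that interpretation; the helper names `encode` and `decode` are illustrative only. Anyone can reverse Base64 without a key, so this hides the text from a casual glance and nothing more.

```python
import base64


def encode(text: str) -> str:
    """Encode text as Base64 (obfuscation only, no key involved)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode(token: str) -> str:
    """Reverse encode() and return the original text."""
    return base64.b64decode(token).decode("utf-8")


if __name__ == "__main__":
    token = encode("Hello")
    print(token)          # SGVsbG8=
    print(decode(token))  # Hello
```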

Or something fun like this, though it can easily be cracked:

text = "String to encode"
print(text)

masked_text = text[::-1]  # reverse the string
print(masked_text)

original_text = masked_text[::-1]  # reversing again restores the original
print(original_text)

Output:
String to encode
edocne ot gnirtS
String to encode

Gedas Miksenas