1

Hi All I reposted this question because my previous question violated the StackOverflow rules

I want to create a python script that can mask/anonymize the information inside each csv column without removing its content. Because the data will be used for further analysis and doing some statistical modelling. The data mostly contain user ID, project ID, Customer ID, address of the customer, name of the customer, order type, email address. I'm kinda stuck on the current progress as I wanted to make this process more effective

  1. How could I do this process more scalable, meaning I don't need to create a script for each CSV file but more into how could I use some technique to apply the script to every CSV files without rewriting from scratch?

My current approach: My approach right now is by dealing on each column one by one by doing something on it. For example the user ID, I replaced it with the additional string in front of the unique value ( for example since user ID 1234 in the first row, it gets replaced by user_0)

Please give me some advice and I would like to discuss so that I can do a more effective way

Edit: This how the data looks like (I hope I put it in the allowable format)

plant_id   project_id    plant_name              project_name                address       customer_id   project type  
 ---------- ------------ --------------- -------------------------------- ----------------- ------------- -------------- 
   15052.0         6496   Manufacturing   ASAHI,PT-PRO/PTN/06-2012/192     streetname-city   e8cfa43f      Individual    
   15052.0         6458   Manufacturing   CIMB NIAGA-PRO/PTN/06-2012/174   streetname-city   7b2bf5dc      Individual    
   15052.0        11441   Manufacturing   DM STOCK 2015                    streetname-city   dc0c9893      Corporate

The example of the expected output that I want to try first:

plant_id   project_id    plant_name       project_name          address       customer_id   project type  
 ---------- ------------ --------------- --------------------- ----------------- ------------- -------------- 
   123         1111     AAAAAAAAAAAAA     ABCDEFGHIJKLMNOPQ     XYXYXYXYXYXY   abcd1111      2    
   123         2222     AAAAAAAAAAAAA     FGHJKLMNABCDEFGHH     XYXYXYXYXYXY   abcd2222     2    
   123         3333     AAAAAAAAAAAAA     FGHFDGDGASDADAFAH     XYXYXYXYXYXY abcd3333       3

And this is my current code

data['customer_id'] = 'user_' + (pd.Series(pd.factorize(data['customer_id'])[0] + 1)).astype(str)
data['project_id'] = 'Project_' + (pd.Series(pd.factorize(data['project_id'])[0] + 1)).astype(str)
  • For this kind of review cases https://codereview.stackexchange.com/ can be a better address. But, even if you post there or here, you'd better share the code you're using right now. – vahdet Dec 12 '19 at 08:18
  • You could apply scikit-learns `LabelEncoder` to multiple columns - Check [this question](https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn) – Chris Adams Dec 12 '19 at 08:18
  • Hi, vahdet I have edited and added with a snippet of my dataset and the code that I do have right now. Chris, I got the idea for label encoder but it's only can handle the categorical column right? Let's say in future I have some continuous values. Because I'm thinking to create a more general solution where it' can be applied into most of the dataset. Anyways I'll consider using label encoder to handle the categorical value –  Dec 12 '19 at 09:06

1 Answers1

0

Note: this is an answer, hence the answer entry, but also entirely my opinion based on the limited info provided.

To be clear, my understanding of the issue:
- you want to anonymize the data in a table,
- but preserve the contents of each field individually
- and preserve the columns that the data belongs so that the data can still be used for statistics
- and you want to be able to undo the whole thing and return the data table to its original form.

Is this accurate?

If so, and even if I'm off the mark by a little, you could build a simple algorithm that uses all of the data in the table to create a numeric key. This way, the numeric key will be unique for each table of data (this numeric key is called a 'nonce'). Then, use the nonce to offset a set of rules that shuffle the fields within each column.

You would have "encrypt" and "decrypt" methods that do all the work. Anyone who sees/posesses the data would need to know exactly the set of rules and the nonce algorithm in order to decrypt the shuffle, which would only happen if you shared said info. And even then, if someone figured it out, you could just add a new element to the nonce algorithm, like including the file timestamp in the equation, which would completely change the output.

The best part of this is that if you have the knowledge about the cypher (the algorithms and rules), the nonce and everything else you need is contained and readily available within the data that you have obfuscated, so there is no chance of losing a passowrd or some other human error. Worst case, you lose the data, in which case the decryption key is useless anyway. :)


Read up on cryptography if you're not familiar with this whole process. Pretty typical stuff found in secure internet data, cryptocurrency, etc.

Share an actual data set if you want (or a fake one if the info is too sensitive), and I'll put together an example program. The fundamental elements I described in this answer make for a very small program. Probably about 10 (well thought out) lines of code to encrypt and decrypt.
Hope this helps!

Edit:: ADDENDUM-A If text data encryption is also required, e.g. completely garbling home addresses, this also fits just fine into the nonce algo idea.

befunkt
  • 131
  • 8
  • Hi befunkt, correct me if I'm wrong. So there is a solution to create some algorithm to create encryption and decryption of the information inside. But once the person can decrypt it, will the data that they see is the original data that is being shared or it goes into some data shuffling first? Yeah, the points that you shared is the solution that I would like to proceed with. Thanks for clarifying it to me. Yes, I can share the data but I don't know how to put in the proper tabular format in StackOverflow. –  Dec 12 '19 at 13:43
  • I've already put the snippet of the dataset, I hope 3 rows it should be allowable for the StackOverflow format –  Dec 12 '19 at 14:32
  • Hi Rizki, any of the above discussed options are possible regarding shuffling and/or encryption. You will need to explain the exact output you want to see again please. I see the 3 lines of data you provided, but can you also take those 3 lines and create a table of the desired output (or tables, if you want multiple unique output states)? For text encryption of the data, you can just make up false characters e.g. "Mary Poppins" could become "sK4,>psOl09.2'?" in the output table – befunkt Dec 13 '19 at 12:11
  • Hi @befunkt, thanks for clarifying my doubt. What I want to get in the expected output is the random number and alphabet in each column while retaining its original value once it's being decrypted. Why I want to do this is because it will help the other company that got this data for further analysis. The random number and alphabet don't have to be same for each similar userID, projectID, or plantID. For example the first plantID of 15052.0 doesn't have to be same with the second plantID of 15052.0. I hope you got what the expected output that I meant –  Dec 13 '19 at 14:17
  • I see. This was not what I had in mind before. It sounds like you just need to encrypt your data. Correct? In what format do you send the data to another company? .xlsx ? – befunkt Dec 16 '19 at 00:48
  • Yes in the .csv or .xlsx file will do –  Dec 16 '19 at 02:24
  • Microsoft Excel has a password protection feature, which encrypts the file. Then, any user with the file and password can decrypt the file, using Excel. Does this satisfy your requirements? – befunkt Dec 19 '19 at 01:06
  • sorry there is some slightly change because what I want to do now is only to anonymize the ID columns such as User ID, Customer ID, Transaction ID, project ID. Once the ID is being altered it doesn't need to be reopened or reaccessed to get the original ID –  Dec 20 '19 at 14:55