Hi all, I am reposting this question because my previous version violated the Stack Overflow rules.
I want to create a Python script that masks/anonymizes the information in each CSV column without removing its content, because the data will be used for further analysis and statistical modelling. The data mostly contains user IDs, project IDs, customer IDs, customer addresses, customer names, order types, and email addresses. I am stuck at my current stage and would like to make the process more efficient.
- How can I make this process more scalable? That is, rather than writing a separate script for each CSV file, how can I apply one script to every CSV file without rewriting it from scratch?
My current approach: I handle each column one by one. For example, for the user ID, I replace each unique value with a fixed prefix followed by a counter (for example, user ID 1234 in the first row is replaced by user_0).
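One limitation of counter-based labels is that they depend on the order in which values first appear, so the same ID can receive different labels in different files. A minimal sketch of a keyed-hash alternative follows, assuming it is acceptable for labels to be hash fragments rather than counters. `SECRET_KEY`, `pseudonymize`, and the two small frames are hypothetical names and data used only for illustration.

```python
import hashlib
import hmac

import pandas as pd

# Hypothetical secret; in practice it would be kept outside the script.
SECRET_KEY = b"replace-with-a-private-key"


def pseudonymize(value, prefix, key=SECRET_KEY, length=8):
    """Return prefix + a short keyed hash of value (same input gives same output)."""
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}{digest[:length]}"


# Two hypothetical files that share user 1234 but list it at different positions.
file_a = pd.DataFrame({"user_id": [1234, 5678]})
file_b = pd.DataFrame({"user_id": [9999, 1234]})

file_a["user_id"] = file_a["user_id"].map(lambda v: pseudonymize(v, "user_"))
file_b["user_id"] = file_b["user_id"].map(lambda v: pseudonymize(v, "user_"))
```

With this sketch, user 1234 gets the same label in both frames, so joins across files still work. Two caveats: truncating the digest to a few characters makes collisions possible on large datasets, and a value stored as 1234 in one file and 1234.0 in another would hash differently unless it is normalized first.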
Please give me some advice. I would like to discuss this so that I can find a more effective way to do it.
Edit: This is how the data looks (I hope I have put it in an acceptable format):
plant_id    project_id    plant_name       project_name                      address            customer_id    project type
----------  ------------  ---------------  --------------------------------  -----------------  -------------  --------------
15052.0     6496          Manufacturing    ASAHI,PT-PRO/PTN/06-2012/192      streetname-city    e8cfa43f       Individual
15052.0     6458          Manufacturing    CIMB NIAGA-PRO/PTN/06-2012/174    streetname-city    7b2bf5dc       Individual
15052.0     11441         Manufacturing    DM STOCK 2015                     streetname-city    dc0c9893       Corporate
An example of the expected output that I want to try first:
plant_id    project_id    plant_name       project_name           address            customer_id    project type
----------  ------------  ---------------  ---------------------  -----------------  -------------  --------------
123         1111          AAAAAAAAAAAAA    ABCDEFGHIJKLMNOPQ      XYXYXYXYXYXY       abcd1111       2
123         2222          AAAAAAAAAAAAA    FGHJKLMNABCDEFGHH      XYXYXYXYXYXY       abcd2222       2
123         3333          AAAAAAAAAAAAA    FGHFDGDGASDADAFAH      XYXYXYXYXYXY       abcd3333       3
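The expected output keeps repeated values repeated (all three rows share one plant_name) and turns project type into numbers. A minimal sketch of how such codes could be produced with `pd.factorize` follows. The small frame is a hand-made stand-in for the real data, and the specific numbers assigned will differ from the ones in the table.

```python
import pandas as pd

# Sample rows modelled on the table in the question (two columns, for brevity).
data = pd.DataFrame({
    "plant_name": ["Manufacturing", "Manufacturing", "Manufacturing"],
    "project type": ["Individual", "Individual", "Corporate"],
})

# Category frequencies before masking, for comparison afterwards.
before = data["project type"].value_counts().sort_values().tolist()

for column in ["plant_name", "project type"]:
    # factorize assigns 0, 1, 2, ... in order of first appearance;
    # the + 1 only shifts the codes so that they start at 1.
    codes, originals = pd.factorize(data[column])
    data[column] = codes + 1

# Category frequencies after masking.
after = data["project type"].value_counts().sort_values().tolist()
```

Because each distinct value maps to exactly one code, `before` and `after` hold the same counts, so frequency-based statistics on the masked column are unchanged.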
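And this is my current code:
import pandas as pd
# Map each distinct value to a sequential label (the + 1 makes labels start at 1); index=data.index keeps rows aligned with data
data['customer_id'] = 'user_' + pd.Series(pd.factorize(data['customer_id'])[0] + 1, index=data.index).astype(str)
data['project_id'] = 'Project_' + pd.Series(pd.factorize(data['project_id'])[0] + 1, index=data.index).astype(str)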
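The two lines above repeat one pattern with a different column name and prefix, so one possible way to reuse them across files is a function driven by a column-to-prefix mapping. This is only a sketch under that assumption: `RULES`, `anonymize_frame`, and `anonymize_folder` are hypothetical names, and the folder layout (every `*.csv` in one directory) is assumed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical rules: column name -> prefix for the anonymized label.
RULES = {"customer_id": "user_", "project_id": "Project_"}


def anonymize_frame(data, rules=RULES):
    """Return a copy with every column named in rules replaced by prefix + counter."""
    result = data.copy()
    for column, prefix in rules.items():
        if column not in result.columns:
            continue  # this file does not have the column; skip it
        # Missing values get code -1 from factorize, so they become prefix + "0".
        codes = pd.factorize(result[column])[0] + 1
        result[column] = prefix + pd.Series(codes, index=result.index).astype(str)
    return result


def anonymize_folder(input_dir, output_dir, rules=RULES):
    """Apply the same rules to every CSV in input_dir and write the results to output_dir."""
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    for csv_file in sorted(Path(input_dir).glob("*.csv")):
        anonymized = anonymize_frame(pd.read_csv(csv_file), rules)
        anonymized.to_csv(output_path / csv_file.name, index=False)
```

A call such as `anonymize_folder("raw", "masked")` would then process every CSV in a hypothetical `raw` directory. Handling a new file only means adding its column names to `RULES`, and columns a file does not contain are skipped.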