I am trying to use a data anonymizer that I found in this github. I want to anonymize a DataFrame of 2,000 records, but doing it sequentially takes a long time (about 10 minutes for 100 records). I found this that uses the swifter package, so I wrote this:
from src.presidio_handler import PresidioHandler
import pandas as pd
import swifter  # noqa: F401  (registers the .swifter accessor on Series)


def anonymize(text, language="en"):
    return PresidioHandler().anonymize_text(text=text, language=language)['text']


def anonymize_column(df, column, language="es"):
    df_anonymized = df.copy()
    # Apply anonymization to the values using swifter
    df_anonymized[f'{column}_anon'] = (
        df_anonymized[column]
        .swifter.allow_dask_on_strings(enable=True)
        .apply(anonymize, language=language)
    )
    return df_anonymized


if __name__ == '__main__':
    df = pd.read_csv('data.csv')
    df = df.drop_duplicates(ignore_index=True)
    df = anonymize_column(df, 'Nombre del evento')
    df.to_csv('annon_df.csv', index=False)
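To check that the column wiring itself is fine, I ran the same pattern sequentially on a tiny frame with a dummy scrubber in place of PresidioHandler (which I can't share here; `fake_anonymize` and the sample data are just mine for illustration), and it behaves as expected:

```python
import re

import pandas as pd


def fake_anonymize(text, language="en"):
    # Dummy stand-in for PresidioHandler().anonymize_text(...)['text']:
    # just masks anything that looks like an email address.
    return re.sub(r"\S+@\S+", "<EMAIL>", str(text))


def anonymize_column(df, column, language="es"):
    df_anonymized = df.copy()
    # Plain sequential .apply instead of .swifter.apply
    df_anonymized[f"{column}_anon"] = df_anonymized[column].apply(
        fake_anonymize, language=language
    )
    return df_anonymized


df = pd.DataFrame({"Nombre del evento": ["contactar a bob@mail.com", "sin datos"]})
out = anonymize_column(df, "Nombre del evento")
print(out["Nombre del evento_anon"].tolist())
# ['contactar a <EMAIL>', 'sin datos']
```

So the slow/crashing part really is the anonymization call itself, not the DataFrame handling.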
But I get a blue screen. So I tried Tom Raz's solution in this question, and this is my code:
from src.presidio_handler import PresidioHandler
import pandas as pd
import os
import numpy as np
from multiprocessing import Pool
from functools import partial


def anonymize(text, language="en"):
    return PresidioHandler().anonymize_text(text=text, language=language)['text']


def parallelize(data, func, num_of_processes=8):
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data


def run_on_subset(func, data_subset):
    return data_subset.apply(func, axis=1)


def parallelize_on_rows(data, func, num_of_processes=8):
    return parallelize(data, partial(run_on_subset, func), num_of_processes)


if __name__ == '__main__':
    df = pd.read_csv('data.csv')
    df = df.drop_duplicates(ignore_index=True)
    column_name = 'Nombre del evento'
    df[f'{column_name}_anon'] = parallelize(df[column_name], anonymize, num_of_processes=8)
    df.to_csv('annon_df.csv', index=False)
But when I execute it, RAM, CPU, and disk all climb to 100% and I get a blue screen. I don't know what I'm doing wrong; I've followed the Stack Overflow answer and also asked ChatGPT, but nothing works. Does anyone know how to fix this, or how else I can parallelize this anonymization?
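In case it helps to show what I expect the pattern to look like end to end: below is a stripped-down, self-contained version of the chunked-Pool approach with a dummy scrub function in place of PresidioHandler, a per-worker `initializer` so the (presumably heavy) engine would be built once per process instead of once per row, and only 2 processes instead of 8. All the names here (`_init_worker`, `_scrub_chunk`, `anonymize_series`) are mine, just for illustration. This toy version runs cleanly on my machine, which makes me suspect the real crash is memory pressure from repeatedly constructing PresidioHandler (and whatever NLP model it loads) across 8 processes.

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd

_handler = None


def _init_worker():
    # Runs once per worker process. In the real code this would be:
    #   global _handler; _handler = PresidioHandler()
    # Here a trivial "engine" (uppercasing) stands in for it.
    global _handler
    _handler = str.upper


def _scrub_chunk(chunk):
    # chunk is a pandas Series; apply the per-process engine element-wise,
    # instead of handing the whole Series to the scrub function at once.
    return chunk.apply(lambda t: _handler(str(t)))


def anonymize_series(s, num_of_processes=2):
    chunks = np.array_split(s, num_of_processes)
    with Pool(num_of_processes, initializer=_init_worker) as pool:
        return pd.concat(pool.map(_scrub_chunk, chunks))


if __name__ == "__main__":
    s = pd.Series(["uno", "dos", "tres", "cuatro"])
    print(anonymize_series(s).tolist())
    # ['UNO', 'DOS', 'TRES', 'CUATRO']
```

Is this the right way to adapt the pattern, or is there a better way to keep one Presidio engine per worker?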