
I am trying to use a data anonymizer that I found in this GitHub repository. I want to anonymize a dataframe of 2000 records, but doing it sequentially takes a long time (about 10 minutes for 100 records). I found this, which uses the swifter package, so I wrote this:

from src.presidio_handler import PresidioHandler
import pandas as pd
import swifter  # registers the .swifter accessor on pandas objects

def anonymize(text, language="en"):
    return PresidioHandler().anonymize_text(text=text, language=language)['text']

def anonymize_column(df, column, language="es"):
    df_anonymized = df.copy()

    # Apply anonymization to the values using swifter
    df_anonymized[f'{column}_anon'] = df_anonymized[column].swifter.allow_dask_on_strings(enable=True).apply(anonymize, language=language)

    return df_anonymized

if __name__ == '__main__':

    df = pd.read_csv('data.csv')
    df = df.drop_duplicates(ignore_index=True)
    df = anonymize_column(df, 'Nombre del evento')
    df.to_csv('annon_df.csv', index=False)

But I get a blue screen. So I tried Tom Raz's solution in this question, and this is my code:

from src.presidio_handler import PresidioHandler
import pandas as pd
import os
import numpy as np
from multiprocessing import Pool
from functools import partial

def anonymize(text, language="en"):
    return PresidioHandler().anonymize_text(text=text, language=language)['text']

def parallelize(data, func, num_of_processes=8):
    # Split the data into num_of_processes chunks and run func on each chunk in its own process
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def run_on_subset(func, data_subset):
    # Apply func row by row to one chunk of the dataframe
    return data_subset.apply(func, axis=1)

def parallelize_on_rows(data, func, num_of_processes=8):
    return parallelize(data, partial(run_on_subset, func), num_of_processes)

if __name__ == '__main__':

    df = pd.read_csv('data.csv')
    df = df.drop_duplicates(ignore_index=True)
    column_name = 'Nombre del evento'
    df[f'{column_name}_anon'] = parallelize(df[column_name], anonymize, num_of_processes=8)
    df.to_csv('annon_df.csv', index=False)

But when I execute it, RAM, CPU, and disk usage all go up to 100% and I get a blue screen. I don't know what I'm doing wrong, since I've relied on a Stack Overflow answer and also asked ChatGPT a bit, but nothing works. Does anyone know how to fix this, or how else I could parallelize this data anonymization?
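For reference, this is the direction I am considering next: a minimal sketch (untested, and the handler-per-worker idea is my own assumption, not something from the anonymizer's docs) that uses fewer processes and a Pool initializer so each worker builds a single PresidioHandler instead of creating one inside every call, hoping to keep RAM usage down:

from multiprocessing import Pool
import pandas as pd

from src.presidio_handler import PresidioHandler

_handler = None  # one PresidioHandler per worker process

def init_worker():
    # Build the (heavy) handler once when the worker process starts
    global _handler
    _handler = PresidioHandler()

def anonymize_one(text, language="es"):
    # Reuse the per-worker handler instead of constructing a new one per text
    return _handler.anonymize_text(text=text, language=language)['text']

if __name__ == '__main__':
    df = pd.read_csv('data.csv').drop_duplicates(ignore_index=True)
    texts = df['Nombre del evento'].tolist()

    # 2 processes instead of 8 to limit total memory; chunksize reduces IPC overhead
    with Pool(processes=2, initializer=init_worker) as pool:
        df['Nombre del evento_anon'] = list(pool.imap(anonymize_one, texts, chunksize=50))

    df.to_csv('annon_df.csv', index=False)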

suribe06
  • Multiprocessing replicates the data in each process. This requires a lot of RAM. When there is not enough RAM available, the OS stores RAM data on the disk, which is much slower. When there is not enough space on the disk, you can get a blue screen. Using multiple processes is not efficient, and it is only useful if the target package does not use shared resources like the GPU. Multiprocessing is the standard way to parallelize things in Python because the GIL prevents nearly any speed-up with multithreading. This is just not a language designed for that. Shared arrays could possibly help, though. – Jérôme Richard Jul 08 '23 at 13:41
  • It is not clear whether the computation is done locally, given the imported module (like openai). If it is local, then you will need at least several GiB of RAM per process, and the computation will already be parallel, so multiprocessing is useless (i.e. the process already runs at full speed). If it is done remotely, multiprocessing is also useless, and multithreading can be useful assuming the packages support it (they generally don't). In that case, the package would actually be a joke, since private data would be sent to a remote server anyway (certainly outside your country, by the way)... – Jérôme Richard Jul 08 '23 at 13:45
  • I understand, thank you for your explanation. I will try other things then. – suribe06 Jul 08 '23 at 21:00
  • Have you ever heard about `dask`? This might be the right choice for you. – Bracula Jul 09 '23 at 23:34
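Edit: following the dask suggestion from the comments, this is the rough sketch I have in mind (untested; it assumes anonymize and PresidioHandler can be pickled/imported by dask's process-based scheduler, and it still builds a handler per call like my original code):

import dask.dataframe as dd
import pandas as pd

from src.presidio_handler import PresidioHandler

def anonymize(text, language="es"):
    return PresidioHandler().anonymize_text(text=text, language=language)['text']

if __name__ == '__main__':
    df = pd.read_csv('data.csv').drop_duplicates(ignore_index=True)

    # Split the dataframe into a few partitions and anonymize each one in its own process
    ddf = dd.from_pandas(df, npartitions=4)
    anon = ddf['Nombre del evento'].map_partitions(
        lambda s: s.map(anonymize), meta=('Nombre del evento_anon', 'object')
    )
    df['Nombre del evento_anon'] = anon.compute(scheduler='processes', num_workers=4)

    df.to_csv('annon_df.csv', index=False)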

0 Answers