
I have a dataframe with 1 million rows:

Id      description
 1      bc single phase acr
 2      conditioning accum
 3      dsply value ac
and a dictionary with 2927 entries, which looks like this:
Key     Value
accum   accumulator
bb      baseboard
dsply   display

I executed the following code to replace each dictionary key found in the dataframe with its value:

dataset=dataset.replace(dict, regex=True)

but it takes a long time to execute: 104.07914903743769 sec for 2,000 rows, on a machine with 8GB RAM. I need to apply this code to millions of rows, so can anyone tell me how to reduce the execution time? Also, is there an alternative way to do this task?

Shylashree
  • Is `regex=True` necessary? – sshashank124 Feb 20 '18 at 14:01
  • Are you trying to do the replacing on only 1 column? – MattR Feb 20 '18 at 14:03
  • If needed you could do it in chunks – Maarten Fabré Feb 20 '18 at 14:16
  • @sshashank124 assuming that the OP wants to do partial replaces, then it is necessary to use `regex=True`. If not, `map` would be faster for a large dictionary. https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict/41678874#41678874 – JohnE Feb 20 '18 at 14:23
  • Another idea is to understand what percentage of rows actually contain the values that need to be replaced. If the dataset is very large, you may see performance improvements by filtering down to only the rows that have a value from the key section of your dictionary – MattR Feb 20 '18 at 14:28
  • Yeah, trying to replace on only 1 column @MattR – Shylashree Feb 21 '18 at 06:30
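A minimal sketch of MattR's filtering idea, assuming a hypothetical `dataset` dataframe with a `description` column (the dictionary and data below are illustrative):

import re
import pandas as pd

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# One alternation pattern matching any dictionary key (escaped literals)
key_pattern = "|".join(re.escape(k) for k in rep_dict)

dataset = pd.DataFrame({'description': ['bc single phase acr',
                                        'conditioning accum',
                                        'dsply value ac']})

# Mask of rows containing at least one key; cheap compared to the replace
mask = dataset['description'].str.contains(key_pattern, regex=True)

# Run the (expensive) regex replace only on the matching rows
dataset.loc[mask, 'description'] = dataset.loc[mask, 'description'].replace(rep_dict, regex=True)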

2 Answers


I see a ~15% improvement by precompiling the regex.

But for optimal performance see @unutbu's excellent solution.

import pandas as pd
import re

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}
# Build a single compiled pattern that matches any dictionary key;
# re.escape ensures keys are treated as literals, not regex syntax
pattern = re.compile("|".join([re.escape(k) for k in rep_dict.keys()]), re.M)

def multiple_replace(string):
    # For each match, look up its replacement in the dictionary
    return pattern.sub(lambda x: rep_dict[x.group(0)], string)

df = pd.DataFrame({'description': ['bc single phase acr', 'conditioning accum', 'dsply value ac']})
df = pd.concat([df]*10000)

%timeit df['description'].map(multiple_replace)          # 72.8 ms per loop
%timeit df['description'].replace(rep_dict, regex=True)  # 88.6 ms per loop
jpp
  • If df changes to `df = pd.DataFrame({'description': ['bc single phase acr', 'conditioning accumulator', 'dsply value ac']})`, then after replacement the second value of df changes to `conditioning accumulatorulator`, which should not happen. – Ranjana Girish Feb 21 '18 at 13:37
  • @RanjanaGirish. That's true, but this is also true of your existing solution `dataset=dataset.replace(dict, regex=True)`. My reading of your question was that you need to optimize this function. Your new requirement will mean any process will be even slower as you need to split by spaces. – jpp Feb 21 '18 at 14:01
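If whole-word replacement is what's needed here, one option (a sketch, assuming keys should only ever match complete words) is to wrap the alternation in `\b` word boundaries:

import re

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# The \b anchors keep 'accum' from matching inside 'accumulator'
word_pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in rep_dict) + r")\b")

def multiple_replace_words(string):
    return word_pattern.sub(lambda x: rep_dict[x.group(0)], string)

print(multiple_replace_words('conditioning accumulator'))  # unchanged
print(multiple_replace_words('conditioning accum'))        # conditioning accumulator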

The task can easily be split up, since your operation is independent for each row. Speed could be improved by multiprocessing, or even by computing on several machines (this is a classic map operation in the MapReduce paradigm). The same goes for memory consumption: instead of loading the full dataset, you can load it partially, apply the replace, save the results, and carry on.
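A minimal sketch of the chunked approach, assuming the data lives in a CSV file (`data.csv` and the `description` column name are hypothetical):

import pandas as pd

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# Read and process the file in chunks so only part of it is in memory at once
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=100000)):
    chunk['description'] = chunk['description'].replace(rep_dict, regex=True)
    # Write the header only for the first chunk, then append
    chunk.to_csv('data_replaced.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)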

Leonid Mednikov