
I have a dataframe with 1 million rows:

Id      description
 1      bc single phase acr
 2      conditioning accum
 3      dsply value ac
and a dictionary with 2927 entries, which looks like this:
Key     Value
accum   accumulator
bb      baseboard
dsply   display

I executed the following code to replace each dictionary key found in the dataframe with its value:

dataset=dataset.replace(dict, regex=True)

but it takes a long time to execute: 104.07914903743769 sec for 2,000 rows, on a machine with 8GB RAM. I need to apply this code to millions of rows, so can anyone tell me how to reduce the execution time? Also, is there an alternative way to do this task?

Shylashree
  • Is `regex=True` necessary? – sshashank124 Feb 20 '18 at 14:01
  • Are you trying to do the replacing on only 1 column? – MattR Feb 20 '18 at 14:03
  • If needed you could do it in chunks – Maarten Fabré Feb 20 '18 at 14:16
  • @sshashank124 assuming that the OP wants to do partial replaces, then it is necessary to use `regex=True`. If not, `map` would be faster for a large dictionary. https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict/41678874#41678874 – JohnE Feb 20 '18 at 14:23
  • Another idea is to understand what percentage of rows actually contain the values that need to be replaced. If the dataset is very large, you may see performance improvements by filtering down to only the rows that have a value from the key section of your dictionary – MattR Feb 20 '18 at 14:28
  • Yeah, trying to replace on only 1 column @MattR – Shylashree Feb 21 '18 at 06:30
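A minimal sketch of MattR's filtering idea, assuming a hypothetical `dataset` dataframe with a `description` column (the dictionary and data below are illustrative):

import re
import pandas as pd

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# One alternation pattern matching any dictionary key (escaped literals)
key_pattern = "|".join(re.escape(k) for k in rep_dict)

dataset = pd.DataFrame({'description': ['bc single phase acr',
                                        'conditioning accum',
                                        'dsply value ac']})

# Mask of rows containing at least one key; cheap compared to the replace
mask = dataset['description'].str.contains(key_pattern, regex=True)

# Run the (expensive) regex replace only on the matching rows
dataset.loc[mask, 'description'] = dataset.loc[mask, 'description'].replace(rep_dict, regex=True)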

2 Answers


I see a ~15% improvement by precompiling the regex.

But for optimal performance see @unutbu's excellent solution.

import pandas as pd
import re

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}
# Build a single compiled pattern that matches any dictionary key;
# re.escape ensures keys are treated as literals, not regex syntax
pattern = re.compile("|".join([re.escape(k) for k in rep_dict.keys()]), re.M)

def multiple_replace(string):
    # For each match, look up its replacement in the dictionary
    return pattern.sub(lambda x: rep_dict[x.group(0)], string)

df = pd.DataFrame({'description': ['bc single phase acr', 'conditioning accum', 'dsply value ac']})
df = pd.concat([df]*10000)

%timeit df['description'].map(multiple_replace)          # 72.8 ms per loop
%timeit df['description'].replace(rep_dict, regex=True)  # 88.6 ms per loop
jpp
  • If df changes to `df = pd.DataFrame({'description': ['bc single phase acr', 'conditioning accumulator', 'dsply value ac']})`, then after replacement the second value of df changes to `conditioning accumulatorulator`, which should not happen. – Ranjana Girish Feb 21 '18 at 13:37
  • @RanjanaGirish. That's true, but this is also true of your existing solution `dataset=dataset.replace(dict, regex=True)`. My reading of your question was that you need to optimize this function. Your new requirement will mean any process will be even slower as you need to split by spaces. – jpp Feb 21 '18 at 14:01
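If whole-word replacement is what's needed here, one option (a sketch, assuming keys should only ever match complete words) is to wrap the alternation in `\b` word boundaries:

import re

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# The \b anchors keep 'accum' from matching inside 'accumulator'
word_pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in rep_dict) + r")\b")

def multiple_replace_words(string):
    return word_pattern.sub(lambda x: rep_dict[x.group(0)], string)

print(multiple_replace_words('conditioning accumulator'))  # unchanged
print(multiple_replace_words('conditioning accum'))        # conditioning accumulator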

The task can easily be split up, since your operation is independent for each row. Speed could be improved by multiprocessing, or even by computing on several machines (this is a classic map operation in the MapReduce paradigm). The same goes for memory consumption: instead of loading the full dataset, you can load it partially, apply the replace, save the results, and carry on.
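A minimal sketch of the chunked approach, assuming the data lives in a CSV file (`data.csv` and the `description` column name are hypothetical):

import pandas as pd

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# Read and process the file in chunks so only part of it is in memory at once
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=100000)):
    chunk['description'] = chunk['description'].replace(rep_dict, regex=True)
    # Write the header only for the first chunk, then append
    chunk.to_csv('data_replaced.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)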

Leonid Mednikov