I have a dataset of 8500 rows of text. I want to apply a function pre_process on each of these rows. When I do it serially, it takes about 42 mins on my computer:

import pandas as pd
import time
import re

### constructing a sample dataframe of 10 rows to demonstrate
df = pd.DataFrame(columns=['text'])
df.text = ["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
 'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
 "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
 'Yet the act is still charming here .',
 "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
 'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
 'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
 "a screenplay more ingeniously constructed than `` Memento ''",
 "`` Extreme Ops '' exceeds expectations ."]

def pre_process(text):
    '''
    function to pre-process and clean text
    '''
    stop_words = ['in', 'of', 'at', 'a', 'the']

    # lowercase
    text=str(text).lower()

    # remove special characters except spaces, apostrophes and dots
    text=re.sub(r"[^a-zA-Z0-9.']+", ' ', text)

    # remove stopwords
    text=[word for word in text.split(' ') if word not in stop_words]

    return ' '.join(text)

t = time.time()
for i in range(len(df)):
    df.text[i] = pre_process(df.text[i])

print('Time taken for pre-processing the data = {} mins'.format((time.time()-t)/60))

>>> Time taken for pre-processing the data = 41.95724259614944 mins

So, I want to make use of multiprocessing for this task. I took help from here and wrote the following code:

import pandas as pd
import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())

def func(text):
    return pre_process(text)

t = time.time()
results = pool.map(func, [df.text[i] for i in range(len(df))])
print('Time taken for pre-processing the data = {} mins'.format((time.time()-t)/60))

pool.close()

But the code just keeps on running, and doesn't stop.

How can I get it to work?

Kristada673

2 Answers

You can use pandas.Series.apply (df.text is a Series):

df.text= df.text.apply(pre_process)
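
For reference, a minimal self-contained sketch of the same idea. The two-row DataFrame stands in for the real 8500-row dataset, and moving the stop words into a module-level set named STOP_WORDS is my own choice for illustration, not part of the original function:

```python
import re

import pandas as pd

# Assumption: a set instead of the original list, for faster membership tests.
STOP_WORDS = {"in", "of", "at", "a", "the"}


def pre_process(text):
    """Lowercase the text, replace special characters, and drop stop words."""
    text = str(text).lower()
    # Replace every run of characters other than letters, digits, dots,
    # and apostrophes with a single space.
    text = re.sub(r"[^a-zA-Z0-9.']+", " ", text)
    return " ".join(word for word in text.split(" ") if word not in STOP_WORDS)


# Small stand-in for the real dataset.
df = pd.DataFrame({"text": ["Yet the act is still charming here .",
                            "`` Extreme Ops '' exceeds expectations ."]})

# A single call processes the whole column and returns a new Series.
df["text"] = df["text"].apply(pre_process)
print(df["text"].tolist())
```

Assigning the result back to the column in one step also avoids the row-by-row df.text[i] = ... assignment used in the question's loop.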
Shijith

The following code works for me. I don't use the func wrapper; I pass pre_process to pool.map directly. I also use the pool as a context manager (a with statement).

Below is the code running in IPython.

In [1]: from multiprocessing import Pool, TimeoutError
    ...: import multiprocessing as mp
    ...: import re
    ...: import time
    ...: import os

In [2]: text = ["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
    ...:  "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
    ...:  'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
    ...:  "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
    ...:  'Yet the act is still charming here .',
    ...:  "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
    ...:  'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
    ...:  'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
    ...:  "a screenplay more ingeniously constructed than `` Memento ''",
    ...:  "`` Extreme Ops '' exceeds expectations ."]

In [3]: def pre_process(text): 
    ...:     ''' 
    ...:     function to pre-process and clean text 
    ...:     ''' 
    ...:     stop_words = ['in', 'of', 'at', 'a', 'the'] 
    ...:  
    ...:     # lowercase 
    ...:     text=str(text).lower() 
    ...:  
    ...:     # remove special characters except spaces, apostrophes and dots 
    ...:     text=re.sub(r"[^a-zA-Z0-9.']+", ' ', text) 
    ...:  
    ...:     # remove stopwords 
    ...:     text=[word for word in text.split(' ') if word not in stop_words] 
    ...:  
    ...:     return ' '.join(text) 


In [4]: %%time 
    ...: result = [] 
    ...: for x in text: 
    ...:     result.append(pre_process(x)) 
    ...:  
    ...:                                                                                                 
CPU times: user 500 µs, sys: 59 µs, total: 559 µs
Wall time: 569 µs

In [5]: %%time 
    ...: with Pool(mp.cpu_count()) as pool: 
    ...:     results = pool.map(pre_process, text) 
    ...:  
    ...:                                                                                          
CPU times: user 4.58 ms, sys: 29 ms, total: 33.6 ms
Wall time: 137 ms

In [6]: results                                                                                        
Out[6]: 
["rock is destined to be 21st century 's new conan '' and that he 's going to make splash even greater than arnold schwarzenegger jean claud van damme or steven segal .",
 "gorgeously elaborate continuation lord rings '' trilogy is so huge that column words can not adequately describe co writer director peter jackson 's expanded vision j.r.r. tolkien 's middle earth .",
 'singer composer bryan adams contributes slew songs few potential hits few more simply intrusive to story but whole package certainly captures intended er spirit piece .',
 "you 'd think by now america would have had enough plucky british eccentrics with hearts gold .",
 'yet act is still charming here .',
 "whether or not you 're enlightened by any derrida 's lectures on other '' and self '' derrida is an undeniably fascinating and playful fellow .",
 'just labour involved creating layered richness imagery this chiaroscuro madness and light is astonishing .',
 'part charm satin rouge is that it avoids obvious with humour and lightness .',
 "screenplay more ingeniously constructed than memento ''",
 " extreme ops '' exceeds expectations ."]

%%time is the IPython cell magic that measures the execution time of a cell. Of course, with such a small sample, multiprocessing actually runs slower (137 ms vs. 569 µs wall time) because of the overhead of creating new processes.
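
For running the same thing as a standalone script rather than in IPython, here is a minimal sketch. It assumes that the worker function lives at module level and that the pool is created under an `if __name__ == "__main__":` guard. On platforms that start workers by spawning a fresh interpreter (Windows, for example), a missing guard or a function the workers cannot import is a commonly reported reason for a pool that never returns; whether that is what happened in the question is not confirmed. The helper name parallel_pre_process, the STOP_WORDS set, and the chunksize value are hypothetical choices for illustration.

```python
import multiprocessing as mp
import re

# Assumption: a module-level set instead of the original in-function list.
STOP_WORDS = {"in", "of", "at", "a", "the"}


def pre_process(text):
    """Lowercase the text, replace special characters, and drop stop words."""
    text = str(text).lower()
    text = re.sub(r"[^a-zA-Z0-9.']+", " ", text)
    return " ".join(word for word in text.split(" ") if word not in STOP_WORDS)


def parallel_pre_process(texts, processes=None, chunksize=256):
    """Hypothetical helper: map pre_process over texts in a worker pool."""
    # The with block shuts the pool down once map() has returned.
    with mp.Pool(processes or mp.cpu_count()) as pool:
        # chunksize batches the rows so each task carries more than one string.
        return pool.map(pre_process, texts, chunksize=chunksize)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block
    # when they import the main module.
    sample = ["Yet the act is still charming here .",
              "Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness ."]
    print(parallel_pre_process(sample, processes=2))
```

For a function this cheap, the cost of starting processes and pickling the strings can easily outweigh the work itself, so it is worth timing against the serial version on the real data.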

Also, with a pandas DataFrame you can convert the column (a Series) to a list with list(), as below, instead of indexing into it element by element, which is more efficient.

list(df.text)

Below is a performance comparison of list() against iterating by index as you did (s here refers to the df.text Series). The total CPU time is 197 µs vs. 564 µs.

In [52]: %%time 
    ...: [s[i] for i in range(len(s))] 
    ...:  
    ...:                                                                                                
CPU times: user 499 µs, sys: 65 µs, total: 564 µs
Wall time: 506 µs
Out[52]: 
["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
 'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
 "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
 'Yet the act is still charming here .',
 "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
 'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
 'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
 "a screenplay more ingeniously constructed than `` Memento ''",
 "`` Extreme Ops '' exceeds expectations ."]

In [53]: %%time 
    ...: list(s) 
    ...:  
    ...:                                                                                                
CPU times: user 174 µs, sys: 23 µs, total: 197 µs
Wall time: 215 µs
Out[53]: 
["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
 'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
 "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
 'Yet the act is still charming here .',
 "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
 'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
 'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
 "a screenplay more ingeniously constructed than `` Memento ''",
 "`` Extreme Ops '' exceeds expectations ."]
Darren Christopher
  • I applied both multiprocessing, as shown in your code, and the `df.apply` method shown in the other answer on my original dataset with 8500 rows. The results are quite interesting - the multiprocessing method took 13.04 seconds while `df.apply` took 1.37 seconds. – Kristada673 Sep 10 '19 at 06:43