How can I optimize a search in a pandas DataFrame?

Question

I need to search for the word 'mas' in a DataFrame. The column holding the phrases is Corpo, and the text in this column is split into a list of words, for example: I like birds ---> split [I, like, birds]. I need to find 'mas' in a Portuguese phrase and keep only the words that come after it. The code below takes too long to run.

df.Corpo.update(df.Corpo.str.split())  # tokenize the phrase
df.Corpo = df.Corpo.fillna('') 

for i in df.index:
  for j in range(len(df.Corpo[i])):
    lista_aux = []

    if df.Corpo[i][j] == 'mas' or df.Corpo[i][j] == 'porem' or df.Corpo[i][j] == 'contudo' or df.Corpo[i][j] == 'todavia':
        lista_aux = df.Corpo[i]
        df.Corpo[i] = lista_aux[j+1:]
        break

    if df.Corpo[i][j] == 'question':
        df.Corpo[i] = ['question']
        break

https://stackoverflow.com/questions/26640129/search-for-string-in-all-pandas-dataframe-columns-and-filter — Rahul Agarwal, Sep 18 '18 at 14:34
Welcome to StackOverflow! Please provide an example input and expected output in your question. You can read about [how to ask a question](https://stackoverflow.com/help/how-to-ask) (particularly [how to create a good example](https://stackoverflow.com/help/mcve)) in order to get good responses. — Alex, Sep 18 '18 at 14:34

Xukrao · Accepted Answer · 2018-09-18T17:29:18.280

When working with pandas dataframes (or numpy arrays) you should always try to use vectorized operations instead of for-loops over individual dataframe elements. Vectorized operations are (nearly always) significantly faster than for-loops.

In your case you could use pandas' built-in vectorized string method str.extract, which extracts the part of each string that matches a capture group in a regex search pattern. The pattern mas (.+) captures everything that follows 'mas' and a space; rows without a match yield NaN.

import pandas as pd

# Example dataframe with phrases
df = pd.DataFrame({'Corpo': ['I like birds', 'I mas like birds', 'I like mas birds']})

# Use regex search to extract phrase sections following 'mas'
df2 = df.Corpo.str.extract(r'mas (.+)')

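# Fill gaps with the full original phrase. Filling the column (a Series)
# aligns on the row index; DataFrame.fillna(Series) would match the Series
# index against column labels and reuse the first phrase for every gap.
df2[0] = df2[0].fillna(df.Corpo)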

This gives the following result:

In [1]: df2
Out[1]:
              0
0  I like birds
1    like birds
2         birds
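
The question also mentions 'porem', 'contudo' and 'todavia', and works on tokenized phrases. The same idea can be extended to cover that case. What follows is a minimal sketch rather than a drop-in replacement: the sample phrases are made up for illustration, the `\b` word boundaries are an added assumption so that 'mas' is not matched inside longer words such as 'temas', and the special handling of 'question' from the original loop is left out.

```python
import pandas as pd

# Hypothetical sample data; the real 'Corpo' column holds Portuguese phrases.
df = pd.DataFrame({'Corpo': [
    'gosto de aves',
    'gosto de aves mas prefiro gatos',
    'temas de hoje',
    'era tarde porem ele veio',
    None,
]})

# \b word boundaries keep 'mas' from matching inside words such as 'temas'.
# The non-capturing group lists the four conjunctions from the question.
pattern = r'\b(?:mas|porem|contudo|todavia)\b\s+(.+)'

# expand=False returns a Series, so fillna aligns row by row on the index.
after = df['Corpo'].str.extract(pattern, expand=False)

# Fall back to the original phrase where nothing matched, then to '' for
# rows that were missing in the first place.
result = after.fillna(df['Corpo']).fillna('')

# Tokenize once at the end, as the question does with str.split().
tokens = result.str.split()
```

Splitting after the extraction keeps the regex step on plain strings, where the vectorized string methods apply directly. One difference from the loop in the question: a phrase that ends with the conjunction has nothing to capture, so it falls back to the full phrase instead of an empty list.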

edited Sep 18 '18 at 17:29

answered Sep 18 '18 at 14:41


If a phrase doesn't contain the word 'mas', how can I keep the original phrase instead of 'NaN'? – Mariana Sep 18 '18 at 17:08
With a [`fillna`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) operation, for example. See the edited answer. – Xukrao Sep 18 '18 at 17:27
Thanks Xukrao you helped me so much! – Mariana Sep 18 '18 at 17:33
