I need to search for the word 'mas' in a DataFrame. The column containing the phrases is Corpo, and the text in this column is split into a list of tokens, for example: I like birds ---> split [I, like, birds]. I need to find 'mas' in a Portuguese phrase and keep only the words that come after it. My code takes too long to execute.

df.Corpo.update(df.Corpo.str.split())  # tokenize the phrase
df.Corpo = df.Corpo.fillna('')

for i in df.index:
  for j in range(len(df.Corpo[i])):
    lista_aux = []

    if df.Corpo[i][j] == 'mas' or df.Corpo[i][j] == 'porem' or df.Corpo[i][j] == 'contudo' or df.Corpo[i][j] == 'todavia':
        lista_aux = df.Corpo[i]
        df.Corpo[i] = lista_aux[j+1:]
        break

    if df.Corpo[i][j] == 'question':
        df.Corpo[i] = ['question']
        break
Mariana
  • https://stackoverflow.com/questions/26640129/search-for-string-in-all-pandas-dataframe-columns-and-filter – Rahul Agarwal Sep 18 '18 at 14:34
  • Welcome to StackOverflow! Please provide an example input and expected output in your question. You can read about [how to ask a question](https://stackoverflow.com/help/how-to-ask) (particularly [how to create a good example](https://stackoverflow.com/help/mcve)) in order to get good responses. – Alex Sep 18 '18 at 14:34

1 Answer

When working with pandas DataFrames (or NumPy arrays), you should try to use vectorized operations instead of for-loops over individual elements. Vectorized operations are nearly always significantly faster than for-loops.

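In your case you could use pandas' built-in vectorized method str.extract, which extracts the part of each string captured by a regex group. Apply it to the raw phrases, before splitting them into tokens. The pattern \bmas\s+(.+) captures everything that follows the first standalone 'mas'; the \b word boundary keeps it from matching inside longer words such as 'problemas'.

import pandas as pd

# Example dataframe with phrases
df = pd.DataFrame({'Corpo': ['I like birds', 'I mas like birds', 'I like mas birds']})

# Use regex search to extract the phrase section following the word 'mas'
# (with expand=True the result is a DataFrame with one column, 0, for the capture group)
df2 = df.Corpo.str.extract(r'\bmas\s+(.+)', expand=True)

# Where 'mas' was not found, fall back to the full original phrase
# (filling the column as a Series aligns the values row by row)
df2[0] = df2[0].fillna(df.Corpo)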

This gives the following result:

In [1]: df2
Out[1]:
              0
0  I like birds
1    like birds
2         birds
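
The question also lists 'porem', 'contudo' and 'todavia' and expects a list of tokens per row. A minimal sketch of how the same idea might be extended, assuming the phrases are still unsplit strings; the names pattern, after and tokens, and the two Portuguese sample phrases, are only illustrative. The 'question' rule from the original loop is left out.

```python
import pandas as pd

# Example phrases; the last two use other conjunctions from the question.
df = pd.DataFrame({'Corpo': ['I like birds',
                             'I mas like birds',
                             'problemas aqui porem gosto de aves',
                             'eu todavia gosto']})

# \b keeps 'mas' from matching inside words such as 'problemas'.
# Assumption: conjunctions appear as standalone, lowercase, unaccented words.
pattern = r'\b(?:mas|porem|contudo|todavia)\b\s*(.*)'

# expand=False returns a Series, so fillna aligns row by row on the index.
after = df['Corpo'].str.extract(pattern, expand=False)

# Rows without a conjunction keep the full phrase; then tokenize every row.
tokens = after.fillna(df['Corpo']).str.split()

print(tokens.tolist())
```

Extracting first and splitting afterwards keeps both steps vectorized, so no Python-level loop over rows is needed.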
Xukrao