In pandas, how can I search for words and phrases to create a new DataFrame?

Question

In Python 3 with pandas, I have this DataFrame:

bens_gerais_candidatos_2014.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6400 entries, 0 to 6399
Data columns (total 12 columns):
uf_x               6400 non-null object
cargo              6400 non-null object
nome_completo      6400 non-null object
sequencial         6400 non-null object
cpf                6400 non-null object
nome_urna          6400 non-null object
partido_eleicao    6400 non-null object
situacao           6400 non-null object
uf_y               6400 non-null object
descricao          6400 non-null object
detalhe            6400 non-null object
valor              6400 non-null float64
dtypes: float64(1), object(11)
memory usage: 650.0+ KB

I need to select the rows whose "detalhe" column contains any of these words or phrases: "LOTE RURAL", "FAZENDA", "IMOVEL RURAL", "GLEBA", "AREA RURAL" or "AREA NO LOTEAMENTO".

Initially I thought about selecting each part:

mask = bens_gerais_candidatos_2014['detalhe'].str.contains("LOTE RURAL", na=False)
parte1 = bens_gerais_candidatos_2014[mask]

mask = bens_gerais_candidatos_2014['detalhe'].str.contains("FAZENDA", na=False)
parte2 = bens_gerais_candidatos_2014[mask]

And so on. Then I would combine these results with a series of merges:

areas1 = pd.merge(parte1, parte2, left_on='cpf', right_on='cpf', how='outer')
areas2 = pd.merge(areas1, parte3, left_on='cpf', right_on='cpf', how='outer')

...

Is there an easier way to look up words and phrases to create a new DataFrame?

The result should not contain repeated rows. For example, in some rows "LOTE RURAL" appears alone, in others it appears together with "FAZENDA", and in others only "FAZENDA" appears. Like this:

"LOTE RURAL 42"
"LOTE RURAL 38, DENOMINADO FAZENDA CATARINA"
"FAZENDA ÁGUA VERMELHA"

score 2 · Accepted Answer · answered May 22 '18 at 18:36

I think you can do:

str_choice = "LOTE RURAL|FAZENDA|IMOVEL RURAL"
mask = bens_gerais_candidatos_2014['detalhe'].str.contains(str_choice, na=False)
bens_gerais_candidatos_2014[mask]

The symbol | means "or" in the pattern str_choice, so the mask matches every row that contains any of the phrases. Add as many terms separated by | as you need.
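To see how this behaves on the sample strings from the question, here is a minimal sketch. The small DataFrame, the cpf values, and the non-matching "APARTAMENTO 101" row are made up for illustration; only the first three "detalhe" strings come from the question.

```python
import pandas as pd

# Hypothetical sample data; the first three strings are from the question.
df = pd.DataFrame({
    "cpf": ["111", "222", "333", "444"],
    "detalhe": [
        "LOTE RURAL 42",
        "LOTE RURAL 38, DENOMINADO FAZENDA CATARINA",
        "FAZENDA ÁGUA VERMELHA",
        "APARTAMENTO 101",
    ],
})

# "|" is regex alternation: a row matches if any one alternative is found.
str_choice = "LOTE RURAL|FAZENDA|IMOVEL RURAL"
mask = df["detalhe"].str.contains(str_choice, na=False)

# Boolean indexing keeps each matching row once, even when a row
# contains more than one of the phrases.
areas = df[mask]
print(areas)
```

Because the mask holds one True/False value per row, the row that contains both "LOTE RURAL" and "FAZENDA" is selected once, so no merge or de-duplication step should be needed.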

score 2 · Answer 2 · answered May 22 '18 at 18:38

You can try the code below:

search_list = ["LOTE RURAL","FAZENDA","IMOVEL RURAL","GLEBA","AREA RURAL","AREA NO LOTEAMENTO"]

mask = bens_gerais_candidatos_2014['detalhe'].str.contains('|'.join(search_list), na=False)
areas = bens_gerais_candidatos_2014[mask]
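Since the joined string is treated as a regular expression, a slightly more defensive variant might escape each term first. The sketch below assumes a tiny made-up DataFrame; `re.escape`, `case=False`, and the `.copy()` call are optional additions that are not part of the answer above, and `areas_rurais` is just an illustrative name.

```python
import re

import pandas as pd

search_list = ["LOTE RURAL", "FAZENDA", "IMOVEL RURAL",
               "GLEBA", "AREA RURAL", "AREA NO LOTEAMENTO"]

# re.escape protects against terms that contain regex metacharacters
# such as "." or "("; for the plain terms above it does not change matches.
pattern = "|".join(re.escape(term) for term in search_list)

# Hypothetical sample data, including a missing value and a lowercase entry.
df = pd.DataFrame({
    "detalhe": ["GLEBA 7", None, "CASA", "area rural km 3"],
    "valor": [1.0, 2.0, 3.0, 4.0],
})

# na=False treats missing values as non-matches;
# case=False (an assumption about the data) ignores letter case.
mask = df["detalhe"].str.contains(pattern, case=False, na=False)

# .copy() gives an independent DataFrame that can be modified later
# without affecting the original.
areas_rurais = df[mask].copy()
print(areas_rurais)
```

If the column is known to be consistently uppercase and free of missing values, as the info() output in the question suggests, the plain '|'.join(search_list) version should behave the same way.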

harvpan
