5

In Python3 and pandas I have the dataframe:

df_projetos_api_final.info()                                                      
<class 'pandas.core.frame.DataFrame'>                                                      
Int64Index: 93631 entries, 1 to 93667                                                      
Data columns (total 21 columns):                                                           
AnoMateria                       93631 non-null object                                     
CodigoMateria                    93631 non-null object                                     
DescricaoIdentificacaoMateria    93631 non-null object                                     
DescricaoSubtipoMateria          93631 non-null object                                     
IndicadorTramitando              93631 non-null object                                     
NomeCasaIdentificacaoMateria     93631 non-null object                                     
NumeroMateria                    93631 non-null object                                     
ApelidoMateria                   891 non-null object                                       
DataApresentacao                 93631 non-null object                                     
DataLeitura                      54213 non-null object                                     
EmentaMateria                    93631 non-null object                                     
ExplicacaoEmentaMateria          9461 non-null object                                      
IndicadorComplementar            93631 non-null object                                     
DescricaoNatureza                54352 non-null object                                     
NomeAutor                        93100 non-null object                                     
IndicadorOutrosAutores           93214 non-null object                                     
CodigoParlamentar                49786 non-null object                                     
NomeParlamentar                  49786 non-null object                                     
NomeCompletoParlamentar          49786 non-null object                                     
UfParlamentar                    45613 non-null object                                     
DescricaoSituacao                78783 non-null object                                     
dtypes: object(21)                                                                         
memory usage: 8.2+ MB 

The column "EmentaMateria" has in each row a series of sentences. I plan to create a new dataframe from rows that contain any or several of these words (or group of words) in this column:

str_choice = "MULHER|MULHERES|TRABALHO DOMESTICO|VIOLENCIA CONTRA A MULHER|VIOLENCIA DOMESTICA|VIOLENCIA DE GENERO|MARIA DA PENHA|ABORTO|ABORTAMENTO|INTERRUPCAO DE GRAVIDEZ|INTERRUPCAO DE GESTACAO|DIREITO REPRODUTIVO|DIREITOS REPRODUTIVOS|DIREITO A VIDA|CONCEPCAO|CONTRACEPCAO|CONTRACEPTIVO|MISOPROSTOL|MIFEPRISTONE|CYTOTEC|UTERO|GESTACAO|GRAVIDEZ|PARTO|VIOLENCIA OBSTETRICA|FETO|BEBE|CRIANCA|VIOLENCIA SEXUAL|FEMINICIDIO|MORTE DE MULHER|MORTE DE MULHERES|HOMICIDIO DE MULHER|HOMICIDIO DE MULHERES|ASSEDIO SEXUAL|ASSEDIO|ESTUPRO|VIOLENCIA SEXUAL|ABUSO SEXUAL|ESTUPRO DE VULNERAVEL|LICENCA MATERNIDADE|FEMININO|MULHER NEGRA|MULHERES NEGRAS|MULHERES QUILOMBOLAS|MULHERES INDIGENAS|NEGRAS|NEGRA|RACISMO|RACA|RACIAL|ABUSO SEXUAL|MATERNIDADE|MAE|AMAMENTACAO|SEXUALIDADE|SEXO|GENERO|FEMINISMO|MACHISMO|GUARDA DE FILHOS|GUARDA DOS FILHOS|IGUALDADE DE GENERO|IDENTIDADE DE GENERO|IDEOLOGIA DE GENERO|EDUCACAO SEXUAL|ESCOLA SEM PARTIDO|TRANSEXUAL|TRANSEXUALIDADE|MULHER TRANS|MULHERES TRANS|MUDANCA DE SEXO|READEQUACAO SEXUAL|EXPLORACAO SEXUAL|PROSTITUICAO|ORIENTACAO SEXUAL|HOMOSSEXUAL|HOMOSSEXUALIDADE|HOMOSSEXUALISMO|LESBICA|LESBICAS|DIREITO DOS HOMENS|EDUCACAO RELIGIOSA|DEUS|RELIGIAO|EDUCACAO DOMICILIAR|HOMESCHOOLING|CRECHE|EDUCACAO INFANTIL|CASAMENTO INFANTIL"  

So I did like this:

seleciona2 = df_projetos_api_final [df_projetos_api_final['EmentaMateria'].\
                               str.contains(str_choice, na=False)]

The newly generated dataframe has collected multiple sentences that have one or more of these words. However, many lines do not have these words, as
"ENCAMINHA AO SENADO FEDERAL, UM ADENDO AS SUGESTOES DE EMENDAS A
PROPOSTA ORCAMENTARIA DO DISTRITO FEDERAL, REFERENTE A ALTERACAO DO
PROGRAMA DE TRABALHO DO FUNDEPE - FUNDO DE DESENVOLVIMENTO DO
DISTRITO FEDERAL, VISANDO A ACRESCENTAR MAIS CZ 3.453.977.000,00
(TRES BILHOES, QUATROCENTOS E CINQUENTA E TRES MILHOES, NOVECENTOS E
SETENTA E SETE MIL CRUZADOS) AO PROJETO DE EXECUCAO DE OBRAS E
EQUIPAMENTOS DO SISTEMA DE EDUCACAO E CULTURA."

Please, is this because similar words are also being searched for in sentences? Or because many sentences have too much white space or line breaks between some words?

Edit 7/12/2019

Thanks so much all friends for your attention. After you wrote I went back to reviewing the database and again the suggested codes. I got the original database, with accents in Brazilian Portuguese. I think this is the original problem - I didn't know the original base had been changed

I found that the database I was working on had gone through unidecode to remove accents from Portuguese. So I repeated the tests with a str_choice with accents and with the original database, and then it worked - I haven't checked all the lines yet, but all I've seen are correct so far

So the new str_choice (I used the name search_list), I used was this:

df_projetos_api_final['EmentaMateria'] = df_projetos_api_final['EmentaMateria'].str.upper()
search_list = ["MULHER", "MULHERES", "TRABALHO DOMÉSTICO", "VIOLÊNCIA CONTRA A MULHER", "VIOLÊNCIA DOMÉSTICA", "VIOLÊNCIA DE GÊNERO", "MARIA DA PENHA", "ABORTO", "ABORTAMENTO", "INTERRUPÇÃO DE GRAVIDEZ", "INTERRUPÇÃO DE GESTAÇÃO", "DIREITO REPRODUTIVO", "DIREITOS REPRODUTIVOS", "DIREITO À VIDA", "CONCEPÇÃO", "CONTRACEPÇÃO", "CONTRACEPTIVO", "MISOPROSTOL", "MIFEPRISTONE", "CYTOTEC", "ÚTERO", "GESTAÇÃO", "GRAVIDEZ", "PARTO", "VIOLÊNCIA OBSTÉTRICA", "FETO", "BEBÊ", "CRIANÇA", "VIOLÊNCIA SEXUAL", "FEMINICÍDIO", "MORTE DE MULHER", "MORTE DE MULHERES", "HOMICÍDIO DE MULHER", "HOMICÍDIO DE MULHERES", "ASSÉDIO SEXUAL", "ASSÉDIO", "ESTUPRO", "VIOLÊNCIA SEXUAL", "ABUSO SEXUAL", "ESTUPRO DE VULNERÁVEL", "LICENÇA MATERNIDADE", "FEMININO", "MULHER NEGRA", "MULHERES NEGRAS", "MULHERES QUILOMBOLAS", "MULHERES INDÍGENAS", "NEGRAS", "NEGRA", "RACISMO", "RAÇA", "RACIAL", "ABUSO SEXUAL", "MATERNIDADE", "MÃE", "AMAMENTAÇÃO", "SEXUALIDADE", "SEXO", "GÊNERO", "FEMINISMO", "MACHISMO", "GUARDA DE FILHOS", "GUARDA DOS FILHOS", "IGUALDADE DE GÊNERO", "IDENTIDADE DE GÊNERO", "IDEOLOGIA DE GÊNERO", "EDUCAÇÃO SEXUAL", "ESCOLA SEM PARTIDO", "TRANSEXUAL", "TRANSEXUALIDADE", "MULHER TRANS", "MULHERES TRANS", "MUDANÇA DE SEXO", "READEQUAÇÃO SEXUAL", "EXPLORAÇÃO SEXUAL", "PROSTITUIÇÃO", "ORIENTAÇÃO SEXUAL", "HOMOSSEXUAL", "HOMOSSEXUALIDADE", "HOMOSSEXUALISMO",  "LÉSBICA",  "LÉSBICAS",  "DIREITO DOS HOMENS", "EDUCAÇÃO RELIGIOSA",  "DEUS", "RELIGIÃO", "EDUCACÃO DOMICILIAR", "HOMESCHOOLING", "CRECHE",  "EDUCAÇÃO INFANTIL",  "CASAMENTO INFANTIL"]
mask = df_projetos_api_final['EmentaMateria'].str.contains('|'.join(search_list))
seleciona = df_projetos_api_final[mask]
seleciona.info()
Reinaldo Chaves
  • 965
  • 4
  • 16
  • 43
  • 1
    Could you share at least part of the data in a format which is easy to use? – AMC Dec 06 '19 at 02:01
  • 1
    Few more questions: Can you share the entire data set, for testing and benchmarking? Are there line breaks in some of the strings? Where do the substrings you need to match come from? – AMC Dec 06 '19 at 04:52
  • 1
    Also, do you need the matches, or only to know whether there was one? – AMC Dec 06 '19 at 05:01
  • Thanks so much @Alexander Cécile and all the other friends for your attention. After you wrote I went back to reviewing the database and again the suggested codes. I got the original database, with accents in Brazilian Portuguese. I think this is the original problem – Reinaldo Chaves Dec 07 '19 at 08:29
  • I found that the database I was working on had gone through unidecode to remove accents from Portuguese. So I repeated the tests with a str_choice with accents, and then it worked. I will edit the question with this new information and the new str_choice I used and also link to the original base – Reinaldo Chaves Dec 07 '19 at 08:30
  • The original database @Alexander Cécile: https://drive.google.com/file/d/1AGa4FveEi-ev_oEOinsj1s02QbxMbwzT/view?usp=sharing – Reinaldo Chaves Dec 07 '19 at 08:34
  • 1
    Alright, I’ll edit my solution later to include the new `str_choice`. Do you need the matches or not, in the end? Also I’m not sure if you mean the accents are AN issue, or THE issue. I think the regex should be a problem, regardless of the accents/encoding. – AMC Dec 07 '19 at 09:29
  • I need to know if the words or word sets (example -"TRABALHO DOMÉSTICO") that are in search_list exist in column 'EmentaMateria'. If 'EmentaMateria' content has at least one of search_list items – Reinaldo Chaves Dec 07 '19 at 10:15
  • 1
    Alright, but you don’t need to know which word or words it is, correct? – AMC Dec 07 '19 at 10:27
  • For example, create a new column with the word or word set that was found? I don't need to, but it's a good suggestion to help check if the selection was correct – Reinaldo Chaves Dec 07 '19 at 10:32
  • 1
    Yeah, that’s what I meant. Good to know that you don’t need to do that! – AMC Dec 07 '19 at 10:39
  • 1
    Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/203820/discussion-between-alexander-cecile-and-reinaldo-chaves). – AMC Dec 07 '19 at 21:53

4 Answers4

4

Diagnosis

First, let's take a look at why your code doesn't work. @jorijnsmit gave it away (and shared a useful answer), your regex matches characters regardless of where they are. Let us illustrate with a simpler example, which I will be using throughout:

We want to match the words 'app' and 'he', so we construct a regex much like yours.

strings_to_match = ['app', 'he']

match_pattern = '|'.join(strings_to_match) # "app|he"

We join the strings we want to match using the alternation operator and we're good to go, right? Thanks to the magic of regex101, here are the results of applying our pattern to a few strings (matches are in square brackets):

  • [he]llo
  • brot[he]r
  • [app]lication
  • [he]
  • [app]le
  • h[app]ier
  • [app]
  • [he]ll

Our pattern matches the strings 'app' and 'he' anywhere, when we want just the words themselves!


RegEx Solution

What can we do to fix this? Our first thought may be to change our pattern to ' app | he ', which does fix the issue with strings like 'application'. Unfortunately, this isn't foolproof. That pattern fails to recognize the word 'app' in 'I downloaded an app.', which is perfectly valid to us. Fortunately for us, regex has just the solution we need: Word boundaries, represented by the token '\b', whose principle is rather self-explanatory.

Here are some results for the new pattern, '\bapp\b|\bhe\b':

  • '[he]'
  • 'apple'
  • 'happier'
  • ' [app] '
  • 'hell'
  • 'I downloaded an [app]!'

Exactly what we expect! While it does work correctly, that pattern is needlessly difficult to read. We can use only one set of word boundary tokens by putting all our substrings into a non-capturing group: '\b(?:app|he)\b'. A capturing group, well, groups and captures a subset of a regex. In this case, the group would return the same thing as the entire match. A non-capturing group eliminates that redundancy while still allowing us to logically separate part of our expression.

Here is a complete program which demonstrates constructing the pattern and using it on a Pandas Series:

import pandas as pd

test_strs = ['hello', 'brother', 'application', 'he', 'apple', 'happier', 'app', 'hell', ' app ',
             'I downloaded an app.']

test_series = pd.Series(data=test_strs)

strings_to_match = ['app', 'he']

match_pattern = fr"\b(?:{'|'.join(strings_to_match)})\b"  # "\b(?:app|he)\b"

match_res = test_series.str.contains(match_pattern, case=False)

Output of print(match_res):

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8     True
9     True
dtype: bool

A few notes on other solutions

1.

Please note that these methods can only match words, not arbitrary substrings. As such, they are not actually valid solutions for this particular problem, and are only covered here for the sake of completeness.

This is the same style as the solution by @FBruzzesi, which we will call version 1. For reference:

# Convert string into list of strings
str_list = str_choice.split(|)

# Control if any word is in the sentence after splitting the sentence by space
df['has_match'] = df.apply(lambda r: [x for x in str_list if x in r['EmentaMateria'].split(' ')], axis=1)

#This will create a list of words you find, then you can filter only those which has a match
df = df[df.apply(lambda r: len(r['has_match'])>0, axis=1)]

While their solution collects all the matches, we are only concerned with whether or not there is a match in. Let's look at a refactored version of their solution, version 2:

import pandas as pd

test_strs = ['hello', 'brother', 'application', 'he', 'apple', 'happier', 'app', 'hell', ' app ',
             'I downloaded an app.']

test_series = pd.Series(data=test_strs)

strings_to_match = ['app', 'he']

series_split = test_series.str.split()

match_res = series_split.map(lambda curr_words: any((curr_sub in curr_words for curr_sub in strings_to_match)))

Unlike version 1, version 2 keeps the number of split() operations, which were a real cause for concern (I estimated around 8,000,000 split() operations for the entire column), to a minimum. It should also be more efficient since the number of iterations depends on the number of substrings to match, which should often be lower than the number of words in the string to check.

2.

I have seen a few mentions of the regex parameter which I feel are unclear or misleading. Yes, passing regex=False will match a literal string, no, just changing the parameter won't make your current code work (why would it?).


I hope this is the sort of thing you had in mind when you asked for a canonical answer. Let me know if anything is unclear or you have any further questions :)

Community
  • 1
  • 1
AMC
  • 2,642
  • 7
  • 13
  • 35
3

The docs for .contains() mention you could use the stricter .match() instead as it is based on re.match instead of re.search.

For an explanation between the two see for example this thread: What is the difference between re.search and re.match?.

Edit: Just for kicks I tried finding out which pattern was matched exactly:

str_choice = "MULHER|MULHERES|TRABALHO DOMESTICO|VIOLENCIA CONTRA A MULHER|VIOLENCIA DOMESTICA|VIOLENCIA DE GENERO|MARIA DA PENHA|ABORTO|ABORTAMENTO|INTERRUPCAO DE GRAVIDEZ|INTERRUPCAO DE GESTACAO|DIREITO REPRODUTIVO|DIREITOS REPRODUTIVOS|DIREITO A VIDA|CONCEPCAO|CONTRACEPCAO|CONTRACEPTIVO|MISOPROSTOL|MIFEPRISTONE|CYTOTEC|UTERO|GESTACAO|GRAVIDEZ|PARTO|VIOLENCIA OBSTETRICA|FETO|BEBE|CRIANCA|VIOLENCIA SEXUAL|FEMINICIDIO|MORTE DE MULHER|MORTE DE MULHERES|HOMICIDIO DE MULHER|HOMICIDIO DE MULHERES|ASSEDIO SEXUAL|ASSEDIO|ESTUPRO|VIOLENCIA SEXUAL|ABUSO SEXUAL|ESTUPRO DE VULNERAVEL|LICENCA MATERNIDADE|FEMININO|MULHER NEGRA|MULHERES NEGRAS|MULHERES QUILOMBOLAS|MULHERES INDIGENAS|NEGRAS|NEGRA|RACISMO|RACA|RACIAL|ABUSO SEXUAL|MATERNIDADE|MAE|AMAMENTACAO|SEXUALIDADE|SEXO|GENERO|FEMINISMO|MACHISMO|GUARDA DE FILHOS|GUARDA DOS FILHOS|IGUALDADE DE GENERO|IDENTIDADE DE GENERO|IDEOLOGIA DE GENERO|EDUCACAO SEXUAL|ESCOLA SEM PARTIDO|TRANSEXUAL|TRANSEXUALIDADE|MULHER TRANS|MULHERES TRANS|MUDANCA DE SEXO|READEQUACAO SEXUAL|EXPLORACAO SEXUAL|PROSTITUICAO|ORIENTACAO SEXUAL|HOMOSSEXUAL|HOMOSSEXUALIDADE|HOMOSSEXUALISMO|LESBICA|LESBICAS|DIREITO DOS HOMENS|EDUCACAO RELIGIOSA|DEUS|RELIGIAO|EDUCACAO DOMICILIAR|HOMESCHOOLING|CRECHE|EDUCACAO INFANTIL|CASAMENTO INFANTIL"

df = pd.DataFrame(['ENCAMINHA AO SENADO FEDERAL, UM ADENDO AS SUGESTOES DE EMENDAS A PROPOSTA ORCAMENTARIA DO DISTRITO FEDERAL, REFERENTE A ALTERACAO DO PROGRAMA DE TRABALHO DO FUNDEPE - FUNDO DE DESENVOLVIMENTO DO DISTRITO FEDERAL, VISANDO A ACRESCENTAR MAIS CZ 3.453.977.000,00 (TRES BILHOES, QUATROCENTOS E CINQUENTA E TRES MILHOES, NOVECENTOS E SETENTA E SETE MIL CRUZADOS) AO PROJETO DE EXECUCAO DE OBRAS E EQUIPAMENTOS DO SISTEMA DE EDUCACAO E CULTURA.'.split()])

df.T[0][df.T[0].str.contains(str_choice)]

Returns:

18    ALTERACAO
Name: 0, dtype: object

This returns because it contains the substring 'RACA'. If you set regex=False this does not happen; it will look for the full string.

gosuto
  • 5,422
  • 6
  • 36
  • 57
  • Thank you very much @jorijnsmit. But it found the word "ALTERACAO", right? However, this word was not in the "str_choice" – Reinaldo Chaves Nov 29 '19 at 11:22
  • With this command .match selects no rows: seleciona3 = df_projetos_api_final [df_projetos_api_final['EmentaMateria'].\ str.match(str_choice, na=False)] – Reinaldo Chaves Nov 29 '19 at 11:28
  • Thank you again @jorijnsmit I saw your new comment now. But the regex=False I set in what way? I did like this, but I had error: seleciona3 = df_projetos_api_final [df_projetos_api_final['EmentaMateria'].\ str.match(str_choice, na=False, regex=False)] – Reinaldo Chaves Nov 29 '19 at 12:04
  • No using `.contains()` in this case. – gosuto Nov 29 '19 at 12:07
  • This way? seleciona3 = df_projetos_api_final [df_projetos_api_final['EmentaMateria'].str(str_choice, na=False, regex=False)] TypeError: 'StringMethods' object is not callable – Reinaldo Chaves Nov 29 '19 at 12:15
  • In my case I have a column that contains phrases and a list of words of interest. So I just want the sentences that have one or more of these words, all without regex. Maybe a better approach would be to go straight to the NLTK? – Reinaldo Chaves Nov 29 '19 at 12:27
  • No `df_projetos_api_final[df_projetos_api_final['EmentaMateria'].str.contains(str_choice, na=False, regex=False)]`. – gosuto Nov 29 '19 at 12:37
  • 1
    Your script is doing exactly that, the sentence that is returned contains `RACA` which you are looking for. Your problem is that you only want it to return something when `RACA` is a word by itself. You could write a function that regexes for ` RACA` OR `RACA `. A better option is definitely using nltk which will have better single word awareness. – gosuto Nov 29 '19 at 12:41
  • Thanks again @jorijnsmit but the above command still returns zero lines. And about NLTK I will do some tests, thanks – Reinaldo Chaves Nov 29 '19 at 13:14
2

For an exact match this workaround works:

# Convert string into list of strings
str_list = str_choice.split(|)

# Control if any word is in the sentence after splitting the sentence by space
df['has_match'] = df.apply(lambda r: [x for x in str_list if x in r['EmentaMateria'].split(' ')], axis=1)

#This will create a list of words you find, then you can filter only those which has a match
df = df[df.apply(lambda r: len(r['has_match'])>0, axis=1)]
FBruzzesi
  • 6,385
  • 3
  • 15
  • 37
  • This may work, but do you know why OP's code didn't? – AMC Dec 06 '19 at 01:58
  • @AlexanderCécile as jorijnsmit kinda explains the contains method tests if a pattern (or regex) is contained within a string Series. However a "contain" means a string is contained in another (e.g. RACA in ALTERACAO) , and since you are looking for an exact match this is not working as wanted. – FBruzzesi Dec 06 '19 at 08:32
  • Why was this accepted? It doesn't actually fit OP's question, since it can't match a substring composed of multiple words, no? – AMC Dec 07 '19 at 02:44
0

Possible Solution:

"Create a new dataframe from rows that contain any or several of these words (or group of words) in this column:"


Step 1: Split the column into a new DataFrame

# Assume that each 'sentence' is comma-separated.
df_split = df_projetos_api_final.EmentaMateria.split(",", expand=True)

Step 2: Search for sub-string

# Set regex=False if you want the entire word to match vs a sub-string. 
rows_that_match = df_split[df_split.str.contains(str_choice, case=False)]
Anthony R
  • 2,739
  • 1
  • 12
  • 11
  • _Assume that each 'sentence' is comma-separated._ That's a massive assumption, no? Ultimately it doesn't even matter, since that second step won't work. If `regex=False`, how do you think the `'|'` will be handled? You would have to iterate over the words to search for, which is the second issue, since OP literally says _contain any or several of these words (or group of words)_. – AMC Dec 07 '19 at 03:18
  • The point you make regarding the incompatibility of `regex=False` and the '|' comparator is a good one. I will remove the regex parameter to be safe. Whether or not a comma is the delimiter is a moot point though, there should be a consistent separator whatever the actual separator is. But perhaps for a sentence a comma is too common and shouldn't be used. Great points :) – Anthony R Dec 09 '19 at 15:54