How to filter a data.frame by matching a character string?

Question

I have those structures:

coop <- structure(list(razao_social = c("COOPERATIVA AGROINDUSTRIAL ALFA", 
"COOPERATIVA MISTA DOS PRODUTORES DE LEITE DE MORRINHOS", "COOPERATIVA DE TRABALHO DOS PRESTADORES DE SERVICOS EM CONDOMINIOS - CONDOMINIUN", 
"COOPERATIVA EMPRESARIAL RURAL DE SAPUCARANA - COERSA", "COOPERATIVA AGROINDUSTRIAL ALFA", 
"COASUL COOPERATIVA AGROINDUSTRIAL", "COOPERATIVA DOS PLANTADORES DE CANA DA ZONA LENCOIS PTA", 
"COOPERATIVA AGROPECUARIA CENTRO SERRANA", "COOPERATIVA AGROINDUSTRIAL COOPERJA", 
"COOPERATIVA DOS CAFEICULTORES DA REGIAO DE GARCA", "COOPERATIVA DA CONSTRUCAO CIVIL DO ESTADO DE SERGIPE", 
"COOPERATIVA REG AGRO IND DE S DOMINGOS DO PRATA LTDA", "COOPERATIVA REG AGRO IND DE S DOMINGOS DO PRATA LTDA", 
"COOPERATIVA AGROPECUARIA VIDEIRENSE", "COOPERATIVA REGIONAL AGROPECUARIA VALE DO ITAJAI","COOPERATIVA DE TRABALHO E PRODUCAO ESPERACA"
), cnae_fiscal = c(4623109L, 4789099L, 9430800L, 9430800L, 4623108L, 
4623199L, 4789099L, 4789099L, 4789099L, 4789099L, 4789099L, 4633801L, 
4633801L, 4632001L, 4632001L,9430800L)), row.names = c(NA, 16L), class = "data.frame")


keywords <- c("abacaxi", "abate", "abatidas", "acabamento", "acácia", "açaí", 
"açúcar", "adoçantes", "adubos", "agentes", "aglomerada", 
"agr", "água", "aguardente", "aguardentes", "águas", "álcalis", 
"álcool", "alcoólicas", "algodão", "alho", "alimentação", 
"alimentares", "alimentícias", "alimentícios", "alimentos", 
"alopáticos", "ambiental", "amendoim", "amidos", "amiláceos", 
"animais", "animal", "aparelhos", "apicultura", "aquáticos", 
"aquicultura", "aqüicultura", "ar", "arroz", "artificiais", 
"árvores", "asininos", "atacadista", "atacado", "aves", "baía", 
"balas", "bambu", "banana", "banho", "base", "batata", "bebidas", 
"beneficiamento", "beterraba", "bicho", "biocombustíveis", "biscoitos", 
"bolachas", "bolos", "bolsas", "bombons", "bordar", "borracha", 
"bovinas", "bovinos", "bufalinos", "caça", "cacau", "café", 
"cafeeiro", "cafeicultor", "caju", "calçados", "cama", "camarões", 
"caminhões", "cana", "caprinos", "carburante", "carne", "carnes", 
"carpintaria", "carrocerias", "cartão", "cartolina", "carvão", 
"casas", "casca", "castanha", "cebola", "celulose", "celulósicas", 
"cereais", "cerveja", "cervejas", "chá", "charutos", "chás", 
"chocolates", "chope", "chopes", "cigarrilhas", "cigarros", "cítricos", 
"cloro", "coco", "coelhos", "colchões", "colheita", "combustíveis", 
"comerciais", "comestíveis", "concentrados", "condimento", "confeccionadas", 
"confeitaria", "confeitos", "conservação", "conservas", "contrato", 
"controle", "cordoaria", "corte", "cortiça", "costurar", "couro", 
"couros", "cristalizadas", "crustáceos", "cultivo", "cultivos", 
"curtimento", "defensivos", "dendê", "desdobramento", "desinfestantes", 
"destiladas", "dextrose", "dietéticos", "doce", "domissanitários", 
"equinos", "erva", "escargô", "especiarias", "espécies", "especificadas", 
"estamparia", "estimação", "estruturas", "eucalipto", "extração", 
"farinha", "farinhas", "farmacêuticas", "farmacêuticos", "farmoquímicos", 
"féculas", "feijão", "fermentos", "ferramenta", "fertilizantes", 
"fibras", "fios", "fitoterápicos", "flores", "florestais", "florestal", 
"florestas", "floricultura", "folha", "formação", "formulários", 
"forrageiras", "frangos", "frescos", "frigorífico", "frutas", 
"frutos", "fumo", "galináceos", "gases", "gelados", "gelo", 
"girassol", "gorduras", "gramas", "grão", "guaraná", "herbáceo", 
"higiênico", "higiênicos", "homeopáticos", "hortaliças", 
"horticultura", "hortifrutigranjeiros", "humano", "índia", "inglesa", 
"inseminação", "íntimas", "irrigação", "isotônicas", "jacaré", 
"lã", "laminada", "laminados", "laranja", "lãs", "látex", 
"laticínios", "lavoura", "lavouras", "legumes", "leguminosas", 
"leite", "leveduras", "linhas", "maçã", "madeira", "madeireiras", 
"madeireiros", "malas", "malha", "malharia", "malharias", "malte", 
"mamão", "mamona", "mandioca", "manejo", "manga", "mar", "maracujá", 
"margarina", "marinhos", "massas", "matadouro", "mate", "matérias", 
"medicamentos", "meias", "melancia", "melão", "mercadorias", 
"mexilhões", "milho", "mineral", "moagem", "moído", "molhos", 
"moluscos", "morango", "muares", "mudas", "nativas", "naturais", 
"oleaginosas", "óleo", "ondulado", "orgânicos", "organo", "origem", 
"ornamentais", "ostras", "ovinos", "ovos", "padaria", "pães", 
"palha", "palmito", "panificação", "papéis", "papel", "papelão", 
"pará", "pasto", "pecuária", "peixes", "peles", "pequenos", 
"permanente", "pesca", "pescado", "pescados", "pêssego", "pimenta", 
"pintos", "pinus", "plantadas", "plantas", "poda", "porte", "pratos", 
"produtos", "pulverização", "químicos", "raízes", "ranicultura", 
"rasteiro", "recreativos", "refinados", "refrescos", "refrigerante", 
"refrigerantes", "reino", "reses", "resseragem", "roupas", "rural", 
"salgada", "salobra", "seda", "sementes", "semicultivos", "seringueira", 
"serrarias", "serviço", "sisal", "soja", "solo", "solúvel", 
"sorvetes", "sucos", "suínas", "suínos", "tanoaria", "tapeçaria", 
"teca", "tecelagem", "tecidos", "temperos", "tênis", "termofixas", 
"termoplásticas", "têxteis", "têxtil", "texturização", "tomate", 
"torção", "torrado", "torrefação", "tosquiamento", "trançado", 
"trançados", "tratamento", "tricotagem", "tricotagens", "trigo", 
"tubérculos", "tubulares", "uísque", "uva", "vegetais", "vegetal", 
"veículos", "verduras", "vestuário", "veterinário", "vime", 
"vinagres", "vinho", "vivas", "viveiros", "vivos")

I want a data.frame called coop_filter where I will have the columns from coop with rows which $razao_social has words from keywords.

Note:

coop is really big, so I would like to delete the row after the transfer. I think this will also avoid duplication.
In keywords I have the word "agr". With this word I want to filter all words which begins with "agr" like AGROPECUARIA, AGRICOLA, etc.

I have been trying something like this:

library(tidyverse)

coop$razao_social <- tolower(coop$razao_social)

lixo <- c("-",",",".","(",")","  ",";",":")
for(i in 1:length(lixo)){
  coop$razao_social <- str_replace_all(coop$razao_social, fixed(lixo[i]), " ") 
  ifelse(i!=length(lixo),print(i),print(i) & rm(lixo,i))
}


# Filter

coop_filter <- data.frame()

for(i in length(keywords)){
  
  eval(parse(text=paste0("
  
                 coop_filter <- str_extract_all(coop$razao_social, '^.+",keywords[i],".+')
  
  ")))
  
}

I'm already stucked here with an empty coop_filter.

Try: `coop$index <- apply(coop[,1,drop=F],1,function(x) grepl(pattern = toupper(keywords),x = x))` and see if is what you want. I do not have a clear idea about the final output. — Duck, Jul 07 '20 at 21:51
It's a nice way to think about it! Thanks! But it returned everything as FALSE. — RxT, Jul 07 '20 at 22:01
Do you want the complete row where at least one word of `keywords` is present or you want to extract `keywords`.Can you show your expected output? — Ronak Shah, Jul 08 '20 at 00:29

nniloc · Accepted Answer · 2020-07-07T22:59:57.973

1

Using the solution from this answer.

coop %>%
  filter(str_detect(razao_social, paste(toupper(keywords), collapse = "|")))

#-----
                                              razao_social cnae_fiscal
1                          COOPERATIVA AGROINDUSTRIAL ALFA     4623109
2                          COOPERATIVA AGROINDUSTRIAL ALFA     4623108
3                        COASUL COOPERATIVA AGROINDUSTRIAL     4623199
4  COOPERATIVA DOS PLANTADORES DE CANA DA ZONA LENCOIS PTA     4789099
5                  COOPERATIVA AGROPECUARIA CENTRO SERRANA     4789099
6                      COOPERATIVA AGROINDUSTRIAL COOPERJA     4789099
7         COOPERATIVA DOS CAFEICULTORES DA REGIAO DE GARCA     4789099
8     COOPERATIVA REG AGRO IND DE S DOMINGOS DO PRATA LTDA     4633801
9     COOPERATIVA REG AGRO IND DE S DOMINGOS DO PRATA LTDA     4633801
10                     COOPERATIVA AGROPECUARIA VIDEIRENSE     4632001
11        COOPERATIVA REGIONAL AGROPECUARIA VALE DO ITAJAI     4632001

edited Jul 07 '20 at 22:59

answered Jul 07 '20 at 22:52

nniloc

4,128
2
11
22

I'm not sure I understand the first note in your question, but this should handle the main question and the second part. – nniloc Jul 07 '20 at 23:41
This worked nice! easier than I thought! But in my real data, it filter some `$razao_social` like "cooperativa de trabalho e producao esperanca". I don't have any of these words in my `keyword`. – RxT Jul 08 '20 at 04:11
I noticed I have the word "produtos", similar to "producao", but not the same. Could be it? In some cases this can help to find a word, but now, this is a odd. (i noticed this happening with another words too) – RxT Jul 08 '20 at 04:24
Similar to the way you use `agr` to match `agricola`, maybe there is a `keyword` which matches part of both `produtos` and `producao`? If you provide your full `keyword` list or an example of the issue we might be able to tell what is going on. – nniloc Jul 08 '20 at 14:43
Yes, `produtos` is in my `keywords`. But I don't have `prod`, for example. – RxT Jul 08 '20 at 14:50
Just add my entire `keywords` and `cooperativa de trabalho e producao esperanca` to reproduce the exemple. – RxT Jul 08 '20 at 14:57
It is the keyword `alho` which is matching with `trabalho`. Try running `sapply(toupper(keywords), function(x){str_detect("COOPERATIVA DE TRABALHO E PRODUCAO ESPERACA", x)}) ` to see which words match. – nniloc Jul 08 '20 at 15:50
1

If this answered your question would you mind selecting the answer? Thanks! – nniloc Jul 09 '20 at 15:58
Sorry! Thank you! If I want only the exact word, what should I change ? – RxT Jul 10 '20 at 04:14
Might be a better way, but you could try adding a space after the "or" separator, e.g. `collapse = "| "`. This way the pattern to match is a space, then the `keyword` – nniloc Jul 10 '20 at 16:01
1

Didn't work because it don't get the end of the string (`$`). I solve it adding `\\bkeyword\\b` . And did the filter by `agr` separated. Thanks a lot for your help! – RxT Jul 10 '20 at 17:38

How to filter a data.frame by matching a character string?

1 Answers1