Filter a dataframe based on a selection of string values found across multiple columns

Question

I have a huge database on replanting projects using different species of trees, and I want to create a new database selecting only the species I am interested in. I have ~70 words (i.e. species) I want to select from the dataframe, across 3 different columns. I'm trying to use the 'grepl' function, but I'm lost on adding multiple columns with the same selection of key words. The words/species can occur inconjunction with other species not targeted by my 70 words, not sure if that is the issue.

Essentially, I am trying to build code that finds any instance of the 70 words across the dataset, and selects them (or alternatively removes any row that does not include any of those 70), in order to avoid using command-f for 70+ words across a grand total of 16 datasets with thousands of rows.

Any help is much appreciated.

First I tried filtering the dataset with the 'grepl' function on the first column, called 'species' for the ~70 words, however it printed rows that do not include the 70 words/species. This is the output of the following:

> dput(head(NCR[,c("REGION", "COMPONENT","SPECIES")]))
structure(list(REGION = c("NCR", "NCR", "NCR", "NCR", "NCR", 
"NCR"), COMPONENT = c("Urban", "Urban", "Urban", "Urban", "Urban", 
"Urban"), SPECIES = c("Narra", "Banaba, Caballero, Ilang ilang, Molave, Yellow alder,Bougainvilla,", 
"Bignay, Camachile, Nangka, Sampaloc, Santol,Narra,kalumpit,langka,lipote,guyabano,palawan cherry,banaba,mahogany,Golden\nshower,Mangqa,Bayabas,bignay,molave", 
"Sansevieria, Spider lily, Yellow morado, Zigzag, Sansevieria, Spider lily, Yellow morado, Zigzag\nSansevieria, Spider lily, Yellow morado, Zigzag", 
"Banaba, Caballero, Ilang ilang, Narra, Tuai,", "Acacia, Acapulco, Antipolo, Bagras, Balete, Bougainvilla, Dao, Fire tree, Golden shower, Ipil, Kalumpit, Kamagong, Lipote, Manila palm, Molave, Nangka, Neem tree, Supa, Tuai, Yakal,mabolo,tabebuia,langka,bitaog,narracamachile,banaba,ilang\nilang,guyabano"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
key_terms <- c('mangrove','magrove','avicennia','bungalon','api-api','piapi','piape','miapi','myapi','miape','Rhizophora','bakau','Bakauan', 'bakaw','bakhaw','bacau','bacaw','Sonneratia','pagatpat','pedada','Nypa','nipa','nypa','sasa','Bruguiera','pototan','busain','langarai','Camptostemon','gapas','Ceriops','baras','tungog','tangal','Excoecaria','lipata','buta','Heritiera','dungon','Aegiceras','saging','Lumnitziera','tubao','culasi','kulasi','Osbornia','tawalis','bunot','Pemphis','bantigi','Scyphiphora','nilad','Xylocarpus','tabigi','tabige','piagao','piag-ao','tubo tubo','tubo-tubo','saging-saging','moluccensis','granatum','hydrophyllaceae','adicula','octodonta','corniculatum','littoralis','agallocha','tagal','decandra','philippinensis','parviflora','fruticans','caseolaris','ovata','alba' )
new_NCR <- filter(NCR, grepl(paste(key_terms, collapse='|'), SPECIES))
new_NCR

Please dont include your data as an image for [these reasons](https://meta.stackoverflow.com/a/285557/12109788). If you need help making a reproducible example, you can get [help here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — jpsmith, Jun 28 '23 at 16:57
@jpsmith Just added it! thank you. So the Species column has values that range from a single species to many, and this is stored as list separated by commas. The other two columns, Component and Commodity, in some of the other datasets also contain species names, hence I'm trying to filter by those three columns, so that the 'row' is kept if any of the 70+ words are seen — badiskhiari, Jun 28 '23 at 16:58
Thanks for the details - please edit your question and paste the output of `dput(head(NCR, 10))` instead of an image — jpsmith, Jun 28 '23 at 16:58
I just added the output and removed some of the irrelevant columns for brevity. Thanks — badiskhiari, Jun 28 '23 at 17:10
Please dont modify the `dput` - it prevents us from being able to put in in our system. If you want to remove unneeded columns, do `dput(head(NCR[,c("REGION", "COMPONENT","SPECIES")]))` — jpsmith, Jun 28 '23 at 17:11

jpsmith · Accepted Answer · 2023-06-28T17:34:40.510

0

You should be able to use dplyr::if_any within your dplyr::filter() here.

You didn't have any of the values in key_terms in you sample data, so 0 rows were returned. I tweaked the key_terms to include "Narra", which is found in a few rows

key_terms <- c('mangrove', 'alba', 'Narra')

filter(NCR, if_any(REGION:SPECIES, 
                   ~grepl(paste(key_terms, collapse='|'), .x)))

Output:

#1 NCR    Urban     "Narra"                                                                                                                                             
#2 NCR    Urban     "Bignay, Camachile, Nangka, Sampaloc, #Santol,Narra,kalumpit,langka,lipote,guyabano,palawan #cherry,banaba,mahogany,Golden\nshower,Mangqa,Bayabas,big…
#3 NCR    Urban     "Banaba, Caballero, Ilang ilang, Narra, Tuai,"

edited Jun 28 '23 at 17:34

answered Jun 28 '23 at 17:03

jpsmith

11,023
5
15
36

Thank you for the code and for your help, unfortunately, I run this but it still selects rows that do not have my target species/words, and varyingly excludes rows that do have them – badiskhiari Jun 30 '23 at 17:06
This is likely an issue with the data themselves, not the approach - if you provide a more complete structure of the data and key words, I can try to help – jpsmith Jun 30 '23 at 17:08
You are probably right, I fear that spaces, capital letters, and input errors may cause this. I'll see what I can do, but I'll add my key terms to the original question. Thanks – badiskhiari Jun 30 '23 at 17:13
So it has fully worked now, thank you! I had to double the list of the key terms to include capitalized versions of each word, plus many more spelling derivatives. Thanks – badiskhiari Jul 10 '23 at 15:50
Great! FYI you can use the command `ignore.case = TRUE` to include both upper and lower case without duplicating (ie, `grepl(..., ignore.case = TRUE)`) – jpsmith Jul 10 '23 at 18:38

Filter a dataframe based on a selection of string values found across multiple columns

1 Answers1