Subsetting a dataframe that contains words from a list in R

Question

I have a database (let's call it "DBA") containing locations for species observations which is aprox. 32k rows long and another database ("DBB") containing aprox. 8.7k names of locations within the area of study. I need to develop a script where I create a subset ("DBC") of DBA consisting of only entries which contain any of the words listed on DBB. It should check each of the 32k rows to look for the words on the first position of DBB, then the second position and so on... It is possible that more than one entry in DBA contain the words "empire state".

That means that if there's a entry on DBB called "empire state", all rows containing these words would be included on DBC. Ideally, this script would consier entries like "somewhere near empire state" or "empire state building". If this is not possible, an exact match of words would suffice.

I know subset() would deliver me exactly what i want, if i had only one location name, such as:

DBA = as_tibble(read.csv("./table.csv"))

DBC = subset(DBA, DBA$locationname == "empire state")

However, I can't make it work on a list, and I have 8.7k names of locations, which I'm not willing to type by hand. I also tried including the select() function on my subset, but I received errors...

I have read answers where such problem was addressed with Python like here, here or here but I'm trying to find a solution using R.

score 2 · Accepted Answer · answered Jan 05 '23 at 22:42

2

If it is a vector of strings with length > 1, use %in%

vec_of_names <- c("empire state", "empire state building")
subset(DBA, locationname %in% vec_of_names)

answered Jan 05 '23 at 22:42

akrun

874,273
37
540
662

This answer is helpful, but I can't type all tiny variations of 8k possible names. I want my list to have only "empire state", but when subsetting DBA, it will also select entries on DBA that have "empire state building" written on it. Meaning the content of a cell doesn't need to be an exact match. – Eric Lino Jan 05 '23 at 22:52
1

@EricLino If you have a vector of unique substring names, you may use `grepl` as well `subset(DBA, Reduce("|", lapply(vec_substring, function(pat) grepl(pat, locationname))))` – akrun Jan 05 '23 at 22:57
@arkun your proposed solution worked for only one entry, then it gave the following error. "In grepl(pat, locationname) : argument 'pattern' has length > 1 and only the first element will be used. Should i warp this with a for snippet, and a rbind? – Eric Lino Jan 06 '23 at 15:22
@EricLino `lapply` is looping over each element of the vec_substring to be applied on grepl, then we `Reduce` to a single logical vector for subsetting – akrun Jan 06 '23 at 18:03
@arkun Okay, but the function suggested doesn't work. Would you like me to provide you with the data so you can assess it better? – Eric Lino Jan 06 '23 at 20:54
@EricLino You can update your post with a small reproducible example using `dput` on a subset of data alogn with your expected output so that it becomes easier to test – akrun Jan 06 '23 at 20:55
@arkun I'm sorry, I've managed to make it work, I was missing an argument. However, now I need to exclude some words from the search. Generic words, such as "river", "forest". Where should I insert it on the code? Thanks in advance! – Eric Lino Jan 06 '23 at 21:16
@EricLino I didn't get your comment. Can you show a small exampl – akrun Jan 06 '23 at 21:19
@arkun data is available at https://wetransfer.com/downloads/17a0bad56b6c0e07b5c9eed3074fc37320230110174355/3a61b8 At line 14 of the code, I want to insert some words that my filter can ignore, but i don't know where to insert it. – Eric Lino Jan 10 '23 at 17:55

Subsetting a dataframe that contains words from a list in R

1 Answers1