2

I have a data frame with roughly 20 000 rows and 215 columns and need to search, in which columns certain keywords occur (if they exist).

There are lots of suggestions for partial matches in a specified column, for example

Selecting data frame rows based on partial string match in a column

Alas, none of these functions seem to allow to search ALL columns. One option is of course to write several nested loops.

However, I wonder whether there is a much more efficient way == already existing function to search a) all columns of a data frame (or: all lists within a list)? b) and possibly not to search only for one phrase, but for a list of keywords?

For example

# some data
Species <- c("Acanthurus dussumieri", "Callionymus maculatus", "Eviota prasina", "Gymnogobius urotaenia", "Kyphosus bigibbus")
Column1 <- c(60.1, 106, 78.6, 21.5, 71)
ColumnEgg <- c(11.2, 14.5, 12, 8, NA)
Add_Info <- c("Spawns when water temperatures reach above 15°C.", NA, "females deposit eggs of 1.5 mm diameter on plants. Larvae hatch after 3-13 days.", NA, "55 cm TL newborn weighs 380 g")
    
df <- data.frame(Species, Column1, ColumnEgg, Add_Info)
df

Now it is easy to search, if one knows in which column to look for a pattern, e.g.

library(stringr)
library(dplyr)

df%>%
  filter(str_detect(Species,"Aaptosyax"))

However: how to search all columns for a phrase or a list of keywords, like

df%>%
      filter(str_detect(df[1:4],"Aaptosyax"))

or

keywords <- c("Aaptosyax", "egg")
df%>%
          filter(str_detect(df[1:4],keywords))

Thanks a lot for any help!

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
user2006697
  • 1,107
  • 2
  • 11
  • 25

4 Answers4

3

A base R option using subset + grepl + rowSums

subset(
  df,
  rowSums(sapply(df, grepl, pattern = "Aaptosyax")) > 0
)
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
2

If you use :

grep(pattern = your_pattern, x = your_dataframe)

So if you search the word "planet" and your dataframe is named my_df then :

grep(pattern = "planet", x = my_df)

For a list of keywords you can separate them by a pipe '|' in the pattern :

grep(pattern = "planet|egg", x = my_df)

It will return the all the column where the pattern matched.

DataM
  • 351
  • 1
  • 7
  • Something like `grep("Acanthurus", x=df)` (or `grepl`) returns the *column* with a match, but not the rows (which would be necessary to use within `filter(grepl(..))`. Unfortunately, I think the OP is stuck iterating over columns with something. – r2evans Sep 01 '21 at 11:51
2

Use if_any -

library(dplyr)
library(stringr)

df %>% filter(if_any(1:4, str_detect, "Aaptosyax"))

For multiple keywords, collapse them into one string.

keywords <- c("Aaptosyax", "egg")
df %>% filter(if_any(1:4, str_detect, paste0(keywords, collapse = '|')))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Sorry, one more question to this solutions: how/where to include case-insensitiv search for str_detect, i.e. (?i) or ignore_case = TRUE? Thanks a lot! – user2006697 Aug 31 '21 at 14:17
  • That would be `df %>% filter(if_any(1:4, ~str_detect(., regex("Aaptosyax", ignore_case = TRUE))))` and for the second case `df %>% filter(if_any(1:4, ~str_detect(., regex(paste0(keywords, collapse = '|')))))` – Ronak Shah Aug 31 '21 at 14:20
  • Thanks a million! Really, very much appreciated! – user2006697 Aug 31 '21 at 14:39
  • HI, I tried to add column wise, i.e. that the returned result contains only rows and only columns with matches - alas i cannot get select() working. May I ask for your help once more? – user2006697 Sep 02 '21 at 08:14
  • Sorry, I don't understand the question. Maybe you want to ask that as a new question ? – Ronak Shah Sep 02 '21 at 10:13
2

Sticking with dplyr you can use across to filter through various columns.

df %>%
 filter(if_any(everything(), ~str_detect(.,keywords))

The colwise vignette is a good guide https://dplyr.tidyverse.org/articles/colwise.html

Quixotic22
  • 2,894
  • 1
  • 6
  • 14