I have a data frame:

vessel<-c(letters[1:4])
type<-c("Fishery Vessel","NA","NA","Cargo")
class<-c("NA","FISHING","NA","CARGO")
status<-c("NA", "NA", "Engaged in Fishing", "Underway")
df<-data.frame(vessel,type, class, status)

  vessel           type   class             status
1      a Fishery Vessel      NA                 NA
2      b             NA FISHING                 NA
3      c             NA      NA Engaged in Fishing
4      d          Cargo   CARGO           Underway

I would like to subset df so that it contains only the rows relating to fishing (i.e. rows 1:3), which to me means doing something like:

df.sub<-subset(grep("FISH", df) | grep("Fish", df))

But this doesn't work. I've been trying apply() (as in this question) and partial string matching with grep() (as in this question), but I can't seem to pull it all together.

Grateful for any help. My data has tens of columns and up to a million rows, so I'm trying my best to avoid loops if possible, but maybe that's the only way? Thanks!

Cyrus
user2175481

3 Answers

If you want to use apply(), you could compute an index based on the string fish and then subset on it. For each row, the index is the number of values that match fish according to grepl(). Setting ignore.case = TRUE avoids problems with upper- and lower-case text. An index greater than or equal to 1 means at least one value in the row matched, so those are the rows to keep. Here is the code:

#Data
vessel <- letters[1:4]
type <- c("Fishery Vessel", "NA", "NA", "Cargo")
class <- c("NA", "FISHING", "NA", "CARGO")
status <- c("NA", "NA", "Engaged in Fishing", "Underway")
df <- data.frame(vessel, type, class, status, stringsAsFactors = FALSE)
#Subset
#Create an index with apply: number of values per row that contain 'fish'
df$Index <- apply(df[1:4], 1, function(x) sum(grepl('fish', x, ignore.case = TRUE)))
#Filter: keep rows with at least one match
df.sub <- subset(df, Index >= 1)

Output:

  vessel           type   class             status Index
1      a Fishery Vessel      NA                 NA     1
2      b             NA FISHING                 NA     1
3      c             NA      NA Engaged in Fishing     1
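If you would rather not add a helper column, the same idea can be sketched without the Index variable. This is only a minimal variant under the assumptions of the example data: `hits` is a name chosen here for illustration, and `sapply()` is assumed to return a matrix (it does when the data frame has more than one row).

```r
# Rebuild the example data (same values as in the question)
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# One logical column per data-frame column: TRUE where "fish" appears
hits <- sapply(df, grepl, pattern = "fish", ignore.case = TRUE)

# Keep the rows that have at least one hit; df itself is left unchanged
df.sub <- df[rowSums(hits) > 0, ]
```

Because `grepl()` is called once per column rather than once per row, this may scale better on wide-but-long data, though that is worth timing on your own data.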
Duck
  • that's worked and is quite fast for my data, many thanks! – user2175481 Sep 07 '20 at 17:59
  • by the way, why did you add the 'stringsAsFactors=F' when creating the data.frame? Does apply or grepl not work for factors? – user2175481 Sep 07 '20 at 18:23
  • @user2175481 Factors are stored as integer codes with labels. `grepl()` coerces its input with `as.character()`, so it would still match the labels, but since you are testing a condition on strings it is cleaner to keep the variables as text. If that explanation was not clear enough, let me know! – Duck Sep 07 '20 at 18:26
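A quick way to see how `grepl()` treats a factor is to try it on a small one. This is a minimal sketch with made-up values (the names `f`, `matched` and `codes` are illustrative only); it relies on the documented behaviour that `grepl()` coerces its `x` argument with `as.character()`.

```r
# A small factor, like the columns data.frame() created by default before R 4.0.0
f <- factor(c("Fishery Vessel", "Cargo", "Underway"))

# grepl() sees the labels, not the integer codes
matched <- grepl("fish", f, ignore.case = TRUE)

# as.integer() exposes the underlying codes a factor stores
codes <- as.integer(f)
```

Here `matched` is TRUE only for the first element, so the match is made against the labels. Keeping columns as character is still the simpler choice when the goal is text matching.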

Another option you can try, using dplyr and stringr:

library(dplyr)
library(stringr)
df %>% 
  filter_all(any_vars(str_detect(., regex("fish", ignore_case =TRUE))))
#   vessel           type   class             status
# 1      a Fishery Vessel      NA                 NA
# 2      b             NA FISHING                 NA
# 3      c             NA      NA Engaged in Fishing
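In dplyr 1.0.0 and later, `filter_all()` and `any_vars()` are superseded. A roughly equivalent sketch with `if_any()` is shown here; it assumes dplyr 1.0.4 or newer, where `if_any()` was introduced.

```r
library(dplyr)
library(stringr)

# Rebuild the example data (same values as in the question)
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# Keep rows where any column contains "fish", ignoring case
df.sub <- df %>%
  filter(if_any(everything(), ~ str_detect(.x, regex("fish", ignore_case = TRUE))))
```

Replacing `everything()` with a narrower selection, such as `c(type, class, status)`, restricts the search to those columns.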
Tho Vu

In base R, we can use a vectorized option with grepl and Reduce:

subset(df, Reduce(`|`, lapply(df[-1], grepl, pattern = 'fish', ignore.case = TRUE)))
#  vessel           type   class             status
#1      a Fishery Vessel      NA                 NA
#2      b             NA FISHING                 NA
#3      c             NA      NA Engaged in Fishing
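To make the one-liner easier to follow, it can be split into its two steps. This is just an unrolled sketch of the same call; `hits` and `keep` are names introduced here for illustration.

```r
# Rebuild the example data (same values as in the question)
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# Step 1: one logical vector per column, skipping the vessel id column
hits <- lapply(df[-1], grepl, pattern = "fish", ignore.case = TRUE)

# Step 2: fold the list with element-wise OR, so a row is kept
# if any of its columns matched
keep <- Reduce(`|`, hits)

df.sub <- df[keep, ]
```

Each `grepl()` call works on a whole column at once, which is why this avoids an explicit row loop.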
akrun