
I have a data frame (df):

    V1    V2
1 "BCC"  Yes
2 "ABB"  Yes

I want to find all the strings that contain a certain set of characters, regardless of their order. For example, if I have the string "CBC" or "CCB", I would like to get:

    V1    V2
1 "BCC"  Yes

I've tried grep, but it only finds strings that contain the pattern exactly as written:

>df[grep("CBC", df$V1),]
[1] V1 V2
<0 rows> (or 0-length row.names)

>df[grep("BCC", df$V1),]
   V1   V2
1 "BCC" Yes
aipam
  • will they always be strings of 3-letters? – SymbolixAU Jun 28 '18 at 23:10
  • @SymbolixAU yup, only 3 letters – aipam Jun 28 '18 at 23:11
  • If the sequence to match is BBC do you require there to be at least two "B"s? – Dason Jun 28 '18 at 23:15
  • Try just `df[grepl("^[CB]+$", df$V1),]` where `^[CB]+$` matches any string containing 1 or more `B` or/and `C` chars. If you want to only match 3-char strings, replace `+` with `{3}`. – Wiktor Stribiżew Jun 28 '18 at 23:15
  • @Dason yes. I have to match all the possible 3-letter strings that are in another Dataframe. So it should be also “BAB”, “BBA” and so on. This dataframe is like a lookup table – aipam Jun 28 '18 at 23:19
  • Your question is not clear. If you have another dataset and you want to compare with all the elements from that dataset, it should be included in the question – akrun Jun 28 '18 at 23:27

4 Answers

4

We can create a logical index by splitting the column

i1 <- sapply(strsplit(df$V1, ""), function(x) all(c("B", "C") %in% x))
df[i1, , drop = FALSE]
#   V1  V2
#1 BCC Yes

If we have two datasets and one is a lookup table ('df2'), then split each column into characters, paste the sorted elements back together, and use %in% to create the logical vector for filtering the rows

v1n <- sapply(strsplit(df1$v1, ""), function(x) paste(sort(x), collapse=""))
v1l <- sapply(strsplit(df2$v1, ""), function(x) paste(sort(x), collapse=""))
df1[v1n %in% v1l, , drop = FALSE]

data

df1 <- data.frame(v1 = c("BCC", "CAB" , "ABB", "CBC", "CCB", "BAB", "CDB"),
     stringsAsFactors = FALSE)
df2 <- data.frame(v1 = c("CBC", "ABB"), stringsAsFactors = FALSE)
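
As a rough sketch, the sorted-letter comparison above can be wrapped in a small helper so that both columns are treated the same way. The name `sort_letters` is made up for illustration; `df1` and `df2` are the example data from this answer. Unlike the `all(... %in% x)` check, this also takes the number of repeated letters into account.

```r
# Hypothetical helper (not part of any package): reduce each string to its
# letters in sorted order, so anagrams such as "CBC" and "BCC" share a key.
sort_letters <- function(x) {
  vapply(strsplit(x, "", fixed = TRUE),
         function(ch) paste(sort(ch), collapse = ""),
         character(1))
}

df1 <- data.frame(v1 = c("BCC", "CAB", "ABB", "CBC", "CCB", "BAB", "CDB"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(v1 = c("CBC", "ABB"), stringsAsFactors = FALSE)

# Keep the rows of df1 whose key appears among the keys of the lookup table
res <- df1[sort_letters(df1$v1) %in% sort_letters(df2$v1), , drop = FALSE]
res$v1
# [1] "BCC" "ABB" "CBC" "CCB" "BAB"
```

"CAB" and "CDB" are dropped because no lookup entry has the same letters with the same counts.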
akrun
  • @RavinderSingh13 Thanks. Have you looked into any courses offered by datacamp, udemy, udactiy etc. Books can help a lot in understanding, but practise is the key. I didn't take any course in R, but I took many courses to learn other languages. My take is that you need constant practice beyond what the course offers – akrun Jun 29 '18 at 02:40
  • Yeah, I am actually reading books(online only) but expertise that you have(truly bottom of my heart I am saying I become fan of you) that will come by practice and moreover I didn't see these many examples apart from basics, I will try my BEST to do practice, thanks for suggestion sir. – RavinderSingh13 Jun 29 '18 at 02:48
  • @RavinderSingh13 Sure, It comes from practise. Suppose, if I start with "awk" on day one and compare it with you, I would be feeling the same thing as you are feeling now. – akrun Jun 29 '18 at 02:56
  • To be honest not comparing :) but made you an ideal so that we could reach somewhere(like a GURU kinda) :) what do you say(R guru) :) – RavinderSingh13 Jun 29 '18 at 02:57
  • @RavinderSingh13 Sorry, I should have used a different word. I meant that you are far ahead on awk than me because I don't practise it. I do understand that learning a language to understand all the quirks and behaviors of functions, takes some time. I would advise to spend at least an hour or half an hour every day and be consistent instead of doing things for 24 hours and then stop – akrun Jun 29 '18 at 03:00
  • Sure sir, I got it. Thanks for your cool advice will try my best to keep the consistency and keep learning from you too, cheers :) – RavinderSingh13 Jun 29 '18 at 03:02
3

In the comments you mention a lookup table. If this is the case, one approach is to join both sets together, then use the regex by Wiktor Stribiżew to flag which combinations are valid.

As I'm joining data sets I'm going to use data.table

Method 1: Join everything

library(data.table)

## dummy data, and a lookup table
dt <- data.frame(V1 = c("BCC", "ABB"))
dt_lookup <- data.frame(V1 = c("CBC","BAB", "CCB"))

## convert to data.table
setDT(dt); setDT(dt_lookup)

## add some indexes to keep track of rows from each dt
dt[, idx := .I]
dt_lookup[, l_idx := .I]

## create a column to join on
dt[, key := 1L]
dt_lookup[, key := 1L]

## join EVERYTHING
dt <- dt[
    dt_lookup
    , on = "key"
    , allow.cartesian = TRUE
]

#regex
dt[
    , valid := grepl(paste0("^[",i.V1,"]+$"), V1)
    , by = 1:nrow(dt)
]

#     V1 idx key i.V1 l_idx valid
# 1: BCC   1   1  CBC     1  TRUE
# 2: ABB   2   1  CBC     1 FALSE
# 3: BCC   1   1  BAB     2 FALSE
# 4: ABB   2   1  BAB     2  TRUE
# 5: BCC   1   1  CCB     3  TRUE
# 6: ABB   2   1  CCB     3 FALSE

Method 2: EACHI join

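A slightly more memory-efficient approach might be to use this technique by Jaap, as it avoids the 'join everything' step and instead joins 'by each i', one row at a time. (This appears to assume that dt is still the original two-row table with the idx and key columns added, not the joined result from Method 1.)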

dt_lookup[
    dt, 
    {
        valid = grepl(paste0("^[",i.V1,"]+$"), V1)
        .(
            V1 = V1[valid]
            , idx = i.idx
            , match = i.V1
            , l_idx = l_idx[valid]
            )
    }
    , on = "key"
    , by = .EACHI
]

#    key  V1 idx match l_idx
# 1:   1 CBC   1   BCC     1
# 2:   1 CCB   1   BCC     3
# 3:   1 BAB   2   ABB     2
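
One caveat that may be worth checking against your data: a character class such as `^[CBC]+$` only tests which letters occur, not how often they occur, so "BBB" would also be flagged as valid against "CBC". Below is a minimal base-R sketch of the difference. The helper `same_letters` is hypothetical and is not part of data.table.

```r
# A character class ignores repetition: "[CBC]" is the same set as "[BC]"
regex_hits <- grepl("^[CBC]+$", c("BCC", "BBB", "ABB"))
regex_hits
# [1]  TRUE  TRUE FALSE

# Hypothetical stricter check: same letters with the same counts
same_letters <- function(a, b) {
  identical(sort(strsplit(a, "")[[1]]), sort(strsplit(b, "")[[1]]))
}

same_letters("CBC", "BCC")
# [1] TRUE
same_letters("CBC", "BBB")
# [1] FALSE
```

If counts matter, a check along these lines could stand in for the `grepl()` call in either method.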
SymbolixAU
  • Great answer. I like that you can see whether each string matches, and it's easy to see whether any or all of them match with `dt[, Reduce(\`|\`, valid), V1]` and `dt[, Reduce(\`&\`, valid), V1]` – IceCreamToucan Jun 28 '18 at 23:58
2

Here is one method using sapply, table, and identical.

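# example data from the question (V1 must be character, not factor)
dat <- data.frame(V1 = c("BCC", "ABB"), V2 = "Yes", stringsAsFactors = FALSE)
# construct a named vector of integers with names in
# alphabetical order: your match
myVal <- c("B"=1L, "C"=2L)
# run through character variable, perform check
sapply(strsplit(dat$V1, ""), function(x) identical(c(table(x)), myVal))
# [1]  TRUE FALSE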

Two key points related to the use of identical and the output of table:

  1. The match vector, myVal, must be integer.
  2. The match vector must be ordered alphabetically. You can do this ahead of time, or after the fact with order, names, and [.

Also, note that I wrapped the output of table in c to strip off undesired attributes while keeping the names.
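
As a sketch of how the match vector could be built from a target string instead of being typed by hand, the same `c(table(...))` step can be applied to the target. The helper name `letter_counts` is made up for illustration, and `dat` here is a small stand-in for the question's data.

```r
# Hypothetical helper: letter counts as a named integer vector.
# table() orders its names alphabetically and c() drops the table attributes,
# so both points above are taken care of automatically.
letter_counts <- function(s) c(table(strsplit(s, "")[[1]]))

# Target built from any ordering of the letters, e.g. "CBC"
target <- letter_counts("CBC")

dat <- data.frame(V1 = c("BCC", "ABB"), stringsAsFactors = FALSE)

res <- vapply(dat$V1,
              function(s) identical(letter_counts(s), target),
              logical(1), USE.NAMES = FALSE)
res
# [1]  TRUE FALSE
```

Because both sides go through the same function, the integer type and the name order always agree.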

lmo
2

You can use stringi::stri_count_regex to see if the number of occurrences in your string matches the table of strsplit(str_to_find, ''). The last reduce("|") means it's checking if there are any matches, so change | to & if you want to check if it matches all the strings in to.find.

set.seed(0)
df <- data.frame(a = replicate(20, paste0(sample(LETTERS[1:3], 3, TRUE), collapse = ''))
                 , stringsAsFactors = FALSE)

to.find <- c("CBB", "CCB")
to.find <- strsplit(to.find, '')

library(tidyverse)
library(stringi)
df$b <- 
sapply(df$a, function(x){
         lapply(to.find, function(y){
           imap(table(y), ~ .x == stri_count_regex(x, .y)) %>% 
             reduce(`&`)}) %>% 
          reduce(`|`)})

df

# a     b
# 1  CAB FALSE
# 2  BCA FALSE
# 3  CCB  TRUE
# 4  BAA FALSE
# 5  ACB FALSE
# 6  CBC  TRUE
# 7  CBC  TRUE
# 8  CAB FALSE
# 9  AAB FALSE
# 10 ABC FALSE
# 11 BBB FALSE
# 12 BAC FALSE
# 13 CCA FALSE
# 14 CBC  TRUE
# 15 BCB  TRUE
# 16 BCA FALSE
# 17 BCC  TRUE
# 18 BCB  TRUE
# 19 AAA FALSE
# 20 ABB FALSE

You can also do it all with map, but that's harder to read

df$b <- 
df$a %>% 
  map(~{x <- .x
        map(to.find, 
            ~imap(table(.x), ~ .x == stri_count_regex(x, .y)) %>% 
              reduce(`&`)) %>% 
          reduce(`|`)})
IceCreamToucan