I am trying to filter some strings in my data. For example, I want to filter out strings of the form 'AxxBy' but keep strings of the form 'AxxByy', where x and y each stand for a digit.

Here is what I tried,

data <- data.frame(pair=paste(paste('A',c(seq(1:4),10,11),sep=''),paste('B',c(2,3,4,22,33,44),sep=''),sep='')) 
    pair
1   A1B2
2   A2B3
3   A3B4
4  A4B22
5 A10B33
6 A11B44

I want to remove the pairs starting with A1, but not those starting with A10 or A11. The same goes for B2: remove pairs ending in B2 but keep B22, and so on.

x <- c(paste('A',1,sep=''), paste('B',2,sep='')) # filtering conditions

library(dplyr)
df <- data%>%
  filter(!grepl(paste(x,collapse='|'),pair))

 pair
1 A2B3
2 A3B4

In the post "Filtering observations in dplyr in combination with grepl", lines are matched with anchored patterns such as "^x|xx$" via regex functions, but I haven't found a post where the filtering conditions are defined outside of the pipe.

Expected output

  pair
    1   A2B33
    2   A3B4
    3   A4B22
    4   A10B33
    6   A11B44

The rule of thumb is this: if there are two digits after 'A', append 'B' to get 'AxxB' and !grepl everything for the xx numbers defined in the x input. If only 'B' and one digit ('By') is given, !grepl 'By$' but not 'Byy' inputs. Of course this includes 'AxBy$' and 'AxxBy$'; that's all. I still cannot generalize @alistaire's solution!

Alexander
  • What is the rule for which pairs should be filtered out? Is it just "A1" and "B2", i.e. those specific letters paired with those specific numbers? In your explanation you seem to say you want B22 filtered out and B2 kept, but then your expected output shows the opposite. – Marius May 29 '17 at 22:43
  • @Marius The rule is very simple. As decided in the `x`, remove the strings starting with `A1` but not `A10` same thing for `B2`. Sorry I re-edited OP. Thanks for correction. – Alexander May 29 '17 at 22:51
  • The question of @Marius is still a good one: what is the rule? Presumably, you do not have just these 6 entries. Are they all of the form A(some numbers)B(some numbers)? Do you really just want to eliminate A1B(anything) and A(numbers)B2? – G5W May 29 '17 at 23:02
  • `data %>% filter(!grepl('A1B|B2$', pair))`? Your example doesn't line up with your sample data. – alistaire May 29 '17 at 23:19
  • @alistaire thanks for the post. You are not using `x`, just a definition inside of the grepl. I know that. The problem is that I cannot use the `paste` command inside of `grepl` to define many conditions! – Alexander May 29 '17 at 23:27
  • @G5W yes, I don't have just these 6 entries. I have almost 120 entries to remove. Yes, they are all of the form A(some numbers)B(some numbers). Yes, eliminate A1B(anything) and A(numbers)B2. That's correct! But remember that in the real data there is also A3B and others, so in the end maybe grepl with paste functions is required. – Alexander May 29 '17 at 23:31
  • If I understand your response to @alistaire , you are saying that you are _not_ just eliminating A1 and B2. Rather, those were just examples and you really have many patterns to delete. Is that correct? – G5W May 29 '17 at 23:57
  • @G5W that is correct! – Alexander May 30 '17 at 00:06
  • So collapse your pattern: `paste0(x, ifelse(grepl('A', x), 'B', '$'), collapse = '|')` – alistaire May 30 '17 at 00:22
  • Do you want to remove all rows where A is followed by just _one_ digit and B is followed by just _one_ digit, e.g., `A4B6` or `A0B9`? So, the pattern for rows to remove would look like `AxBy` where x and y stand for just one digit, resp.? – Uwe May 31 '17 at 10:46
  • @UweBlock. I realized that I made the simple problem very complex. The rule is if there is two digits after 'A' put B so AxxB and !grepl everything for defined xx numbers in the `x` input. if there is only 'B' and one digit which is 'By' is given !grepl 'By$' not 'Byy' inputs. Of course this includes 'AxBy$' and 'AxxBy$' that's all. I still cannot generalize @alistaire solution! – Alexander Jun 04 '17 at 23:03

1 Answer

The OP has requested to filter out 'AxxBy' strings but wants to keep 'AxxByy' strings (where 'x' and 'y' denote digits).

Often it is easier to specify what to keep than what to remove. To keep strings which obey the pattern 'AxxByy' the regular expression

"^A\\d{2}B\\d{2}$"

can be used, where ^ denotes the beginning of the string, \\d{2} a sequence of exactly two digits, and $ the end of the string. A and B stand for themselves.
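To see what the anchors contribute, here is a minimal base R sketch that compares the anchored pattern with an unanchored variant. The vector `candidates` and the names `anchored` and `unanchored` are made up for this illustration and are not part of the OP's data.

```r
# Hypothetical test strings, chosen only to illustrate the anchors
candidates <- c("A10B33", "A4B22", "A1B2", "AA12B34", "A12B345")

# Anchored: the whole string must be A, two digits, B, two digits
anchored <- grepl("^A\\d{2}B\\d{2}$", candidates)

# Unanchored: it is enough that the pattern occurs somewhere in the string
unanchored <- grepl("A\\d{2}B\\d{2}", candidates)

data.frame(candidates, anchored, unanchored)
#   candidates anchored unanchored
# 1     A10B33     TRUE       TRUE
# 2      A4B22    FALSE      FALSE
# 3       A1B2    FALSE      FALSE
# 4    AA12B34    FALSE       TRUE
# 5    A12B345    FALSE       TRUE
```

Without ^ and $, strings that merely contain a matching substring, such as "AA12B34" or "A12B345", would be kept as well.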

With this regular expression, dplyr, and grepl() can be used to filter the input data frame DF:

library(dplyr)
#which rows are kept?
kept <- DF %>%
  filter(grepl("^A\\d{2}B\\d{2}$", pair))
kept
#    pair
#1 A10B33
#2 A11B44

# which rows are removed?
removed <- DF %>%
  filter(!grepl("^A\\d{2}B\\d{2}$", pair))
removed
#      pair
#1     A1B2
#2     A2B3
#3     A3B4
#4    A4B22
#5       AB
#6        A
#7        B
#8       A1
#9      A12
#10      B1
#11     B12
#12 AA12B34
#13 A12BB34

Note that I've added some edge cases for demonstration.


BTW: dplyr is not required if only the vector pair needs to be filtered. So, in base R the alternative expressions

pair[grepl("^A\\d{2}B\\d{2}$", pair)]
grep("^A\\d{2}B\\d{2}$", pair, value = TRUE)

both return the strings to keep:

[1] "A10B33" "A11B44"

while

pair[!grepl("^A\\d{2}B\\d{2}$", pair)]

returns the removed strings:

 [1] "A1B2"    "A2B3"    "A3B4"    "A4B22"   "AB"      "A"       "B"       "A1"     
 [9] "A12"     "B1"      "B12"     "AA12B34" "A12BB34"
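If the filtering conditions have to live in a vector outside of the pipe, as in the OP's `x`, the pattern can be assembled before it is handed to grepl(). The following is only a sketch along the lines of the `paste0()` suggestion in the comments, under the assumption that every condition is either an 'A' prefix (to be followed by 'B') or a 'B' suffix (to end the string). The names `terminated`, `pattern`, and `kept_by_x` are made up for this illustration.

```r
# Sample data as given by the OP (without the added edge cases)
pair <- paste0("A", c(1:4, 10, 11), "B", c(2, 3, 4, 22, 33, 44))

# Filtering conditions defined outside of the pipe
x <- c("A1", "B2")

# Assumption: an "A.." condition must be followed directly by "B",
# while a "B.." condition must sit at the end of the string
terminated <- ifelse(startsWith(x, "A"),
                     paste0("^", x, "B"),
                     paste0(x, "$"))
pattern <- paste(terminated, collapse = "|")
pattern
# [1] "^A1B|B2$"

# Drop every string that matches any of the conditions
kept_by_x <- pair[!grepl(pattern, pair)]
kept_by_x
# [1] "A2B3"   "A3B4"   "A4B22"  "A10B33" "A11B44"
```

The same `pattern` object can be passed to `filter(!grepl(pattern, pair))` inside a dplyr pipe, so `x` can grow to many conditions without the pipe itself changing.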

Data

As given by the OP but with some edge cases appended:

# create vector of test patterns using paste0() instead of paste(..., sep = "")
pair <- paste0("A", c(1:4, 10, 11), "B", c(2, 3, 4, 22, 33, 44))
# alternatively use sprintf()
pair <- sprintf("A%iB%i", c(1:4, 10, 11), c(2, 3, 4, 22, 33, 44))
# add some edge cases
pair <- append(pair, c("AB", "A", "B", "A1", "A12", "B1", "B12", "AA12B34", "A12BB34"))
# create data frame
DF <- data.frame(pair)
DF
#      pair
#1     A1B2
#2     A2B3
#3     A3B4
#4    A4B22
#5   A10B33
#6   A11B44
#7       AB
#8        A
#9        B
#10      A1
#11     A12
#12      B1
#13     B12
#14 AA12B34
#15 A12BB34
Uwe