Here is a sample of the data I am working with, which consists of two columns, V1 and V2:

         V1        V2
1   A415Z A415Z   1.010
2   A415J A415Z   0.960
3   B416X A415Z   0.980
4   B416Z A415Z   0.990
5   B416J A415Z   1.020
6   B416M A415Z   1.085
7   B416P A415Z   6.380
8   B416W A415Z   0.995
9   D420R A415Z   0.995
10  D420H A415Z   0.975
11  B416X B416X   0.950
12  B416Z B416X   0.960
13  B416J B416X   0.990
14  B416M B416X   1.055

In the first column, "V1", I want to remove the rows in which the two words start with the same character. For example, in the first two rows and the last four rows the elements are A415Z A415Z, A415J A415Z, B416X B416X, B416Z B416X, B416J B416X and B416M B416X, so those rows should be dropped. The output should look like this:

         V1         V2
1   B416X A415Z   0.980
2   B416Z A415Z   0.990
3   B416J A415Z   1.020
4   B416M A415Z   1.085
5   B416P A415Z   6.380
6   B416W A415Z   0.995
7   D420R A415Z   0.995
8   D420H A415Z   0.975

How can I use a regular expression here? If there is a better method, suggestions would be helpful.

  • You should split up your questions into two: 1. How can I detect whether two words start with the same character. 2. How can I keep only those rows matching a certain predicate. – Micha Wiedenmann May 05 '17 at 10:27

3 Answers

Another possibility is to use the stringr package to extract and compare the first letters of each word:

library(stringr)

# keep the rows whose words do not all start with the same letter
df[unlist(lapply(str_extract_all(df$V1, '(?<=\\b)([A-z])'),
                 function(i) length(unique(i)) != 1)), ]

#            V1    V2
#3  B416X A415Z 0.980
#4  B416Z A415Z 0.990
#5  B416J A415Z 1.020
#6  B416M A415Z 1.085
#7  B416P A415Z 6.380
#8  B416W A415Z 0.995
#9  D420R A415Z 0.995
#10 D420H A415Z 0.975

A different, simplified regex (as @Wiktor Stribiżew mentions in comments) would be

str_extract_all(df$V1, '\\b[A-Za-z]')
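
To make the snippet reproducible, here is a minimal sketch that rebuilds the sample data from the question and applies the simplified regex. The way `df` is constructed and the names `first_letters`, `keep` and `res` are assumptions for illustration only.

```r
library(stringr)

# Sample data rebuilt from the question (assumed construction)
df <- data.frame(
  V1 = c("A415Z A415Z", "A415J A415Z", "B416X A415Z", "B416Z A415Z",
         "B416J A415Z", "B416M A415Z", "B416P A415Z", "B416W A415Z",
         "D420R A415Z", "D420H A415Z", "B416X B416X", "B416Z B416X",
         "B416J B416X", "B416M B416X"),
  V2 = c(1.010, 0.960, 0.980, 0.990, 1.020, 1.085, 6.380, 0.995,
         0.995, 0.975, 0.950, 0.960, 0.990, 1.055),
  stringsAsFactors = FALSE
)

# First letter of every word in each V1 string (a list of character vectors)
first_letters <- str_extract_all(df$V1, "\\b[A-Za-z]")

# TRUE where the words in a row start with more than one distinct letter
keep <- vapply(first_letters, function(x) length(unique(x)) > 1, logical(1))

res <- df[keep, ]
res
```

With the sample data this should keep rows 3 to 10, the same rows as in the output shown above.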
Sotos
  • @Sotos: the regex can be "simplified" to `\\b[A-Za-z]` (or even `\\b\\p{L}`) since a word boundary is itself a zero-width assertion, and `[A-z]` [matches more than just letters](http://stackoverflow.com/a/29771926/3832970). – Wiktor Stribiżew May 05 '17 at 11:20
  • @WiktorStribiżew thank you for the link & suggestions. Very helpful. – Sotos May 05 '17 at 11:51
Using tidyr:

library(dplyr)
library(tidyr)

df1 %>%
  separate(V1, c("V1_1", "V1_2"), remove = FALSE) %>% 
  mutate(V1_1 = substr(V1_1, 1, 1),
         V1_2 = substr(V1_2, 1, 1)) %>% 
  filter(V1_1 != V1_2) %>% 
  select(V1, V2)

Separate the first column into two, use substr to take the first character of each part, then filter to keep the rows where those characters differ.
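
The same split-and-compare idea can also be sketched in base R with `strsplit`, which is a swapped-in alternative rather than the tidyr code above. The small `df1` below is an assumed subset of the question's data, and `words`, `differs` and `res` are hypothetical names.

```r
# Assumed subset of the question's data, for illustration only
df1 <- data.frame(
  V1 = c("A415Z A415Z", "B416X A415Z", "D420R A415Z", "B416X B416X"),
  V2 = c(1.010, 0.980, 0.995, 0.950),
  stringsAsFactors = FALSE
)

# Split each V1 string on whitespace into its words
words <- strsplit(df1$V1, "\\s+")

# TRUE when the first characters of the words in a row are not all the same
differs <- vapply(words,
                  function(w) length(unique(substr(w, 1, 1))) > 1,
                  logical(1))

res <- df1[differs, ]
res
```

Because it compares the first characters of all words in a string, this sketch does not depend on V1 holding exactly two words.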

zx8754
  • The method works fine but produces a warning. Warning message: Too many values at 289 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ... – Amritha Amarnath May 05 '17 at 10:06
  • @AmrithaAmarnath with your example data, this solution works, please update your input data to reproduce the "warning". – zx8754 May 05 '17 at 10:07
Or a base R option: match one or more digits (\\d+) followed by one or more non-whitespace characters (\\S+) followed by zero or more whitespace characters (\\s*), and replace the match with an empty string (""). For this data, that leaves only the first letter of each word. A second gsub then matches repeated characters ((.)\\1+) and replaces them with an empty string. Finally, take the number of characters (nchar) and keep the rows where it is not equal to 0: elements such as 'AA' collapse to an empty string and are removed, while those such as 'BA' or 'DA' are kept.

df1[nchar(gsub("(.)\\1+", "", gsub("\\d+\\S+\\s*", "", df1$V1)))!=0,]
#          V1    V2
#3  B416X A415Z 0.980
#4  B416Z A415Z 0.990
#5  B416J A415Z 1.020
#6  B416M A415Z 1.085
#7  B416P A415Z 6.380
#8  B416W A415Z 0.995
#9  D420R A415Z 0.995
#10 D420H A415Z 0.975

Or, to be safe, capture the first character of each word explicitly:

df1[nchar(gsub("(.)\\1+", "", gsub("\\b(\\S)\\S+\\s*", "\\1", df1$V1))) !=0,]
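
To see what each step does, the nested call can be unrolled on a few of the strings from the question. This is only a sketch; the vector `x` and the intermediate names `initials`, `collapsed` and `keep` are assumptions for illustration.

```r
x <- c("A415Z A415Z", "B416X A415Z", "B416X B416X")

# Step 1: reduce each word to its first character
initials <- gsub("\\b(\\S)\\S+\\s*", "\\1", x)
initials
# "AA" "BA" "BB"

# Step 2: delete any run of a repeated character
collapsed <- gsub("(.)\\1+", "", initials)
collapsed
# ""   "BA" ""

# Step 3: keep the elements that still have characters left
keep <- nchar(collapsed) != 0
keep
# FALSE  TRUE FALSE
```

The logical vector `keep` is what the one-liner uses to subset the rows of `df1`.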
akrun