Finding matches on a character in more than one position in R

Question

I have a character vector whose elements I want to group by their first and last parts, so that I get a list of the matching elements.

Here is an example element: "20190625_165055_0f4e". The first part is a date. The last 4 characters are a unique identifier. I need to group all elements of the vector that share both of these parts.

I could use a simple regex to match by position, but some elements have more characters in the middle than others, e.g. "20190813_170215_17_1057".

Here is an example vector:

mylist <- c("20190712_164755_1034", "20190712_164756_1034", "20190712_164757_1034",
            "20190719_164712_1001", "20190719_164713_1001", "20190722_153110_1054",
            "20190813_170215_17_1057", "20190813_170217_22_1057",
            "20190828_170318_14_1065")

With this being the desired output:

c("20190712_164755_1034","20190712_164756_1034","20190712_164757_1034")
c("20190719_164712_1001","20190719_164713_1001")
c("20190722_153110_1054")
c("20190813_170215_17_1057","20190813_170217_22_1057")
c("20190828_170318_14_1065")

Edit: simplified my character vector and added the desired output.

Hi APD, if the answers below don't resolve your issue, I agree with akrun that it will be easier to help if you provide some expected output. — Ian Campbell, Jul 05 '20 at 21:49
That was my original answer `split(mylist, sub("^(\\d+)_.*_([^_]+)$", "\\1_\\2", mylist))` — akrun, Jul 05 '20 at 21:52
The following Ruby code would do it, should someone want to translate it to R: `arr.group_by { |s| [s[0,8], s[-4,4]] }.values`. — Cary Swoveland, Jul 06 '20 at 01:35
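
One possible base R translation of the positional idea in the Ruby comment above, offered as a sketch rather than a definitive answer. It assumes every element starts with an 8-character date and ends with a 4-character identifier, as stated in the question; the names `n`, `key` and `groups` are only illustrative.

```r
mylist <- c("20190712_164755_1034", "20190712_164756_1034", "20190712_164757_1034",
            "20190719_164712_1001", "20190719_164713_1001", "20190722_153110_1054",
            "20190813_170215_17_1057", "20190813_170217_22_1057",
            "20190828_170318_14_1065")

# Assumption: the date is always the first 8 characters and the
# identifier is always the last 4 characters of each element.
n <- nchar(mylist)
key <- paste(substr(mylist, 1, 8), substr(mylist, n - 3, n), sep = "_")

# split() returns one character vector per distinct key.
groups <- split(mylist, key)
groups
```

Because the key is built purely from positions, the length of the middle part does not matter here. The trade-off is that it would silently misgroup elements whose date or identifier has a different width.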

akrun · Accepted Answer · 2020-07-05T21:53:41.313

We could remove the middle substring with sub and then split the vector on the result, which gives a list of character vectors.

lst1 <- split(mylist, sub("^(\\d+)_.*_([^_]+)$", "\\1_\\2", mylist))
lst1
#$`20190712_1034`
#[1] "20190712_164755_1034" "20190712_164756_1034" "20190712_164757_1034"

#$`20190719_1001`
#[1] "20190719_164712_1001" "20190719_164713_1001"

#$`20190722_1054`
#[1] "20190722_153110_1054"

#$`20190813_1057`
#[1] "20190813_170215_17_1057" "20190813_170217_22_1057"

#$`20190828_1065`
#[1] "20190828_170318_14_1065"

In the sub, the pattern captures (with (...)) one or more digits (\\d+) from the start (^) of the string, then matches a _, any other characters (.*) up to the last _, and captures the remaining characters that are not a _ ([^_]+) up to the end ($) of the string. In the replacement, we refer to the captured groups through the backreferences (\\1, \\2). In effect, this removes the varying part in the middle, keeps the fixed substrings at the beginning and end, and uses that key to split the character vector.
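
To make the key concrete, here is a small sketch that applies the same pattern to two elements from the question plus one hypothetical value ("badvalue", not from the original data) that does not fit the pattern. The names `x`, `pattern`, `key` and `matched` are illustrative only.

```r
# Two elements from the question and one hypothetical non-matching value.
x <- c("20190712_164755_1034", "20190813_170215_17_1057", "badvalue")

pattern <- "^(\\d+)_.*_([^_]+)$"

# Keep only the leading digits and the part after the last underscore.
key <- sub(pattern, "\\1_\\2", x)

# sub() returns its input unchanged when the pattern does not match,
# so grepl() is one way to spot elements that were not reduced to a key.
matched <- grepl(pattern, x)

data.frame(x, key, matched)
```

The first two elements are reduced to "20190712_1034" and "20190813_1057" regardless of how long the middle part is, while the unmatched value passes through as is and would end up in a group of its own.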

That almost works. When the middle substring is of different length it treats it differently. Running the suggestion on mylist results in correct duplicates except for characters with substrings with an extra two characters e.g. substrings are typically dddddd, but these are dddddd_dd — APD, Jul 05 '20 at 21:39

Ian Campbell · Answer 2 · 2020-07-05T21:58:14.110

Here's an alternative approach with extract from tidyr.

library(tidyr)
result <- as.data.frame(mylist) %>%
  extract(1, into = c("date","var1","var2"),
          regex = "(^[0-9]{8}_[0-9]{6})_?(.*)?_([^_]+$)",
          remove = FALSE)
result
#                    mylist            date var1 var2
#1     20190625_165055_0f4e 20190625_165055      0f4e
#2     20190625_165056_0f4e 20190625_165056      0f4e
#3     20190625_165057_0f4e 20190625_165057      0f4e
#4     20190712_164755_1034 20190712_164755      1034
#...
#27 20190828_170318_14_1065 20190828_170318   14 1065
#28 20190828_170320_26_1065 20190828_170320   26 1065
#...

Now you can easily manipulate the data based on those variables.

split(result,result$var2)
#$`0f22`
#                 mylist            date var1 var2
#29 20190917_165157_0f22 20190917_165157      0f22
#
#$`0f2a`
#                 mylist            date var1 var2
#18 20190813_152856_0f2a 20190813_152856      0f2a
#19 20190813_152857_0f2a 20190813_152857      0f2a
#...
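
The split above groups on the identifier alone. If the grouping should also respect the calendar date, as the question asks, one option is to split on two keys at once. The sketch below assumes `result` has the columns shown above; a small hand-built stand-in is used so that it runs without tidyr, and its last row is hypothetical, added only to show an identifier reused on another day. The `day` column and the name `by_day_id` are not part of the original answer.

```r
# Hand-built stand-in for `result` (same column names as above).
result <- data.frame(
  mylist = c("20190712_164755_1034", "20190712_164756_1034",
             "20190813_170215_17_1057", "20190813_170217_22_1057",
             "20190828_170318_14_1057"),  # hypothetical row
  date = c("20190712_164755", "20190712_164756",
           "20190813_170215", "20190813_170217", "20190828_170318"),
  var1 = c("", "", "17", "22", "14"),
  var2 = c("1034", "1034", "1057", "1057", "1057"),
  stringsAsFactors = FALSE
)

# `date` still contains the time, so keep only the 8-digit day.
result$day <- substr(result$date, 1, 8)

# Split on day and identifier together; drop = TRUE discards
# day/identifier combinations that never occur in the data.
by_day_id <- split(result$mylist, list(result$day, result$var2), drop = TRUE)
by_day_id
```

With this stand-in, splitting on `var2` alone gives two groups, whereas splitting on day and identifier keeps the hypothetical 20190828 row separate from the 20190813 rows.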

score 0 · Answer 3 · answered Jul 06 '20 at 00:27

We can use extract to extract the date part and last 4 characters into separate columns. We then use group_split to split data based on those 2 columns.

tibble::tibble(mylist) %>%
  tidyr::extract(mylist, c('col1', 'col2'), regex = '(.*?)_.*_(.*)',
                 remove = FALSE) %>%
  dplyr::group_split(col1, col2, .keep = FALSE)

#[[1]]
# A tibble: 3 x 1
#  mylist              
#  <chr>               
#1 20190712_164755_1034
#2 20190712_164756_1034
#3 20190712_164757_1034

#[[2]]
# A tibble: 2 x 1
#  mylist              
#  <chr>               
#1 20190719_164712_1001
#2 20190719_164713_1001

#[[3]]
# A tibble: 1 x 1
#  mylist              
#  <chr>               
#1 20190722_153110_1054
#...
