How to remove common parts of strings in a character vector in R?

Question

Assume a character vector like the following

file1_p1_analysed_samples.txt
file1_p1_raw_samples.txt
f2_file2_p1_analysed_samples.txt
f3_file3_p1_raw_samples.txt

Desired output:

file1_p1_analysed
file1_p1_raw
file2_p1_analysed
file3_p1_raw

I would like to compare the elements and strip as much as possible from the start and end of each string while keeping the results unique.

The above is just an example. The parts to be removed are not necessarily common to all elements, so I need a general solution that does not depend on the strings in this example.

So far I have been able to strip off parts that are common to all elements, provided every element splits into the same number of parts on the separator. Here is the function:

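mf <- function(x, sep) {
    # Split each string on sep; rbind assumes every element has the same number of parts
    xsplit <- strsplit(x, split = sep)
    xdfm <- as.data.frame(do.call(rbind, xsplit))
    res <- list()
    for (i in seq_len(ncol(xdfm))) {
        # Keep only the columns whose values differ between elements
        if (!all(xdfm[, i] == xdfm[1, i])) {
            res[[length(res) + 1]] <- as.character(xdfm[, i])
        }
    }
    # One column per input string; rejoin the kept parts with "_"
    res <- as.data.frame(do.call(rbind, res))
    res <- apply(res, 2, function(x) paste(x, collapse = "_"))
    return(res)
}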

Applying the above function:

1.

> a = c("a_samples.txt","b_samples.txt")
> mf(a,"_")
  V1  V2
 "a" "b"

2.

> b = c("apple.fruit.txt","orange.fruit.txt")
> mf(b,sep = "\\.")
      V1       V2
 "apple" "orange"

If the elements split into different numbers of parts, this doesn't work.
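
One way to read the requirement, and a way around the equal-length restriction, is sketched here. It is only an interpretation, not a confirmed solution: it assumes the strings are split on a literal separator, first drops trailing parts shared by every element, and then keeps the smallest number of trailing parts that still makes all results distinct. The name trim_common is hypothetical and not part of the original post.

```r
# Hypothetical helper illustrating one possible reading of the question
trim_common <- function(x, sep = "_") {
  # Split on a literal separator and reverse, so index 1 is the last part
  parts <- lapply(strsplit(x, sep, fixed = TRUE), rev)

  # Step 1: drop trailing parts shared by every element,
  # always leaving at least one part per element
  while (min(lengths(parts)) > 1 &&
         length(unique(vapply(parts, `[`, character(1), 1))) == 1) {
    parts <- lapply(parts, `[`, -1)
  }

  # Step 2: keep the smallest number k of trailing parts
  # for which all results are distinct
  keep_last <- function(p, k) paste(rev(head(p, k)), collapse = sep)
  for (k in seq_len(max(lengths(parts)))) {
    res <- vapply(parts, keep_last, character(1), k = k)
    if (!anyDuplicated(res)) break
  }
  res
}

files <- c("file1_p1_analysed_samples.txt",
           "file1_p1_raw_samples.txt",
           "f2_file2_p1_analysed_samples.txt",
           "f3_file3_p1_raw_samples.txt")
trimmed <- trim_common(files)
# "file1_p1_analysed" "file1_p1_raw" "file2_p1_analysed" "file3_p1_raw"
```

On the example vector this reproduces the desired output. Other inputs may call for a different rule, for example trimming each element independently rather than to a common number of parts.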

If the part you want to remove is the same across all elements, this is easy: `gsub("_samples.txt", "", [your vector])`. — ulfelder, Apr 12 '17 at 11:20
@Veera why did you remove `f2` and `f3`? They are not the same. — pogibas, Apr 12 '17 at 11:21
@PoGibas Yes. But after removing them, still the resulting strings are unique. I would like to stop removing only when the strings are no longer unique. — Veera, Apr 12 '17 at 11:23
@ulfelder: You need to escape the dot. Otherwise it is only another character (which might be a dot indeed but could be anything else). — Jan, Apr 12 '17 at 11:25
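
To illustrate the point about the dot, here is a small sketch. The vector x is made up for the example: its second element has an "X" where the dot would normally be.

```r
# Made-up input: the second element has "X" in place of the dot
x <- c("a_samples.txt", "b_samplesXtxt")

# An unescaped dot matches any character, so both elements are trimmed
loose <- gsub("_samples.txt", "", x)
# "a" "b"

# An escaped dot matches only a literal "."
strict <- gsub("_samples\\.txt", "", x)
# "a" "b_samplesXtxt"

# fixed = TRUE treats the whole pattern as a literal string
literal <- gsub("_samples.txt", "", x, fixed = TRUE)
# "a" "b_samplesXtxt"
```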

score 1 · Answer 1 · answered Apr 12 '17 at 11:23

What about

files <- c("file1_p1_analysed_samples.txt", "file1_p1_raw_samples.txt", "f2_file2_p1_analysed_samples.txt", "f3_file3_p1_raw_samples.txt")
new_files <- gsub('_samples\\.txt', '', files)
new_files

... which yields

[1] "file1_p1_analysed"    "file1_p1_raw"         "f2_file2_p1_analysed" "f3_file3_p1_raw"

This removes the _samples.txt part from your strings.

answered Apr 12 '17 at 11:23

Jan

No, this is not what I want. I need a solution that automatically determines the unique and non-unique parts of the strings in the vector and removes only the non-unique parts. – Veera Apr 12 '17 at 11:33

Erik Schutte · Answer 2 · 2017-04-12T11:35:05.407

Why not:

strings <- c("file1_p1_analysed_samples.txt",
"file1_p1_raw_samples.txt",
"f2_file2_p1_analysed_samples.txt",
"f3_file3_p1_raw_samples.txt")

sapply(strings, function(x) {
  pattern <- ".*(file[0-9].*)_samples\\.txt"
  gsub(x, pattern = pattern, replacement = "\\1")
})

Anything matched between ( and ) is captured as a group and can be recalled in the replacement with a backreference, here \\1. You can define several groups and refer to them as \\1, \\2, and so on.
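
As a rough sketch of using more than one group, the pattern here captures the file id and the processing stage separately. The pattern and the "stage:file" output format are invented for illustration and assume names shaped like the ones in the question.

```r
x <- "f2_file2_p1_analysed_samples.txt"

# Group 1 captures the file id, group 2 the processing stage
pattern <- ".*(file[0-9]+)_p[0-9]+_([a-z]+)_samples\\.txt"

# Recall both groups in the replacement, in swapped order
swapped <- sub(pattern, "\\2:\\1", x)
# "analysed:file2"
```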

Regarding your comment on Jan's answer: why not define the static parts, paste them together into a pattern, and always wrap each one in parentheses? You can then refer to group i as \\i in the replacement of gsub.
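
A minimal sketch of that idea, assuming the static pieces are known in advance. The vector pieces is hypothetical and hand-written for this example; it is not derived automatically from the data.

```r
strings <- c("file1_p1_analysed_samples.txt",
             "file1_p1_raw_samples.txt",
             "f2_file2_p1_analysed_samples.txt",
             "f3_file3_p1_raw_samples.txt")

# Hypothetical pieces: leading junk, the part to keep, the static suffix
pieces <- c(".*", "file[0-9].*", "_samples\\.txt")

# Wrap every piece in parentheses and glue them into one pattern:
# (.*)(file[0-9].*)(_samples\.txt)
pattern <- paste0("(", pieces, ")", collapse = "")

# Keep only the second group
kept <- sub(pattern, "\\2", strings)
# "file1_p1_analysed" "file1_p1_raw" "file2_p1_analysed" "file3_p1_raw"
```

This still relies on knowing the static pieces up front, so it does not by itself detect which parts are shared.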

Oh, @Jan's answer is way better. Perhaps change his pattern with this: `'.*(file[0-9].*)_samples\\.txt'` and don't forget the back referencing with `\\1` — Erik Schutte, Apr 12 '17 at 11:29
