I have a data frame that looks like this:

df <- data.frame(V1 = c(">NP_abc d", "1", "Efg", "hij", "2", "3KL", 
                        "VGSFNGWDGRRHPMRLRHPTGVWEIFVPRLQPGEVYKYEILGAHGILPLKSDPMALATTLPPDTASKISAPLKFEWHDQD",
                        ">WP_mno p", "Rst", "uw", "adb", "cgi",
                        "PGLGIRLDKLASFVVTLSLYAGAYLTEVFRAGLLSIHKGQREAGLAIGLGEWQVRAYIIVPVMLRNVLPALSNNFISLFK",
                        ">GP_hgs i Hhh yuy",
                        "WEGLETPVQVVWRHALLPVIELPLAALHDPEPLNLLDAPLLRLVHAEDPDNQRIVAVLLFHHLIMDHVALDLLSHELQAV"))

I want to remove the rows that fall between two marker rows. The first marker row contains a string that always starts with the fixed character ">". The second marker row contains a string of variable characters with a fixed length of 80.
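To make the rule concrete, here is a minimal sketch of how the two kinds of marker rows could be flagged in base R. The vector `toy` and its contents are made up for illustration (the 80-character strings are generated with `strrep()`), so treat it as a statement of the rule rather than a solution.

```r
# Hypothetical toy input; the 80-character lines are generated, not real data
toy <- c(">id one", "junk", "Efg", strrep("A", 80), ">id two", strrep("C", 80))

is_header   <- startsWith(toy, ">")  # rows that open a record
is_sequence <- nchar(toy) == 80      # rows that close a record
keep        <- is_header | is_sequence

which(keep)  # 1 4 5 6: rows 2 and 3 sit between markers and would be dropped
```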

I also have a secondary objective. If any of the rows tagged for removal contains a string that matches a string stored in another vector, it should be appended to the string starting with ">" in the flanking row. The strings should be appended in the order of their occurrence in the rows targeted for removal, each preceded by a single space.

Here is the vector for matching the strings:

some_vector <- c("Efg", "hij", "Rst", "adb")

This is what the output should look like:

df1 <- data.frame(V1 = c(">NP_abc d Efg hij",
                        "VGSFNGWDGRRHPMRLRHPTGVWEIFVPRLQPGEVYKYEILGAHGILPLKSDPMALATTLPPDTASKISAPLKFEWHDQD",
                        ">WP_mno p Rst adb",
                        "PGLGIRLDKLASFVVTLSLYAGAYLTEVFRAGLLSIHKGQREAGLAIGLGEWQVRAYIIVPVMLRNVLPALSNNFISLFK",
                        ">GP_hgs i Hhh yuy",
                        "WEGLETPVQVVWRHALLPVIELPLAALHDPEPLNLLDAPLLRLVHAEDPDNQRIVAVLLFHHLIMDHVALDLLSHELQAV"))

Please advise. This problem is way over my head, and I haven't been able to get beyond working out the logic that should drive the transformation of my data frame.

Best regards!

1 Answer

It's pretty messy but you may try

library(stringr)
library(dplyr)

key = str_detect(df$V1, "^>") + (nchar(df$V1) == 80) * 2 + df$V1 %in% some_vector

# str_detect(..., "^>") is TRUE for rows that start with ">", so header rows get key 1
# nchar(...) == 80 marks the sequence rows; multiplying by 2 gives them key 2, which marks where each record ends
# df$V1 %in% some_vector is TRUE for rows that match your vector, so those rows also get key 1

key2 = cumsum(lag(key == 2, default = FALSE))

# Each record ends with its 80-character line, so counting the sequence lines seen before a row gives that row's record id (0, 1, 2, ...).


res <- c()

for (i in unique(key2)) { # loop over the record ids
  # key == 1 picks the header and the matched strings of record i; paste them in order, separated by spaces
  first_line <- paste0(df$V1[key == 1 & key2 == i], collapse = " ")
  # key == 2 picks the 80-character line that closes record i
  second_line <- df$V1[key == 2 & key2 == i]
  res <- c(res, first_line, second_line)
}
res


[1] ">NP_abc d Efg hij"                                                               
[2] "VGSFNGWDGRRHPMRLRHPTGVWEIFVPRLQPGEVYKYEILGAHGILPLKSDPMALATTLPPDTASKISAPLKFEWHDQD"
[3] ">WP_mno p Rst adb"                                                               
[4] "PGLGIRLDKLASFVVTLSLYAGAYLTEVFRAGLLSIHKGQREAGLAIGLGEWQVRAYIIVPVMLRNVLPALSNNFISLFK"
[5] ">GP_hgs i Hhh yuy"                                                               
[6] "WEGLETPVQVVWRHALLPVIELPLAALHDPEPLNLLDAPLLRLVHAEDPDNQRIVAVLLFHHLIMDHVALDLLSHELQAV"
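If you would rather avoid the packages and the loop, the same idea can be sketched in base R by starting a new record at every ">" row instead of after every 80-character row. `collapse_records()` and its argument names are hypothetical, and the sketch assumes the first row of the input is a header row.

```r
# Hypothetical helper, shown only as a sketch; assumes x[1] starts with ">"
collapse_records <- function(x, keep_words) {
  is_header   <- startsWith(x, ">")
  is_sequence <- nchar(x) == 80 & !is_header
  record_id   <- cumsum(is_header)  # every ">" row opens a new record

  out <- lapply(split(seq_along(x), record_id), function(idx) {
    rows  <- x[idx]
    # rows inside the record that match keep_words, in their original order
    extra <- rows[rows %in% keep_words & !is_header[idx] & !is_sequence[idx]]
    # header plus matched strings on one line, followed by the sequence line
    c(paste(c(rows[1], extra), collapse = " "), rows[is_sequence[idx]])
  })
  unlist(out, use.names = FALSE)
}

# Made-up input; the 80-character lines are generated with strrep()
toy   <- c(">id one", "1", "Efg", "hij", "2", strrep("A", 80),
           ">id two", "Rst", "uw", "adb", strrep("C", 80),
           ">id three", strrep("G", 80))
words <- c("Efg", "hij", "Rst", "adb")

res2 <- collapse_records(toy, words)
res2[c(1, 3, 5)]  # ">id one Efg hij" ">id two Rst adb" ">id three"
```

With your data this would be called as `collapse_records(df$V1, some_vector)`.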
Park