Remove rows flanked by two other rows: one contains a string that starts with fixed character ">", the other contains a string of 80 characters

Question

I have a data frame that looks like this:

df <- data.frame(V1 = c(">NP_abc d", "1", "Efg", "hij", "2", "3KL", 
                        "VGSFNGWDGRRHPMRLRHPTGVWEIFVPRLQPGEVYKYEILGAHGILPLKSDPMALATTLPPDTASKISAPLKFEWHDQD",
                        ">WP_mno p", "Rst", "uw", "adb", "cgi",
                        "PGLGIRLDKLASFVVTLSLYAGAYLTEVFRAGLLSIHKGQREAGLAIGLGEWQVRAYIIVPVMLRNVLPALSNNFISLFK",
                        ">GP_hgs i Hhh yuy",
                        "WEGLETPVQVVWRHALLPVIELPLAALHDPEPLNLLDAPLLRLVHAEDPDNQRIVAVLLFHHLIMDHVALDLLSHELQAV"))

I want to remove the rows that lie between two flanking rows. The first flanking row contains a string that always starts with the fixed character ">". The second flanking row contains a string of variable characters with a fixed length of 80.

There is also a secondary objective. If any of the rows tagged for removal contains a string that matches one stored in another vector, it should be appended to the string starting with ">" in the flanking row. The strings should be appended in the order in which they occur in the rows targeted for removal, each preceded by a single space.

Here is the vector for matching the strings:

some_vector <- c("Efg", "hij", "Rst", "adb")

This is what the output should look like:

df1 <- data.frame(V1 = c(">NP_abc d Efg hij",
                        "VGSFNGWDGRRHPMRLRHPTGVWEIFVPRLQPGEVYKYEILGAHGILPLKSDPMALATTLPPDTASKISAPLKFEWHDQD",
                        ">WP_mno p Rst adb",
                        "PGLGIRLDKLASFVVTLSLYAGAYLTEVFRAGLLSIHKGQREAGLAIGLGEWQVRAYIIVPVMLRNVLPALSNNFISLFK",
                        ">GP_hgs i Hhh yuy",
                        "WEGLETPVQVVWRHALLPVIELPLAALHDPEPLNLLDAPLLRLVHAEDPDNQRIVAVLLFHHLIMDHVALDLLSHELQAV"))

Please advise. This problem is way over my head, and I haven't been able to tackle it beyond figuring out the logic behind the transformation of my data frame.

Best regards!

That's from a FASTA file, right? Why not use a FASTA reader to read the data so you get them in the right format in the first place? — Konrad Rudolph, Apr 05 '23 at 08:54
FASTA format it is. I find this approach more educational for me. — Traitor Legions, Apr 05 '23 at 08:57
Maybe have a look at [Read FASTA into a dataframe and extract subsequences of FASTA file](https://stackoverflow.com/questions/21263636/) — GKi, Apr 05 '23 at 08:57
I will try it, but the real problem for me is the occasional lack of uniform formatting in the headers. I need them to look a particular way for further processing with BLAST, run from a Linux command line. — Traitor Legions, Apr 05 '23 at 09:06
Does this discussion answer your question? https://stackoverflow.com/q/64496466/1968 — Konrad Rudolph, Apr 05 '23 at 09:10

Park · Accepted Answer · 2023-04-07T00:21:02.677

It's pretty messy, but you could try this:

library(stringr)
library(dplyr)

# key classifies every line:
#   1 = header (starts with ">") or a string found in some_vector
#   2 = sequence line (exactly 80 characters), which closes a record
#   0 = anything else (these lines are dropped)
key = str_detect(df$V1, "^>") + (nchar(df$V1) == 80) * 2 + df$V1 %in% some_vector

# key2 is a record id: it increases by one on the line after each
# 80-character line, so all lines of one record share the same id.
key2 = cumsum(lag(key == 2, default = FALSE))


res <- c()

for (i in unique(key2)){ # for each record id
  # key == 1 picks the header and the matching strings of this record, in order;
  # they are pasted together with a single space
  first_line <- paste0(df$V1[intersect(which(key == 1), which(key2 == i))], collapse = " ")
  # key == 2 picks the 80-character line that ends this record
  second_line <- df$V1[intersect(which(key == 2), which(key2 == i))]
  res <- c(res, first_line, second_line)
}
res


[1] ">NP_abc d Efg hij"                                                               
[2] "VGSFNGWDGRRHPMRLRHPTGVWEIFVPRLQPGEVYKYEILGAHGILPLKSDPMALATTLPPDTASKISAPLKFEWHDQD"
[3] ">WP_mno p Rst adb"                                                               
[4] "PGLGIRLDKLASFVVTLSLYAGAYLTEVFRAGLLSIHKGQREAGLAIGLGEWQVRAYIIVPVMLRNVLPALSNNFISLFK"
[5] ">GP_hgs i Hhh yuy"                                                               
[6] "WEGLETPVQVVWRHALLPVIELPLAALHDPEPLNLLDAPLLRLVHAEDPDNQRIVAVLLFHHLIMDHVALDLLSHELQAV"
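
The same grouping idea can also be written in base R without an explicit accumulation loop. The following is only a minimal sketch and not part of the original answer: the helper name `clean_records` and its arguments are hypothetical, the sequences are placeholder strings built with `strrep()` rather than the real data, and it assumes the input starts with a header line and that every record ends with exactly one 80-character line.

```r
# Hypothetical helper (not from the original answer): collapse each record to
# "header + matching strings" followed by its 80-character sequence line.
clean_records <- function(x, keep, seq_width = 80) {
  x <- as.character(x)                      # guard against factor columns
  is_header <- startsWith(x, ">")
  is_seq    <- !is_header & nchar(x) == seq_width
  record    <- cumsum(is_header)            # each header opens a new record

  out <- lapply(split(seq_along(x), record), function(idx) {
    lines  <- x[idx]
    header <- lines[is_header[idx]]
    # lines between header and sequence that should be appended, in order
    extras <- lines[!is_header[idx] & !is_seq[idx] & lines %in% keep]
    c(paste(c(header, extras), collapse = " "), lines[is_seq[idx]])
  })
  unlist(out, use.names = FALSE)
}

# Toy input: headers as in the question, placeholder 80-character sequences
x <- c(">NP_abc d", "1", "Efg", "hij", "2", strrep("A", 80),
       ">WP_mno p", "Rst", "uw", "adb", strrep("C", 80),
       ">GP_hgs i", strrep("G", 80))
some_vector <- c("Efg", "hij", "Rst", "adb")

res2 <- clean_records(x, some_vector)
print(res2)
```

With the data frame from the question, the call would be `clean_records(df$V1, some_vector)`. Records are keyed on the header line here rather than on the line after each 80-character row, so any stray lines before the first header would end up in a record of their own.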

For some reason, when I run it, this code returns " " in row [5]. It does fix the corrupted headers, though. @Park — Traitor Legions, Apr 06 '23 at 08:04
@TraitorLegions Oh... I forgot to say that in your example, `>GP_hgs i Hhh yuy` didn't start with `>`, so I added `>` myself. — Park, Apr 06 '23 at 08:37
Could I ask you to describe what the individual lines of your code do? @Park — Traitor Legions, Apr 06 '23 at 08:52
