Merge rows if previous row contains a string that starts with a particular sign

Question

I have a data frame that looks like this:

df <- data.frame(V1=c(">A1", "aaaa", "bbb", "cccc",
            ">B2", "dddd", "eeeee","ff",
            ">C3", "ggggggg", "hhhhh", "iiiii", "jjjjj"))

This is what I want to get:

df1 <- data.frame(V1=c(">A1", "aaaabbbcccc",
            ">B2", "ddddeeeeeff",
            ">C3", "ggggggghhhhhiiiiijjjjj"))

As you can see, I want to merge all the rows that lie between two rows whose string starts with the ">" sign, concatenating them into a single row. Frankly, I don't know where to start with this. Please advise.

Are those `>A1` etc. actually in your data frame as you shared it here? – Sotos, Mar 24 '23 at 12:40
BTW, it seems unnecessarily circuitous to use `as.data.frame(rbind(..))` here. `data.frame(V1=c(">A1","aaaa",...))` is a lot more straightforward (less obscure) for creating a frame with a single column. – r2evans, Mar 24 '23 at 12:43
Thanks for pointing that out @r2evans, I have edited the code to clear up my example. – Traitor Legions, Mar 24 '23 at 13:09

score 3 · Accepted Answer · r2evans · 2023-03-24T12:46:40.933

We can use cumsum(grepl(.)) for this.

data.frame(
  V1 = unlist(
    by(df$V1, cumsum(grepl("^>", df$V1)),
       function(z) c(z[1], paste(z[-1], collapse = "")))
  )
)
#                        V1
# 11                    >A1
# 12            aaaabbbcccc
# 21                    >B2
# 22            ddddeeeeeff
# 31                    >C3
# 32 ggggggghhhhhiiiiijjjjj

Brief explanation:

grepl(.) returns TRUE for each cell that starts with >; then

cumsum assigns the same number to that row and to every row up to the next occurrence:

grepl("^>", df$V1)
#  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
cumsum(grepl("^>", df$V1))
#  [1] 1 1 1 1 2 2 2 2 3 3 3 3 3

by(.) applies a function to each of those groups; in this case, it returns a vector of length 2, with the >-string first and all the other values concatenated.

The result has the same structure as your df1 (only the row names differ):

df1
#                       V1
# 1                    >A1
# 2            aaaabbbcccc
# 3                    >B2
# 4            ddddeeeeeff
# 5                    >C3
# 6 ggggggghhhhhiiiiijjjjj
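
To inspect the groups that by() operates on, split() can be called with the same cumsum(grepl(.)) index. The following is a minimal sketch rather than part of the answer above; the names grp, groups and merged are made up for illustration. It uses unlist(use.names = FALSE), so the result gets plain 1..6 row names.

```r
# Example data from the question
df <- data.frame(V1 = c(">A1", "aaaa", "bbb", "cccc",
                        ">B2", "dddd", "eeeee", "ff",
                        ">C3", "ggggggg", "hhhhh", "iiiii", "jjjjj"),
                 stringsAsFactors = FALSE)

# Group id: increases by one at every row that starts with ">"
grp <- cumsum(grepl("^>", df$V1))

# One character vector per group, with the header as its first element
groups <- split(df$V1, grp)

# Keep the header, concatenate the remaining values of each group
merged <- data.frame(
  V1 = unlist(lapply(groups, function(z) c(z[1], paste(z[-1], collapse = ""))),
              use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Printing groups shows the three vectors that the anonymous function receives, which can help when adapting the regex to real data.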

edited Mar 24 '23 at 12:46

answered Mar 24 '23 at 12:41

I'm with you @AllanCameron, it seemed to make more sense to produce two columns ... – r2evans Mar 24 '23 at 12:42
Perhaps `grepl('^>', df$V1)` would be better than `grepl('>', df$V1)`? My guess is that this whole thing is an x-y problem, where the real issue is how to import text properly. – Allan Cameron Mar 24 '23 at 12:45
It does seem like an X/Y thing, sure. My guess is that either (a) the text file producer is draconian, or set up to do something completely different; or (b) it was read in wrong or its real structure was lost earlier in data processing. In either of those cases, it seems likely that the real data might require a different regex, in which case the `^` might be a red herring ... either way, I've added it, since it adds specificity. Thanks! – r2evans Mar 24 '23 at 12:48
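
The two-column layout mentioned in the comments could look roughly like the sketch below. It is an illustration only: the column names name and seq, and the choice to strip the leading ">", are assumptions that do not come from the thread. It also assumes that the first row is a header and that every header is followed by at least one value row; otherwise the two columns would not line up.

```r
# Example data from the question
df <- data.frame(V1 = c(">A1", "aaaa", "bbb", "cccc",
                        ">B2", "dddd", "eeeee", "ff",
                        ">C3", "ggggggg", "hhhhh", "iiiii", "jjjjj"),
                 stringsAsFactors = FALSE)

is_header <- grepl("^>", df$V1)   # TRUE for rows such as ">A1"
grp <- cumsum(is_header)          # group id shared by a header and its rows

two_col <- data.frame(
  # Header rows, with the leading ">" removed (an illustrative choice)
  name = sub("^>", "", df$V1[is_header]),
  # Non-header rows, concatenated within each group
  seq = vapply(split(df$V1[!is_header], grp[!is_header]),
               paste, character(1), collapse = "", USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

One row per record tends to be easier to filter and join than alternating header and sequence rows.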

score 1 · Answer 2 · answered Mar 24 '23 at 12:57

Assuming this was originally a FASTA file, use a dedicated package such as Biostrings:

write(as.matrix(df), file = "tmp.fasta")

library("Biostrings")

readDNAStringSet("tmp.fasta")

# DNAStringSet object of length 3:
#     width seq                                                                 names               
# [1]    11 AAAABBBCCCC                                                         A1
# [2]     4 DDDD                                                                B2
# [3]    12 GGGGGGGHHHHH                                                        C3
# Warning message:
# In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec,  :
#   reading FASTA file tmp.fasta: ignored 17 invalid one-letter sequence codes

Note that the example strings are not real sequences: e, f, i and j are not valid DNA codes, so readDNAStringSet() drops them, which explains the shortened sequences and the warning above.

Related post: Read FASTA into a dataframe and extract subsequences of FASTA file

score 1 · Answer 3 · answered Mar 24 '23 at 14:13

You could use string manipulation functions to insert separators at the start and end of each group, then collapse everything into a single string (including the inserted breaks), and then split the groups up using the separators.

Using base-R

df1 <- data.frame(
    V1 = tail(strsplit(paste(sub("^(>.*)", "\n\\1\n", df$V1), collapse = ""), "\n")[[1]], -1),
    stringsAsFactors = FALSE
)

Explanation: sub() matches values starting with ">" and wraps them in newlines, which separates each header from the rows around it. paste() combines everything into one string. strsplit() breaks that string at the newlines, and tail() removes the extraneous empty element at the start.
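
The one-liner can also be unrolled so that each intermediate value is visible. This is only a sketch of the same steps; the variable names marked, joined, pieces and result are made up here, and it assumes that none of the values already contains a newline.

```r
# Example data from the question
df <- data.frame(V1 = c(">A1", "aaaa", "bbb", "cccc",
                        ">B2", "dddd", "eeeee", "ff",
                        ">C3", "ggggggg", "hhhhh", "iiiii", "jjjjj"),
                 stringsAsFactors = FALSE)

# 1. Wrap each header in newlines; other values pass through unchanged
marked <- sub("^(>.*)", "\n\\1\n", df$V1)

# 2. Collapse everything into a single string
joined <- paste(marked, collapse = "")

# 3. Split at the newlines; the first piece is empty because the
#    string starts with a newline
pieces <- strsplit(joined, "\n", fixed = TRUE)[[1]]

# 4. Drop that leading empty piece
result <- tail(pieces, -1)

df1 <- data.frame(V1 = result, stringsAsFactors = FALSE)
```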

The same steps, but spelled out a bit more clearly using dplyr

library(dplyr, warn.conflicts = FALSE)

df %>%
    # Insert separators before and after groups
    mutate(V1 = ifelse(grepl("^>", V1), paste0("\n", V1, "\n"), V1)) %>%
    # Combine all groups into a single string
    summarize(V1 = paste(V1, collapse = "")) %>%
    # Split into groups using the separators; reframe() (dplyr >= 1.1.0)
    # replaces summarize() here, which is deprecated for multi-row results
    reframe(V1 = strsplit(V1, "\n")[[1]]) %>%
    # drop the empty group at the beginning
    filter(V1 != "")
