how to keep the first string per line

Question

I have a data like this

    df1 <- structure(list(V1 = structure(c(1L, 2L, 3L, 5L, 4L), .Label = c("A0A061ACH4;Q95Q10;Q9U1W6", 
"A0A061ACL3;Q965I6;O76618", "A0A061ACR1;Q2XN02;F5GUA3;Q22498", 
"A0A061AL01", "H2FLH3;H2FLH2;A0A061ACT3;A0A061AE24;Q23551-2;Q23551;Q23551-4;Q23551-3;Q23551-5"
), class = "factor"), V2 = c(1L, 5L, 100L, 645L, 11L), V3 = c(67L, 
10L, 33L, 99L, 10L), V4 = c(7L, 16L, 0L, 1L, 5L)), .Names = c("V1", 
"V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -5L
))

What i want is to keep put all the strings under each other and copy the value in front of each line , in front of each string in that line
An expected output should like this

output <- structure(list(V1 = structure(c(1L, 18L, 20L, 2L, 19L, 19L, 10L, 
3L, 17L, 7L, 11L, 9L, 8L, 4L, 5L, 13L, 12L, 15L, 14L, 16L, 6L
), .Label = c("A0A061ACH4", "A0A061ACL3", "A0A061ACR1", "A0A061ACT3", 
"A0A061AE24", "A0A061AL01", "F5GUA3", "H2FLH2", "H2FLH3", "O76618", 
"Q22498", "Q23551", "Q23551-2", "Q23551-3", "Q23551-4", "Q23551-5", 
"Q2XN02", "Q95Q10", "Q965I6", "Q9U1W6"), class = "factor"), V2 = c(1L, 
1L, 1L, 5L, 5L, 5L, 5L, 100L, 100L, 100L, 100L, 645L, 645L, 645L, 
645L, 645L, 645L, 645L, 645L, 645L, 11L), V3 = c(67L, 67L, 67L, 
10L, 10L, 10L, 10L, 33L, 33L, 33L, 33L, 99L, 99L, 99L, 99L, 99L, 
99L, 99L, 99L, 99L, 10L), V4 = c(7L, 7L, 7L, 16L, 16L, 16L, 16L, 
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L)), .Names = c("V1", 
"V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -21L
))

For example, if we look at df1 first line, it looks like this

A0A061ACH4;Q95Q10;Q9U1W6    1   67  7

Then I split the strings after ; and put them under each other and copy the values in front of them. so for the first line of the df1, I will have

  A0A061ACH4    1   67  7
    Q95Q10      1   67  7
    Q9U1W6      1   67  7

Notice that there is a slight mistake in your expected output. The result should be 20 lines. — Roman Luštrik, Nov 12 '16 at 13:26

akrun · Answer 1 · 2016-11-12T13:28:05.053

2

This can be done with separate_rows from tidyr

library(tidyr)
separate_rows(df1, V1, sep=";")

Or use cSplit with long option

library(splitstackshape)
cSplit(df1, 'V1', ';', 'long')

Or using another base R option

lst <- strsplit(as.character(df1$V1), ";")
cbind(V1= unlist(lst), df1[rep(1:nrow(df1), lengths(lst)),-1])

If we need only the substring before the first ;, use sub to match the pattern ; followed by characters until the end of the string (.*) and replace it with blank ("").

df1$V1 <- sub(";.*", "", df1$V1)

edited Nov 12 '16 at 13:28

answered Nov 12 '16 at 13:17

akrun

874,273
37
540
662

is there any other way that does not use packages ? also is it possible to not repeat it and just remove the other strings after ; from each line and keep the first one per line? – Learner Algorithm Nov 12 '16 at 13:24
@LearnerAlgorithm Yes, it is possible (`df1$V1 <- sub(";.*", "", df1$V1)`) , but I looked at your expected output (`output`) and this is what it showed – akrun Nov 12 '16 at 13:25
thanks, above the person gave me a solution without any package use! I cannot thank you for you time because i am new here and i don't have reputation :-( – Learner Algorithm Nov 12 '16 at 13:32
@LearnerAlgorithm It's okay – akrun Nov 12 '16 at 13:33

score 1 · Accepted Answer · answered Nov 12 '16 at 13:25

Here is a "base" way.

# for each line...
out <- do.call(rbind, apply(df1, MARGIN = 1, FUN = function(x) {
  # split by ; and...
  do.call(rbind, sapply(unlist(strsplit(x[1], ";")), FUN = function(y, y2) {
    # ... append the rest of the columns to individual element
    c(y, y2)
  }, simplify = FALSE, y2 = x[2:4]))
  }))

rownames(out) <- NULL

out <- as.data.frame(out)

out

           V1  V2 V3 V4
1  A0A061ACH4   1 67  7
2      Q95Q10   1 67  7
3      Q9U1W6   1 67  7
4  A0A061ACL3   5 10 16
5      Q965I6   5 10 16
6      O76618   5 10 16
7  A0A061ACR1 100 33  0
8      Q2XN02 100 33  0
9      F5GUA3 100 33  0
10     Q22498 100 33  0
11     H2FLH3 645 99  1
12     H2FLH2 645 99  1
13 A0A061ACT3 645 99  1
14 A0A061AE24 645 99  1
15   Q23551-2 645 99  1
16     Q23551 645 99  1
17   Q23551-4 645 99  1
18   Q23551-3 645 99  1
19   Q23551-5 645 99  1
20 A0A061AL01  11 10  5

how to keep the first string per line

2 Answers2