My data looks like this:

dft<- structure(list(ATM1 = c(0.61048, 0.46609, 0.52073, 0.78661, 0.46614, 
0.60211, NA), ATM2 = c(NA, 0.874645, NA, 0.94743, NA, 0.984454, 
NA), ATM3 = c(NA, NA, NA, 0.343564, 0.163544, 0.765422, NA)), .Names = c("ATM1", 
"ATM2", "ATM3"), row.names = c("A0AV96", "A0FGR8", "2A3N6;O14986;O14617", 
"A1L020", "P54792;O14640", "CON__P15497", "Q9H3Y6;CON__H-INV:HIT000016045"
), class = "data.frame")

The row names look like this:

A0AV96                         
A0FGR8                        
2A3N6;O14986;O14617            
A1L020                         
P54792;O14640                  
CON__P15497                    
Q9H3Y6;CON__H-INV:HIT000016045  

I want to strip the CON__ prefix from any row name that contains it, and remove CON__H-INV:HIT000016045 entirely.

Then I want to move each string that follows a ; into its own new row, with the same values as the original row. For example, the output for the data above should look like this:

                                ATM1     ATM2     ATM3
A0AV96                         0.61048       NA       NA
A0FGR8                         0.46609 0.874645       NA
2A3N6                          0.52073       NA       NA
O14986                         0.52073       NA       NA
O14617                         0.52073       NA       NA
A1L020                         0.78661 0.947430 0.343564
P54792                         0.46614       NA 0.163544
O14640                         0.46614       NA 0.163544
P15497                         0.60211 0.984454 0.765422
Q9H3Y6                            NA       NA       NA

For example, the third row has three strings separated by ; (2A3N6;O14986;O14617), so it should produce two additional rows with the same values as the original row.

The output I get is like this:

temp <- strsplit(gsub("(CON__|CON__H-INV:HIT000016045)", "", rownames(dft)),";")
> # use length of list to "grow" dataframe
> dftNew <- dft[rep(seq_along(temp), sapply(temp, length)), ]
> temp <- unlist(temp)
> temp[duplicated(temp)] <- paste(temp[duplicated(temp)],
+                                 seq_along(temp[duplicated(temp)]), sep=".")
> 
> rownames(dftNew) <- unlist(temp)
> dftNew$id <- rep(seq_along(temp), sapply(temp, length))
> dftNew
          ATM1     ATM2     ATM3 id
A0AV96 0.61048       NA       NA  1
A0FGR8 0.46609 0.874645       NA  2
2A3N6  0.52073       NA       NA  3
O14986 0.52073       NA       NA  4
O14617 0.52073       NA       NA  5
A1L020 0.78661 0.947430 0.343564  6
P54792 0.46614       NA 0.163544  7
O14640 0.46614       NA 0.163544  8
P15497 0.60211 0.984454 0.765422  9
Q9H3Y6      NA       NA       NA 10
nik

1 Answer

This base R code works:

# get list of rownames, with CON_ stuff dropped and split on ";"
temp <- strsplit(gsub("(CON__|CON__H-INV:HIT000016045)", "", rownames(dft)),";")
# use length of list to "grow" dataframe
dftNew <- dft[rep(seq_along(temp), sapply(temp, length)), ]
# apply new row names
rownames(dftNew) <- unlist(temp)

dftNew
          ATM1     ATM2     ATM3
A0AV96 0.61048       NA       NA
A0FGR8 0.46609 0.874645       NA
2A3N6  0.52073       NA       NA
O14986 0.52073       NA       NA
O14617 0.52073       NA       NA
A1L020 0.78661 0.947430 0.343564
P54792 0.46614       NA 0.163544
O14640 0.46614       NA 0.163544
P15497 0.60211 0.984454 0.765422
Q9H3Y6      NA       NA       NA
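
To see why indexing with the repeated row numbers "grows" the data frame, here is a minimal sketch on a made-up input. The names `toy`, `parts`, `n`, and `idx` are chosen for illustration only and do not come from the question; `lengths(x)` is used as a base R shorthand for `sapply(x, length)` in recent versions of R.

```r
# Toy data frame (hypothetical, not the data from the question)
toy <- data.frame(x = c(10, 20), row.names = c("a;b;c", "d"))

# Split each row name on ";" to get one character vector per original row
parts <- strsplit(rownames(toy), ";")

# Number of pieces per original row
n <- lengths(parts)

# Repeat each original row index once per piece
idx <- rep(seq_along(parts), n)

# Indexing with repeated indices duplicates the corresponding rows
toyNew <- toy[idx, , drop = FALSE]

# One new row name per piece, in the same order as the repeated rows
rownames(toyNew) <- unlist(parts)
toyNew
```

Here the first row is repeated three times (once for each of "a", "b", "c") and the second row once, so the result has four rows that line up with `unlist(parts)`.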

comment 1
If there are duplicate rownames in the final line, you will get a warning message. The data.frame will still work just fine, though you won't be able to print it to screen, for example. The easiest solution given this method is to add a subscript to the dupes as follows:

# apply new row names with dupes
temp <- unlist(temp)
temp[duplicated(temp)] <- paste(temp[duplicated(temp)],
                                seq_along(temp[duplicated(temp)]), sep=".")

rownames(dftNew) <- unlist(temp)
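
As a possible alternative to the manual suffixing above, base R's `make.unique()` also produces unique names. This is a sketch on a made-up vector (`ids` is a hypothetical name), not part of the original method, and its numbering differs slightly: it counts per repeated name rather than across all duplicates.

```r
# Hypothetical vector of split names that contains repeats
ids <- c("P1", "P2", "P1", "P3", "P1")

# make.unique() leaves the first occurrence alone and appends
# ".1", ".2", ... to later occurrences of the same name
uniq <- make.unique(ids)
uniq
# "P1" "P2" "P1.1" "P3" "P1.2"
```

Either approach yields names that can be assigned with `rownames<-`; which one to prefer mostly depends on whether the suffix should restart for each repeated name.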

comment 2
To add an ID variable that maps the row number of the original observation in dft to the new observations in dftNew, you can reuse some of the previous code:

temp <- strsplit(gsub("(CON__|CON__H-INV:HIT000016045)", "", rownames(dft)),";")
dftNew$id <- rep(seq_along(temp), sapply(temp, length))
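
The steps can be put together as one self-contained sketch. It assumes the split list is kept under its own name (`parts`, chosen here for illustration) so that it is not overwritten by `unlist()` before the ID is computed. The longer alternative is listed first in the pattern so the result does not depend on how the regex engine resolves overlapping alternatives; that ordering is a choice made for this sketch.

```r
# Data from the question, rebuilt with data.frame()
dft <- data.frame(
  ATM1 = c(0.61048, 0.46609, 0.52073, 0.78661, 0.46614, 0.60211, NA),
  ATM2 = c(NA, 0.874645, NA, 0.94743, NA, 0.984454, NA),
  ATM3 = c(NA, NA, NA, 0.343564, 0.163544, 0.765422, NA),
  row.names = c("A0AV96", "A0FGR8", "2A3N6;O14986;O14617", "A1L020",
                "P54792;O14640", "CON__P15497",
                "Q9H3Y6;CON__H-INV:HIT000016045")
)

# Drop the CON__ pieces, then split on ";" and keep the result as a list
parts <- strsplit(gsub("(CON__H-INV:HIT000016045|CON__)", "", rownames(dft)), ";")
n <- lengths(parts)

# Repeat each original row once per remaining name
dftNew <- dft[rep(seq_along(parts), n), ]
rownames(dftNew) <- unlist(parts)

# ID = row number of the originating row in dft
dftNew$id <- rep(seq_along(parts), n)
dftNew$id
# 1 2 3 3 3 4 5 5 6 7
```

If `parts` had already been flattened with `unlist()`, every element would have length 1 and the ID would simply count 1 to 10, which is why the list version is needed here.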
lmo
  • This is a warning, not an error, so it will work, but could lead to trouble in, say printing out your df. The easiest solution would be to add some subscript to the row name indicating the dupe. See my additional comment. – lmo Jun 28 '16 at 14:19
  • @nik See my second comment, which will produce an ID variable. – lmo Jun 29 '16 at 12:30
  • Have you tried it? Here is the sequence it produces in the example: 1 2 3 3 3 4 5 5 6 7. So A0AV96 gets 1; A0FGR8 gets 2; 2A3N6, O14986, and O14617 get 3, and so on. Isn't this what you were looking for? – lmo Jun 29 '16 at 13:40
  • You have to use the first version of temp from my original answer. – lmo Jun 29 '16 at 14:06
  • 1. Run through the first part of the answer to create dftNew and its row names. 2. Run through the code in comment 2. – lmo Jun 29 '16 at 14:14
  • I really think you won't want to answer me again :-)) Thanks, man, I got it working. One more question: would it be possible to move the id to the first column? The first solution here, http://stackoverflow.com/questions/3369959/moving-columns-within-a-data-frame-without-retyping, does the trick, but I am wondering if you know a better way to do it. – nik Jun 29 '16 at 14:20
  • Using rcs's answer is usually what I do, but you can do this without `grep`; see jpmarindiaz's answer, the second version: `df[c("id",names(df)[-which(names(df)=="id")])]`. – lmo Jun 29 '16 at 14:26
  • This code adds a dot and an integer to the strings that are duplicated, is that right? Does it always add a . before the integer, or only sometimes? `temp[duplicated(temp)] <- paste(temp[duplicated(temp)], seq_along(temp[duplicated(temp)]), sep=".")` – nik Jun 29 '16 at 15:16
  • Break the expression up into pieces as follows and see what you get: `duplicated(temp)`, then `temp[duplicated(temp)]`, then `seq_along(temp[duplicated(temp)])`, then the full thing. There should always be a "." followed by an integer. – lmo Jun 29 '16 at 15:22
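
Two points from the comments can be checked on small made-up inputs: moving a column to the front, and what each piece of the de-duplication expression returns. The names `toy` and `nm` are hypothetical, and `setdiff()` is used here as one possible variant of the column-reordering idiom quoted above.

```r
# 1. Move the "id" column to the front of a toy data frame
toy <- data.frame(a = 1:2, b = 3:4, id = c(1L, 2L))
toy <- toy[c("id", setdiff(names(toy), "id"))]
names(toy)
# "id" "a" "b"

# 2. Break the de-duplication expression into pieces
nm <- c("P1", "P2", "P1", "P3", "P2")

dup <- duplicated(nm)      # TRUE for second and later occurrences
dup
# FALSE FALSE TRUE FALSE TRUE

nm[dup]                    # the repeated names themselves
# "P1" "P2"

seq_along(nm[dup])         # a running counter over the repeats
# 1 2

# Full expression: paste name, ".", and counter for each repeat
nm[dup] <- paste(nm[dup], seq_along(nm[dup]), sep = ".")
nm
# "P1" "P2" "P1.1" "P3" "P2.2"
```

Because `sep = "."` is fixed in the `paste()` call, every renamed element gets a dot followed by an integer; the counter runs across all repeats rather than restarting for each name.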