
I have data like the following:

ABCB9  
rs11057374  
rs7138100  
rs11057375  
rs12309481  
END  

ABCC10  
rs1214748  
END  

ABCC2  
rs928578  
rs10883039  
END  

ABCC4  
rs12428035  
rs9561933  
rs9302086  
rs3848077  
rs3099362    
END 

Using this data, I want to produce output like the following:

rs11057374  ABCB9  
rs7138100   ABCB9  
rs11057375  ABCB9  
rs12309481  ABCB9  



rs1214748  ABCC10   



rs928578    ABCC2    
rs10883039  ABCC2    



rs12428035  ABCC4    
rs9561933   ABCC4    
rs9302086   ABCC4    
rs3848077   ABCC4    
rs3099362   ABCC4  

The blank lines and the "END" markers do not need to appear in the output.

How can I produce this output in R or on Linux?

Ronak Shah
J Choi

1 Answer


Read the file with readLines, strip leading and trailing whitespace (trimws), and drop the elements that are blank ("") or "END". Then build a grouping index ('i1') that increments at every line that does not start with "rs" (in the example provided, those lines are the gene names). Extract the first element of each group from 'lines2' ('nm1'), split 'lines2' by 'i1', drop the first element of each list element (the gene name itself), set the names of the list to 'nm1', and stack the list to convert it to a data.frame.

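# 'lines' is the raw input read with readLines() (see "data" below)
lines1 <- trimws(lines)                           # strip leading/trailing whitespace
lines2 <- lines1[!lines1 %in% c("END", "")]       # drop "END" markers and blank lines
i1 <- cumsum(!grepl("^rs", lines2))               # new group at each line not starting with "rs"
nm1 <- lines2[ave(i1, i1, FUN = seq_along) == 1]  # first line of each group is the gene name
stack(setNames(lapply(split(lines2, i1), `[`, -1), nm1))  # drop gene line, name by gene, stack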
#     values    ind
#1  rs11057374  ABCB9
#2   rs7138100  ABCB9
#3  rs11057375  ABCB9
#4  rs12309481  ABCB9
#5   rs1214748 ABCC10
#6    rs928578  ABCC2
#7  rs10883039  ABCC2
#8  rs12428035  ABCC4
#9   rs9561933  ABCC4
#10  rs9302086  ABCC4
#11  rs3848077  ABCC4
#12  rs3099362  ABCC4

data

lines <- readLines("yourdata.txt")
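
To see what the grouping index does, here is a minimal sketch that runs the same steps on a small inline sample instead of a file. The inline vector is an assumption made for illustration (a subset of the question's data); the comments show the intermediate values.

```r
# Hypothetical inline sample (subset of the question's data) so no file is needed
lines <- c("ABCC10  ", "rs1214748  ", "END  ", "",
           "ABCC2  ", "rs928578  ", "rs10883039  ", "END  ")

lines2 <- trimws(lines)
lines2 <- lines2[!lines2 %in% c("END", "")]
# lines2: "ABCC10" "rs1214748" "ABCC2" "rs928578" "rs10883039"

i1 <- cumsum(!grepl("^rs", lines2))
# i1: 1 1 2 2 2   (the counter goes up at each gene name)

nm1 <- lines2[ave(i1, i1, FUN = seq_along) == 1]
# nm1: "ABCC10" "ABCC2"   (first line of each group)

res <- stack(setNames(lapply(split(lines2, i1), `[`, -1), nm1))
print(res)
```

The result has one row per rs ID, with the gene name in the 'ind' column.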
akrun
  • Thank you for your comment. In fact, my data also contain other types of variable names, such as "c19_pos56113069" or "c2_pos113254071", rather than rs numbers. How should I modify the lines `i1 <- cumsum(!grepl("^rs", lines2))` and `nm1 <- lines2[ave(i1,i1, FUN=seq_along)==1]` in this case? – J Choi Apr 14 '16 at 04:21
  • @JChoi I can only answer for the example you posted. Do you have `ABC` common in that case? – akrun Apr 14 '16 at 05:48
  • @akrun I don't understand what ABC means. I will post a new question with much more detailed data. If possible, please let me know at the URL below: http://stackoverflow.com/questions/36617498/how-to-make-a-variable-by-extracting-specific-line – J Choi Apr 14 '16 at 08:23
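
For the case raised in the comments, where some IDs do not start with "rs", one possible alternative is to rely on the "END" markers instead of the ID prefix. This is only a sketch under the assumption that every block is closed by an "END" line and that the gene name is the first non-blank line of each block; the sample vector and the names 'is_header', 'block', and 'result' are made up for illustration.

```r
# Hypothetical sample mixing "rs" IDs with positional IDs like those in the comment
lines <- c("ABCB9", "rs11057374", "c19_pos56113069", "END", "",
           "ABCC10", "c2_pos113254071", "END")

x <- trimws(lines)
x <- x[x != ""]                       # drop blank lines
is_end <- x == "END"

# A line is a block header if it is the first line or directly follows an "END"
is_header <- c(TRUE, head(is_end, -1)) & !is_end
block <- cumsum(is_header)            # block number for every line
keep <- !is_header & !is_end          # lines that hold variant IDs

result <- data.frame(id   = x[keep],
                     gene = x[is_header][block[keep]],
                     stringsAsFactors = FALSE)
print(result)
```

Because the grouping depends only on line position relative to "END", it does not matter what the IDs look like, but it will misassign rows if an "END" marker is missing.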