This is how the data I work with looks (it is SNP data):

AA CC CA GG  
GA CA CC GG  
GG CCCC CAA GG  
CA GG CC GC 

This is how I want it to look after case 2 (row 3 is removed because column 2 contains more than two characters, and every column is split into two):

A A C C C A G G  
G A C A C C G G  
C A G G C C G C

Case 1
What I use at the moment:

mydata <- mydata[which(!nchar(as.character(mydata[,5]))>2),]
mydata <- mydata[which(!nchar(as.character(mydata[,6]))>2),]
mydata <- mydata[which(!nchar(as.character(mydata[,7]))>2),]

I want it to be:

mydata <- mydata[which(!nchar(as.character(mydata[,5:7]))>2),]

The problem is that this version evaluates columns 5:7 together and deletes every row. I want the same filter, but applied to each column separately, not to all of them at once.
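A note on why this happens, plus a minimal sketch of a per-column check. As far as I can tell, `as.character()` on a multi-column data frame returns one deparsed string per column rather than one string per cell, so `nchar()` yields a single large number for each column and no row passes the filter. The toy data frame and its column names below are assumptions for illustration; in the real data the columns to check would be 5:7.

```r
# Toy data frame; column names and values are illustrative only.
mydata <- data.frame(
  V1 = c("AA", "GA", "GG", "CA"),
  V2 = c("CC", "CA", "CCCC", "GG"),
  V3 = c("CA", "CC", "CAA", "CC"),
  V4 = c("GG", "GG", "GG", "GC"),
  stringsAsFactors = FALSE
)

cols <- 2:3  # columns to check (5:7 in the real data)

# sapply() applies nchar() to each column separately and returns a logical
# matrix: one row per data row, one column per checked column.
too_long <- sapply(mydata[cols], function(x) nchar(as.character(x)) > 2)

# Keep only the rows where none of the checked columns is too long.
filtered <- mydata[rowSums(too_long) == 0, , drop = FALSE]
print(filtered)
```

With this toy input, only the third row (containing `CCCC` and `CAA`) is dropped.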
Case 2
My code, which uses these libraries:

library(dplyr)
library(splitstackshape)

This is run for each column and splits the cells. Here it is for column 6:

data2$V6 <- as.character(data2$V6)
data2 <- cSplit(
  data.frame(
    data2 %>%
      rowwise() %>%
      mutate(V6 = V6, V6n = paste(unlist(strsplit(V6, "")), collapse = ","))
  ),
  "V6n", ","
)
data2$V5 <- NULL

I do the same for every column. The problem: I want to do it for all columns at once. A potential solution would be some kind of loop, but I can't make it work. Any help will be appreciated.

David Arenburg
  • You should add data to your question : http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example ; For the second case at least, an expected output would also be nice. – scoa Jan 10 '17 at 11:32
  • noted. I will add sample of how the file looks – A. Stefanov Jan 10 '17 at 11:44

1 Answer

Here's a fully vectorized solution that produces your desired output:

## Paste each row into a single string (one vector element per row)
tmp <- do.call(paste0, mydata)

## Remove too long rows, split and rbind
do.call(rbind, strsplit(tmp[nchar(tmp) == 2 * ncol(mydata)], "", fixed = TRUE))
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] "A"  "A"  "C"  "C"  "C"  "A"  "G"  "G" 
# [2,] "G"  "A"  "C"  "A"  "C"  "C"  "G"  "G" 
# [3,] "C"  "A"  "G"  "G"  "C"  "C"  "G"  "C" 

This results in a matrix, which can easily be converted to a data.frame if needed.
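For instance, the conversion could look like the sketch below. The toy `mydata` mirrors the sample in the question, and the `_a`/`_b` column suffixes are a hypothetical naming scheme chosen only for illustration.

```r
# Toy data mirroring the question's sample; column names are assumptions.
mydata <- data.frame(
  V1 = c("AA", "GA", "GG", "CA"),
  V2 = c("CC", "CA", "CCCC", "GG"),
  V3 = c("CA", "CC", "CAA", "CC"),
  V4 = c("GG", "GG", "GG", "GC"),
  stringsAsFactors = FALSE
)

# Same steps as above: paste rows, drop the too-long ones, split, rbind.
tmp <- do.call(paste0, mydata)
mat <- do.call(rbind, strsplit(tmp[nchar(tmp) == 2 * ncol(mydata)], "", fixed = TRUE))

# Convert the character matrix to a data.frame.
res <- as.data.frame(mat, stringsAsFactors = FALSE)

# Hypothetical naming: two allele columns per original column.
names(res) <- paste0(rep(names(mydata), each = 2), c("_a", "_b"))
print(res)
```

Note that the length check works on the pasted row as a whole, so it assumes that a row is valid exactly when its total length is twice the number of columns.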

David Arenburg
  • This is the solution! – A. Stefanov Jan 10 '17 at 13:35
  • Just an additional question. This worked for me, but if I have other columns with longer values, such as names (an additional column with a reference name), can I keep them, for example as row names in the matrix? – A. Stefanov Jan 10 '17 at 14:56
  • @A.Stefanov I'm not sure what you mean but you can specify row names in a `matrix` or convert it to a `data.frame` and add row names. See [here](http://stackoverflow.com/questions/16032778/how-to-give-rows-and-columns-of-a-matrix-unique-names-when-you-are-uncertain-of) for instance – David Arenburg Jan 10 '17 at 15:07
  • Just one example row: "1 snp1 0 5000650 A A A C C C A C C C C C", but I will try to find a solution. It's not important for me; I'm just curious. Thanks again for the help. – A. Stefanov Jan 10 '17 at 15:32
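
One possible way to keep a reference column as row names, sketched under assumptions: the `id` column, the `geno_cols` selection, and the toy values are all hypothetical, and the genotype columns are assumed to hold two-letter calls as in the question's sample. The idea is to paste and split only the genotype columns, then attach the names of the rows that survive the filter.

```r
# Toy data: 'id' is a hypothetical reference-name column.
mydata <- data.frame(
  id = c("snp1", "snp2", "snp3", "snp4"),
  V1 = c("AA", "GA", "GG", "CA"),
  V2 = c("CC", "CA", "CCCC", "GG"),
  stringsAsFactors = FALSE
)

geno_cols <- c("V1", "V2")  # only these columns are pasted and split

# Paste the genotype columns of each row into one string.
tmp <- do.call(paste0, mydata[geno_cols])

# Rows whose pasted genotypes have exactly two characters per column.
keep <- nchar(tmp) == 2 * length(geno_cols)

# Split the kept rows into single characters and bind them into a matrix.
mat <- do.call(rbind, strsplit(tmp[keep], "", fixed = TRUE))

# Use the reference names of the kept rows as row names.
rownames(mat) <- mydata$id[keep]
print(mat)
```

Row names in a matrix can then be used for lookup, for example `mat["snp4", ]`.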