Split long dataframe values into smaller frame values having repeated columns

Question

I am using vcfR package. I have to break the repeated col. values into same col vertically displaying the output.

For e.g.:

Sample Chr p-value AF MQ   Sample Chr p-value AF MQ   Sample Chr p-value AF MQ    
A1      1  0.0533  30 40     A1    1  0.0633  35 45    A1     2  0.0753  35 45

I am trying to get the output,

Sample    Chr    p-value    AF     MQ
A1         1     0.0533     30     40  
A1         1     0.0633     35     45  
A1         2     0.0753     35     45

I am trying with groupby function, sapply function but unable to get such output as displayed above in the frame. Pls help..

It's private data. I have a long dataset in which each field contain lot of parameters. I separated the parameters and transposed them in a matrix. Now I have to make that matrix into smaller frame in which few cols will be present and their values. If the col repeats then it's correspondent value will be splitted under respective col but col heading will not come. It is as mentioned in the sample, exactly such output I need. — Rajkishore90, Jan 29 '18 at 10:09
Yes, it is wide with repeated cols and their values. I have to put values of respective cols under col names preventing them to repeat .. — Rajkishore90, Jan 29 '18 at 10:23
There might be a better way if you could show us the vcfR code that produced your output. — zx8754, Jan 29 '18 at 10:27

chinsoon12 · Answer 1 · 2018-01-29T10:41:31.557

0

You can create a list of column indices, access the set of columns to get a list and then rbind the elements in the list

numCols <- unique(diff(which(colnames(df)=="Sample")))
indices <- split(seq_len(ncol(df)), ceiling(seq_len(ncol(df))/numCols))
do.call(rbind, lapply(indices, function(x) {
    df[,x]      
}))

data:

df <- read.table(text="Sample Chr p-value AF MQ   Sample Chr p-value AF MQ   Sample Chr p-value AF MQ    
A1      1  0.0533  30 40     A1    1  0.0633  35 45    A1     2  0.0753  35 45", 
    header=TRUE,
    check.names=FALSE)

based on OP's comments, it seems like the original data was in a long format with colnames in one column and data in the 2nd column as follows:

df2 <- data.frame(V1=c("Sample","Chr","p-value","AF","MQ","Sample","Chr","p-value","AF","MQ"),
    V2=c("A1","1","0.0533","30","40","A1","1","0.0633","35","45"))

we can do something similar by splitting along the rows as follows, then rbind the results

do.call(rbind, lapply(split(df2$V2, cumsum(df2$V1=="Sample")), t))

edited Jan 29 '18 at 10:41

answered Jan 29 '18 at 10:17

chinsoon12

25,005
4
25
35

Thanks, but here I have shown 5 cols. I have more cols assume like >50/60 .. and this cols are dynamic, it will change with each dataset. In this case how will I divide ? – Rajkishore90 Jan 29 '18 at 10:21
do you mean how do we get the number of cols instead of hardcoding as 5? something like `unique(diff(which(colnames(df)=="Sample")))` ? – chinsoon12 Jan 29 '18 at 10:21
In the wide data.frame I already have the col names with their values. It is directly reading from the dataset (preprocessed), Final output till now I have got is .. assume in V1 I have all the col names and V2 I have all the values. They are wide as I hv used transposed func. Now I have to split them as shown in eg.. , V1 will have col name without repeating the col names (as shown in eg), V2 will contain the o/p. – Rajkishore90 Jan 29 '18 at 10:28
so you are trying to convert from long wide format? check out `reshape::dcast`, `data.table::dcast.data.table` or `tidyr::spread`. And you really need to update your question. – chinsoon12 Jan 29 '18 at 10:34

score 0 · Answer 2 · answered Jan 29 '18 at 10:22

Here is an alternative solution:

df = read.table(text='Sample Chr p-value AF MQ   Sample Chr p-value AF MQ   Sample Chr p-value AF MQ    
A1      1  0.0533  30 40     A1    1  0.0633  35 45    A1     2  0.0753  35 45',header=T,check.names=F)

library(data.table)
group = ave(seq(ncol(df)), colnames(df), FUN = seq_along)
rbindlist(lapply(seq(max(group)), function(x) {df[,which(group==x)]}),fill=T)

Output:

   Sample Chr p-value AF MQ
1:     A1   1  0.0533 30 40
2:     A1   1  0.0633 35 45
3:     A1   2  0.0753 35 45

Note that this requires that there are no columns that are not duplicated per group. If there is though, you should remove the column with for example df = df[, colnames(df) %in% unique(colnames(df)[duplicated(colnames(df))])].

Hope this helps!

Split long dataframe values into smaller frame values having repeated columns

2 Answers2