3

I have a dataframe like this :

           G2_ref G10_ref G12_ref G2_alt G10_alt G12_alt
20011953      3      6      0      5       1     5    
12677336      0      0      0      1       3     6  
20076754      0      3      0     12      16     8 
2089670       0      4      0      1      11     9
9456633       0      2      0      3      10     0 
468487        0      0      0      0       0     0

And I'm trying to sort the columns to have finally this column order :

G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt

I tried : df[,order(colnames(df))]

But I had this order :

G10_alt G10_ref G12_alt G12_ref G2_alt G2_ref

If anyone had any idea it will be great.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Erika
  • 69
  • 8

3 Answers3

4

One option would be to extract the numeric part and also the substring at the end and then do the order

df[order(as.numeric(gsub("\\D+", "", names(df))), 
            factor(sub(".*_", "", names(df)), levels = c('ref', 'alt')))]
#          G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
#20011953      3      5       6       1       0       5
#12677336      0      1       0       3       0       6
#20076754      0     12       3      16       0       8
#2089670       0      1       4      11       0       9
#9456633       0      3       2      10       0       0
#468487        0      0       0       0       0       0

data

df <- structure(list(G2_ref = c(3L, 0L, 0L, 0L, 0L, 0L), G10_ref = c(6L, 
0L, 3L, 4L, 2L, 0L), G12_ref = c(0L, 0L, 0L, 0L, 0L, 0L), G2_alt = c(5L, 
1L, 12L, 1L, 3L, 0L), G10_alt = c(1L, 3L, 16L, 11L, 10L, 0L), 
    G12_alt = c(5L, 6L, 8L, 9L, 0L, 0L)), .Names = c("G2_ref", 
"G10_ref", "G12_ref", "G2_alt", "G10_alt", "G12_alt"), 
   class = "data.frame", row.names = c("20011953", 
"12677336", "20076754", "2089670", "9456633", "468487"))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This is what it returns me when I write your command line : `head(rc[order(as.numeric(gsub("\\D+", "", names(rc))),factor(sub(".*_", "", names(rc)), levels = c('ref', 'alt')))])` `integer(0)` – Erika Feb 01 '18 at 09:21
  • @Erika I am assuming your dataset as `data.frame` (updated the dataset I used). If it is a matrix, then use comma, i,.e `rc[, order(...` – akrun Feb 01 '18 at 09:23
3

I am guessing your data is from genetics and looks pretty standard, first columns with ref alleles for all variants then followed by alt alleles for all variants.

Meaning we could just use alternated column index from half way of your dataframe, i.e.: we will try to create this index - c(1, 4, 2, 5, 3, 6) then subset:

ix <- c(rbind(seq(1, ncol(df1)/2), seq(ncol(df1)/2 + 1, ncol(df1))))
ix
# [1] 1 4 2 5 3 6

df1[, ix]
#          G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
# 20011953      3      5       6       1       0       5
# 12677336      0      1       0       3       0       6
# 20076754      0     12       3      16       0       8
# 2089670       0      1       4      11       0       9
# 9456633       0      3       2      10       0       0
# 468487        0      0       0       0       0       0

# or all in one line
df1[, c(rbind(seq(1, ncol(df1)/2), seq(ncol(df1)/2 + 1, ncol(df1))))]
zx8754
  • 52,746
  • 12
  • 114
  • 209
2

An easy solution using dplyr:

library(dplyr)
df <- df %>%
      select(G2_ref, G2_alt, G10_ref, G10_alt, G12_ref, G12_alt)

Perhaps this is less (complicated) code than @akrun's answer, but only really suitable for when you want to order a small number of columns.

tifu
  • 1,352
  • 6
  • 17
  • Thank you, but actually it's a small example, I can have more than 100 columns – Erika Feb 01 '18 at 09:24
  • 1
    If you are looking for more compact and much less complicated then `df[c(1, 4, 2, 5, 3, 6)]` and for this you don't need a package – akrun Feb 01 '18 at 09:31
  • Fair enough, but there is something to be said for working with column names rather than column numbers, especially in terms of reducing error sources. I agree your answer is the better way to go, though. – tifu Feb 01 '18 at 09:50