12

I have a data.frame that looks like this: enter image description here

which has 1000+ columns with similar names.

And I have a vector of those column names that looks like this: enter image description here

The vector is sorted by the cluster_id (which goes up to 11).

I want to sort the columns in the data frame such that the columns are in the order of the names in the vector.

A simple example of what I want is that:

Data:

 A    B    C
 1    2    3
 4    5    6

Vector: c("B","C","A")

Sorted:

 B    C    A
 2    3    1
 5    6    4

Is there a fast way to do this?

user438383
  • 5,716
  • 8
  • 28
  • 43
TYZ
  • 8,466
  • 5
  • 29
  • 60
  • 1
    @AnandaMahto This is not school homework.... I don't know how to sort the columns other than using loop and check the position of the column name in the vector, and then sort it and make data frame. – TYZ Apr 17 '14 at 19:31
  • Why pictures of a table? The chance of getting the right answer increases considerably if you give a [reproducable example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Jaap Apr 17 '14 at 19:45
  • 1
    @Jaap The original data is too large. I can give you a very simple example. Check the edit. – TYZ Apr 17 '14 at 19:48
  • 1
    I have already added a simple example to the question. – TYZ Apr 17 '14 at 20:03
  • @YilunZhang Did you read the thread about reproducable examples? I'm not asking for your whole dataset. You could for example provide us with a `dput` of a part of your data. For example: dput(head(df)) – Jaap Apr 17 '14 at 20:10
  • 1
    @Jaap I have more than 1000 columns, and I cannot get all the data shown. Can't you do with the example I gave? – TYZ Apr 17 '14 at 20:17

3 Answers3

19

UPDATE, with reproducible data added by OP:

df <- read.table(h=T, text="A    B    C
    1    2    3
    4    5    6")
vec <- c("B", "C", "A")
df[vec]

Results in:

  B C A
1 2 3 1
2 5 6 4

As OP desires.


How about:

df[df.clust$mutation_id]

Where df is the data.frame you want to sort the columns of and df.clust is the data frame that contains the vector with the column order (mutation_id).

This basically treats df as a list and uses standard vector indexing techniques to re-order it.

BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • So you add a new column to df, a column of ordered column names? Then how do you sort the column in this order? – TYZ Apr 17 '14 at 19:35
  • @YilunZhang I'm not adding a column to `df`. `df[df.clust$mutation_id]` and `df[df.clust$mutation_id, ]` are completely different. The first one actually does the same as `df[,df.clust$mutation_id]` (i.e. sorts columns) for reasons related to the dual nature of data frames (they are both lists and data frames). The second one sorts rows. – BrodieG Apr 17 '14 at 19:41
13

Brodie's answer does exactly what you're asking for. However, you imply that your data are large, so I will provide an alternative using "data.table", which has a function called setcolorder that will change the column order by reference.

Here's a reproducible example.

Start with some simple data:

mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
matches <- data.frame(X = 1:3, Y = c("C", "A", "B"), Z = 4:6)
mydf
#   A B C
# 1 1 3 5
# 2 2 4 6
matches
#   X Y Z
# 1 1 C 4
# 2 2 A 5
# 3 3 B 6

Provide proof that Brodie's answer works:

out <- mydf[matches$Y]
out
#   C A B
# 1 5 1 3
# 2 6 2 4

Show a more memory efficient way to do the same thing.

library(data.table)
setDT(mydf)
mydf
#    A B C
# 1: 1 3 5
# 2: 2 4 6

setcolorder(mydf, as.character(matches$Y))
mydf
#    C A B
# 1: 5 1 3
# 2: 6 2 4
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
2

A5C1D2H2I1M1N2O1R2T1's solution didn't work for my data (I've a similar problem that Yilun Zhang) so I found another option:

mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
#   A B C
# 1 1 3 5
# 2 2 4 6
matches <- c("B", "C", "A") #desired order

mydf_reorder <- mydf[,match(matches, colnames(mydf))]
colnames(mydf_reorder)
#[1] "B" "C" "A"

match() find the the position of first element on the second one:

match(matches, colnames(mydf))
#[1] 2 3 1

I hope this can offer another solution if anyone is having problems!

jgarces
  • 519
  • 5
  • 17