2

I have a dataframe that I want to sort by a set of columns in such way that I will have similar rows (based on this set of columns). I want to do this so when I apply a certain function (that will check this set of similar blocks) on my real data (large dataframe), it will be fast and won't take much time for the calculation (trying to improve the speed of my function).

Here is my sample data:

########################
####### sample data#####
########################
IDd=c(seq(1,8))
btd=c("O","A","O","AB","B","O","AB","AB")
fg= c(rep(0.0025, each=2),rep(0.00125, each=2),rep(0.0011, each=2),rep(0.0015, each=2))
data.d=data.frame(IDd,btd,A=c(1,0,1,0,0,0,0,1),B=c(1,0,1,0,0,0,0,0),C=c(0,rep(1,7)),D=rep(1,8),E=c(0,0,1,rep(0,4),1),fg,stringsAsFactors=FALSE)

  data.d

  IDd btd A B C D E      fg
1   1   O 1 1 0 1 0 0.00250
2   2   A 0 0 1 1 0 0.00250
3   3   O 1 1 1 1 1 0.00125
4   4  AB 0 0 1 1 0 0.00125
5   5   B 0 0 1 1 0 0.00110
6   6   O 0 0 1 1 0 0.00110
7   7  AB 0 0 1 1 0 0.00150
8   8  AB 1 0 1 1 1 0.00150
####################################
# Here is what i tried 
# output 1
data.d[order(data.d[,3],data.d[,4]),]
  IDd btd A B C D E      fg
2   2   A 0 0 1 1 0 0.00250
4   4  AB 0 0 1 1 0 0.00125
5   5   B 0 0 1 1 0 0.00110
6   6   O 0 0 1 1 0 0.00110
7   7  AB 0 0 1 1 0 0.00150
8   8  AB 1 0 1 1 1 0.00150
1   1   O 1 1 0 1 0 0.00250
3   3   O 1 1 1 1 1 0.00125

# and it does the job as expected (because i created this small data 
# so i know in advance i have only 5 similar rows ): My obs 2,4,5,6 and 7 
# are similar based on the columns A:E

But my Question is when i have a large dataframe (columns > 800) how could i sort my data based on the 800 columns? I tried this on the sample data but it gives me a result that i do not understand (I was expecting the same result as the output 1)

#######################
# output 2
data.d[order(data.d[,3:7]),]
      IDd  btd  A  B  C  D  E      fg
2       2    A  0  0  1  1  0 0.00250
4       4   AB  0  0  1  1  0 0.00125
5       5    B  0  0  1  1  0 0.00110
6       6    O  0  0  1  1  0 0.00110
7       7   AB  0  0  1  1  0 0.00150
NA     NA <NA> NA NA NA NA NA      NA
NA.1   NA <NA> NA NA NA NA NA      NA
NA.2   NA <NA> NA NA NA NA NA      NA
NA.3   NA <NA> NA NA NA NA NA      NA
NA.4   NA <NA> NA NA NA NA NA      NA
NA.5   NA <NA> NA NA NA NA NA      NA
NA.6   NA <NA> NA NA NA NA NA      NA
NA.7   NA <NA> NA NA NA NA NA      NA
NA.8   NA <NA> NA NA NA NA NA      NA
NA.9   NA <NA> NA NA NA NA NA      NA
NA.10  NA <NA> NA NA NA NA NA      NA
NA.11  NA <NA> NA NA NA NA NA      NA
NA.12  NA <NA> NA NA NA NA NA      NA
1       1    O  1  1  0  1  0 0.00250
3       3    O  1  1  1  1  1 0.00125
8       8   AB  1  0  1  1  1 0.00150
NA.13  NA <NA> NA NA NA NA NA      NA
NA.14  NA <NA> NA NA NA NA NA      NA
NA.15  NA <NA> NA NA NA NA NA      NA

Is there any way to sort the data (in a efficient way) by multiple columns? and do you have any idea why i get this output 2 that i do not understand? I took a look at How to sort a dataframe by multiple column(s) but it does not seem to have an anwser to my question. Thank you in advance

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Janet
  • 225
  • 1
  • 6

1 Answers1

1

If we have multiple columns, use do.call with order on the subset of dataset

data.d[do.call(order, data.d[,3:7]),]

Or with arrange

library(dplyr)
data.d %>%
      arrange(across(all_of(3:7)))

The reason for getting that output is because

order(data.d[,3:7])
#[1]  2  4  5  6  7 10 12 13 14 15 16 17 33 34 36 37 38 39  1  3  8  9 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 35 40

it unlist the data into a vector and now we have the order for the elements of the vector instead of the order among the columns

order(unlist(data.d[3:7]))
#[1]  2  4  5  6  7 10 12 13 14 15 16 17 33 34 36 37 38 39  1  3  8  9 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 35 40

The index values showed have values greater than the nrow of the dataset. So, for each element that is greater than those, it will return a NA row because that row doesn't exist in the data

akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you @akrun , so this should work for any set of columns that i choose, but do you know why i get `NA` values when i do `data.d[order(data.d[,3:7]),]` ? – Janet Nov 03 '20 at 00:03
  • 1
    Oh Ok i see @akrun ! Thank you very much for the explanation and your answer! – Janet Nov 03 '20 at 00:04