I have a dataframe that I want to sort by a set of columns in such way that I will have similar rows (based on this set of columns). I want to do this so when I apply a certain function (that will check this set of similar blocks) on my real data (large dataframe), it will be fast and won't take much time for the calculation (trying to improve the speed of my function).
Here is my sample data:
########################
####### sample data#####
########################
IDd=c(seq(1,8))
btd=c("O","A","O","AB","B","O","AB","AB")
fg= c(rep(0.0025, each=2),rep(0.00125, each=2),rep(0.0011, each=2),rep(0.0015, each=2))
data.d=data.frame(IDd,btd,A=c(1,0,1,0,0,0,0,1),B=c(1,0,1,0,0,0,0,0),C=c(0,rep(1,7)),D=rep(1,8),E=c(0,0,1,rep(0,4),1),fg,stringsAsFactors=FALSE)
data.d
IDd btd A B C D E fg
1 1 O 1 1 0 1 0 0.00250
2 2 A 0 0 1 1 0 0.00250
3 3 O 1 1 1 1 1 0.00125
4 4 AB 0 0 1 1 0 0.00125
5 5 B 0 0 1 1 0 0.00110
6 6 O 0 0 1 1 0 0.00110
7 7 AB 0 0 1 1 0 0.00150
8 8 AB 1 0 1 1 1 0.00150
####################################
# Here is what i tried
# output 1
data.d[order(data.d[,3],data.d[,4]),]
IDd btd A B C D E fg
2 2 A 0 0 1 1 0 0.00250
4 4 AB 0 0 1 1 0 0.00125
5 5 B 0 0 1 1 0 0.00110
6 6 O 0 0 1 1 0 0.00110
7 7 AB 0 0 1 1 0 0.00150
8 8 AB 1 0 1 1 1 0.00150
1 1 O 1 1 0 1 0 0.00250
3 3 O 1 1 1 1 1 0.00125
# and it does the job as expected (because i created this small data
# so i know in advance i have only 5 similar rows ): My obs 2,4,5,6 and 7
# are similar based on the columns A:E
But my Question is when i have a large dataframe (columns > 800) how could i sort my data based on the 800 columns? I tried this on the sample data but it gives me a result that i do not understand (I was expecting the same result as the output 1
)
#######################
# output 2
data.d[order(data.d[,3:7]),]
IDd btd A B C D E fg
2 2 A 0 0 1 1 0 0.00250
4 4 AB 0 0 1 1 0 0.00125
5 5 B 0 0 1 1 0 0.00110
6 6 O 0 0 1 1 0 0.00110
7 7 AB 0 0 1 1 0 0.00150
NA NA <NA> NA NA NA NA NA NA
NA.1 NA <NA> NA NA NA NA NA NA
NA.2 NA <NA> NA NA NA NA NA NA
NA.3 NA <NA> NA NA NA NA NA NA
NA.4 NA <NA> NA NA NA NA NA NA
NA.5 NA <NA> NA NA NA NA NA NA
NA.6 NA <NA> NA NA NA NA NA NA
NA.7 NA <NA> NA NA NA NA NA NA
NA.8 NA <NA> NA NA NA NA NA NA
NA.9 NA <NA> NA NA NA NA NA NA
NA.10 NA <NA> NA NA NA NA NA NA
NA.11 NA <NA> NA NA NA NA NA NA
NA.12 NA <NA> NA NA NA NA NA NA
1 1 O 1 1 0 1 0 0.00250
3 3 O 1 1 1 1 1 0.00125
8 8 AB 1 0 1 1 1 0.00150
NA.13 NA <NA> NA NA NA NA NA NA
NA.14 NA <NA> NA NA NA NA NA NA
NA.15 NA <NA> NA NA NA NA NA NA
Is there any way to sort the data (in a efficient way) by multiple columns? and do you have any idea why i get this output 2
that i do not understand? I took a look at How to sort a dataframe by multiple column(s) but it does not seem to have an anwser to my question. Thank you in advance