0

I have a dataset with duplicate IDs, I'm trying to order the data so the duplicate rows are next to each other and ideally the whole dataset is also ordered by smallest to largest in ID value.

For example my data looks like:

Gene     ID 
Gene1    1:1001
Gene1    5:20000
Gene2    5:20000
Gene3    1:1001
Gene4    10:4000

Expected output ordered by grouping duplicates and then keeping ideally smallest to largest id order too:

Gene     ID 
Gene1    1:1001
Gene3    1:1001
Gene1    5:20000
Gene2    5:20000
Gene4    10:4000

I've been trying to do this with things like:

dfordered <- df[order[duplicated(df$ID)]]
Error: object of type 'closure' is not subsettable

Example input data:

structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene4"
), ID = c(" 1:1001, "5:20000", "5:20000", " 1:1001", 
"10:4000")), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))
DN1
  • 234
  • 1
  • 13
  • 38

1 Answers1

2

Get the first part before colon, convert to numeric, order the data and subset.

df[order(as.numeric(sub(':.*', '', df$ID))), ]

#   Gene      ID
#1 Gene1  1:1001
#4 Gene3  1:1001
#2 Gene1 5:20000
#3 Gene2 5:20000
#5 Gene4 10:4000

gtools::mixedorder does this by default.

df[gtools::mixedorder(df$ID), ]

data

df <- structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene4"
), ID = c("1:1001", "5:20000", "5:20000", "1:1001", "10:4000")), 
class = "data.frame", row.names = c(NA, -5L))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213