I want to sort a data frame by multiple columns. Here is a simple data frame I made. How can I sort each column by a different sort type?

using DataFrames

DataFrame(b = ("Hi", "Med", "Hi", "Low"),
      levels = ("Med", "Hi", "Low"),
      x = ("A", "E", "I", "O"), y = (6, 3, 7, 2),
      z = (2, 1, 1, 2))

Ported this over from here.

logankilpatrick

2 Answers

Unlike R, Julia's DataFrame constructor expects the values in each column to be passed as a vector rather than as a tuple, so the first column becomes b = ["Hi", "Med", "Hi", "Low"], and so on for the others.

Also, DataFrames does not expect explicit levels to be given in the way R does it. Instead, the optional keyword argument categorical is available and should be set to "a vector of Bool indicating which columns should be converted to CategoricalVector".

After adding the DataFrames and CategoricalArrays packages:


julia> using DataFrames, CategoricalArrays

julia> xyorz = categorical(rand(("x","y","z"), 5))
5-element CategoricalArray{String,1,UInt32}:
 "z"
 "y"
 "x"
 "x"
 "z"

julia> smallints = rand(1:4, 5)
5-element Array{Int64,1}:
 2
 3
 2
 1
 1

julia> df = DataFrame(A = 1:5, B = xyorz, C = smallints)
5×3 DataFrame
│ Row │ A     │ B            │ C     │
│     │ Int64 │ Categorical… │ Int64 │
├─────┼───────┼──────────────┼───────┤
│ 1   │ 1     │ z            │ 2     │
│ 2   │ 2     │ y            │ 3     │
│ 3   │ 3     │ x            │ 2     │
│ 4   │ 4     │ x            │ 1     │
│ 5   │ 5     │ z            │ 1     │

Now, what do you want to sort? Sorting by B and then C puts A in the order [4, 3, 2, 5, 1]:

julia> sort(df, (:B, :C))
5×3 DataFrame
│ Row │ A     │ B            │ C     │
│     │ Int64 │ Categorical… │ Int64 │
├─────┼───────┼──────────────┼───────┤
│ 1   │ 4     │ x            │ 1     │
│ 2   │ 3     │ x            │ 2     │
│ 3   │ 2     │ y            │ 3     │
│ 4   │ 5     │ z            │ 1     │
│ 5   │ 1     │ z            │ 2     │

julia> sort(df, (:B, :C)).A
5-element Array{Int64,1}:
 4
 3
 2
 5
 1
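To mirror the explicit levels from the R version, one option is to make the column categorical and then set its level order by hand. The following is a minimal sketch, assuming the question's data and the `levels!` and `ordered!` functions from CategoricalArrays; the order Low < Med < Hi is an assumption about what the question intended.

```julia
using DataFrames, CategoricalArrays

# Data from the question, with each column passed as a vector.
df = DataFrame(b = categorical(["Hi", "Med", "Hi", "Low"]),
               x = ["A", "E", "I", "O"],
               y = [6, 3, 7, 2],
               z = [2, 1, 1, 2])

# By default the levels are sorted alphabetically ("Hi", "Low", "Med").
# Reorder them so that Low < Med < Hi (assumed intent), then mark the
# column as ordered so comparisons with < are also allowed.
levels!(df.b, ["Low", "Med", "Hi"])
ordered!(df.b, true)

# Sorting follows the level order; :y breaks ties between the two "Hi" rows.
sorted = sort(df, [:b, :y])
```

With this data, `sorted.x` should come out as `["O", "E", "A", "I"]`, that is Low, Med, Hi, Hi rather than the alphabetical Hi, Hi, Low, Med.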

This is a good place to start: http://juliadata.github.io/DataFrames.jl/stable/

Jeffrey Sarnoff
  • Jeffrey, I particularly appreciated the "from R" approach, perfect for me (and others, I hope). In [Przemyslaw Szufel's answer](https://stackoverflow.com/a/58920641), the column reference used a vector `[:z,:y]` instead of your tuple `(:B, :C)`. While I'm guessing this is more of a "slight-efficiency" thing, is there another compelling reason to use tuples over vectors? – r2evans Nov 18 '19 at 18:57
  • In DataFrames.jl the general design is to use vectors. In this case a tuple is allowed but adds no efficiency, as internally it is converted to a vector anyway. – Bogumił Kamiński Nov 18 '19 at 19:55

Your code was creating a single-row DataFrame containing a tuple, so I corrected it. Note that for nominal variables you would normally use Symbols rather than Strings.

using DataFrames
df = DataFrame(b = [:Hi, :Med, :Hi, :Low, :Hi],
               x = ["A", "E", "I", "O","A"], 
               y = [6, 3, 7, 2, 1],
               z = [2, 1, 1, 2, 2])

sort(df, [:z,:y])
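
The call above sorts both columns in ascending order. For the "different sort type per column" part of the question, a minimal sketch using the `order` wrapper from DataFrames might look like this; the choice of z descending and y ascending is only an illustrative assumption.

```julia
using DataFrames

df = DataFrame(b = [:Hi, :Med, :Hi, :Low, :Hi],
               x = ["A", "E", "I", "O", "A"],
               y = [6, 3, 7, 2, 1],
               z = [2, 1, 1, 2, 2])

# z descending, then y ascending within each value of z.
mixed = sort(df, [order(:z, rev = true), :y])

# A single rev = true applies to every column in the list.
all_desc = sort(df, [:z, :y], rev = true)
```

With this data, `mixed.y` should be `[1, 2, 6, 3, 7]` and `all_desc.y` should be `[6, 2, 1, 7, 3]`. `order` also accepts `by` and `lt` keywords if a column needs a custom transformation or comparison.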
Przemyslaw Szufel
  • As an additional note: as hinted by Jeffrey Sarnoff, for nominal or ordinal columns a typical approach would be to use `CategoricalVector`. – Bogumił Kamiński Nov 18 '19 at 19:57