Multi-level indexing of data frames in Julia?

Question

How can I apply multi-level indexing to data frames in Julia? Or is there another method, approach, or package that achieves this?

Update

Example Python code:

import numpy as np
import pandas as pd
arrays = [np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
          np.array(["one", "two", "one", "two", "one", "two", "one", "two"])]

df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df

Output: [image not preserved]

Thanks!!

Can you please give a code example, e.g. in pandas (I assume that is the source of your question), of what you want to achieve? Then I can suggest how to do it in DataFrames.jl. In general, use `groupby` to add an index on as many columns of a data frame as you wish. – Bogumił Kamiński, Apr 14 '21 at 08:31
@BogumilKaminski Thanks for the response. I have updated the question with the code. I will also try to work on your suggestion. – Mohammad Saad, Apr 14 '21 at 10:58

score 6 · Accepted Answer · answered Apr 14 '21 at 13:40

I understand your question, but the key point is what you need the index for.

Here is how groupby works:

julia> using DataFrames

julia> df = DataFrame(x=repeat(["bar", "baz"], inner=3), y=repeat(["one", "two"], outer=3), z=1:6)
6×3 DataFrame
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ bar     one         1
   2 │ bar     two         2
   3 │ bar     one         3
   4 │ baz     two         4
   5 │ baz     one         5
   6 │ baz     two         6

julia> groupby(df, :x) # 1-level index
GroupedDataFrame with 2 groups based on key: x
First Group (3 rows): x = "bar"
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ bar     one         1
   2 │ bar     two         2
   3 │ bar     one         3
⋮
Last Group (3 rows): x = "baz"
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ baz     two         4
   2 │ baz     one         5
   3 │ baz     two         6

julia> groupby(df, :y) # 1-level index
GroupedDataFrame with 2 groups based on key: y
First Group (3 rows): y = "one"
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ bar     one         1
   2 │ bar     one         3
   3 │ baz     one         5
⋮
Last Group (3 rows): y = "two"
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ bar     two         2
   2 │ baz     two         4
   3 │ baz     two         6

julia> groupby(df, [:x, :y]) # 2-level index
GroupedDataFrame with 4 groups based on keys: x, y
First Group (2 rows): x = "bar", y = "one"
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ bar     one         1
   2 │ bar     one         3
⋮
Last Group (1 row): x = "baz", y = "one"
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ baz     one         5

Now an example of indexing with a 2-level index:

julia> gdf = groupby(df, [:x, :y]) # 2-level index
GroupedDataFrame with 4 groups based on keys: x, y
First Group (2 rows): x = "bar", y = "one"
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ bar     one         1
   2 │ bar     one         3
⋮
Last Group (1 row): x = "baz", y = "one"
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ baz     one         5

julia> gdf[("bar", "two")]
1×3 SubDataFrame
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ bar     two         2

julia> gdf[("baz", "two")]
2×3 SubDataFrame
 Row │ x       y       z
     │ String  String  Int64
─────┼───────────────────────
   1 │ baz     two         4
   2 │ baz     two         6
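
For readers coming from pandas, the tuple lookup above can be pictured as a hash map from key tuples to groups of rows. The following is only a conceptual sketch in plain Python, not how DataFrames.jl is implemented; `group_rows` is a hypothetical helper written for this illustration, and the rows simply mirror the `df` above.

```python
from collections import defaultdict

# Same data as the df above, written as (x, y, z) rows.
rows = [
    ("bar", "one", 1),
    ("bar", "two", 2),
    ("bar", "one", 3),
    ("baz", "two", 4),
    ("baz", "one", 5),
    ("baz", "two", 6),
]


def group_rows(rows, key_positions):
    """Hypothetical helper: map each key tuple to the list of rows carrying it."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        groups[key].append(row)
    return dict(groups)


one_level = group_rows(rows, (0,))    # analogous to groupby(df, :x)
two_level = group_rows(rows, (0, 1))  # analogous to groupby(df, [:x, :y])

print(two_level[("bar", "two")])
print(two_level[("baz", "two")])
```

Note that the grouped rows still carry their key values, just as the groups above keep the `x` and `y` columns.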

There is, however, a difference between DataFrames.jl and pandas in indexing. For pandas you have (see here for benchmarks):

When index is unique, pandas use a hashtable to map key to value O(1). When index is non-unique and sorted, pandas use binary search O(logN), when index is random ordered pandas need to check all the keys in the index O(N).

while for DataFrames.jl, no matter which source columns you use for indexing, lookup is always O(1).
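
To make the three complexity classes concrete, here is a rough sketch in plain Python of the three lookup strategies the quote describes. It is not the actual code of pandas or DataFrames.jl; the function names and sample indexes are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def lookup_hash(table, key):
    """Unique index: a single hash probe, O(1) on average."""
    return table[key]


def lookup_sorted(sorted_keys, key):
    """Sorted, possibly non-unique index: binary search finds the
    matching range in O(log N); listing k matches adds O(k)."""
    lo = bisect_left(sorted_keys, key)
    hi = bisect_right(sorted_keys, key)
    return list(range(lo, hi))


def lookup_scan(keys, key):
    """Unsorted, non-unique index: every key is checked, O(N)."""
    return [i for i, k in enumerate(keys) if k == key]


# Illustrative two-level keys; the values are row positions.
unique_index = {("bar", "one"): 0, ("bar", "two"): 1, ("baz", "one"): 2}
sorted_index = [("bar", "one"), ("bar", "one"), ("bar", "two"), ("baz", "two")]
random_index = [("baz", "two"), ("bar", "one"), ("baz", "two"), ("bar", "two")]

print(lookup_hash(unique_index, ("bar", "two")))    # 1
print(lookup_sorted(sorted_index, ("bar", "one")))  # [0, 1]
print(lookup_scan(random_index, ("baz", "two")))    # [0, 2]
```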

I came to this because I had a similar question. And I think I can understand why DataFrames.jl chooses to keep linear indexing always, but to me it seems strange that the columns by which you group the df remain columns in the groups. If I want to iterate over the "content columns", I *do not* want to iterate over the "index columns". Is there a neat way to do that? — KeithWM, Sep 17 '21 at 19:40
You can get a list of non-grouping columns using the `valuecols` function and a list of grouping columns using the `groupcols` function. We keep both because users often want to condition the operation they perform on the values of the grouping columns, and if we excluded them that would not be possible. This is similar to SQL: when you do `GROUP BY`, you still have access to the grouping columns. – Bogumił Kamiński, Sep 17 '21 at 21:31

score 1 · Answer 2 · answered Apr 14 '21 at 08:30

Do you mean something like this?

julia> # Initialise data structure
       a = [
         [1,2],
         [3,4,5]
       ]
2-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4, 5]

julia> # Do multilevel indexing

julia> a[1][1]
1

julia> a[2][3]
5

answered by gTcV

Thanks for the suggestion, this is an amazing approach. But I wonder whether I can use it as something comparable to `pandas.MultiIndex`? – Mohammad Saad Apr 14 '21 at 11:00
Maybe I will try to convert my data frame to a matrix and then use this approach. – Mohammad Saad Apr 14 '21 at 11:01
