
I need to group a dataframe by one variable and then summarise it by adding the number of rows (I can already do this) and one column for each of the .25, .5, .75 quantiles of another variable.

In R I would do e.g.:

    iris %>%
       group_by(Species) %>%
       summarise(
          quantile(Sepal.Length, c(.25, .75)) %>%
             matrix(nrow = 1) %>%
             as.data.frame() %>%
             setNames(paste0("Sepal.Length", c(.25, .75)))
    )

What would be a concise way to write this in Julia using DataFrames and DataFramesMeta? If there's a solution that applies this to multiple columns at once, even better.

The closest solution I could find in Julia was:

    groupby(iris, :Species) |>
       x -> combine(x, :SepalLength => x -> [[map(p -> quantile(x, p), (Q25 = 0.25, Q75 = 0.75))] |> DataFrame])

but it just wraps the resulting dataframe in a single cell, while it should spread it into multiple columns.

Bakaburg

1 Answer


This is the shortest I can currently propose:

    combine(groupby(iris, :Species), :SepalLength => (x -> (quantile(x, [0.25, 0.75]))') => [:q25, :q75])

or similarly

    combine(groupby(iris, :Species), :SepalLength => (x -> [quantile(x, [0.25, 0.75])]) => [:q25, :q75])

or

    combine(groupby(iris, :Species), :SepalLength .=> [x -> quantile(x, q) for q in [0.25, 0.75]] .=> [:q25, :q75])

That said, even your original solution seems a bit shorter than the R version. I would rewrite it as:

    combine(groupby(iris, :Species), :SepalLength => (x -> map(p -> quantile(x, p), (Q25=0.25, Q75=0.75))) => AsTable)

which seems a bit cleaner.

Now if you wanted to process multiple columns you could do (BTW - how would you do this in R?):

    combine(groupby(iris, :Species), [n => (x -> (quantile(x, [0.25, 0.75]))') => [n*"_q25", n*"_q75"]
                                      for n in ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]])
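
The multi-column pattern above can be sketched in a self-contained form. This is only an illustration under stated assumptions: it assumes DataFrames.jl 1.x, `df` is a small synthetic stand-in for `iris` with made-up values, and `quantile_cols` is a hypothetical helper name, not part of DataFrames.jl. It also adds the row count with `nrow`, which the question mentions.

```julia
using DataFrames, Statistics

# Synthetic stand-in for iris (values are made up for illustration).
df = DataFrame(
    Species     = repeat(["a", "b"], inner = 5),
    SepalLength = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
    SepalWidth  = [5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0],
)

# Hypothetical helper: builds one transformation per column. Each function
# returns a 1×2 matrix, which combine spreads into the two named columns.
quantile_cols(cols, ps = [0.25, 0.75]) =
    [c => (x -> quantile(x, ps)') => ["$(c)_q$(round(Int, 100p))" for p in ps]
     for c in cols]

# Row count per group plus the quantile columns for every requested variable.
result = combine(groupby(df, :Species),
                 nrow => :n,
                 quantile_cols(["SepalLength", "SepalWidth"]))
```

Passing a different `ps` vector, e.g. `[0.25, 0.5, 0.75]`, should add a median column per variable without other changes.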
Bogumił Kamiński
  • Hello! For some reason only solution 3 works for me. With `df = DataFrame(a = sample(1:2, 200), b = randn(200), c = randn(200))`, where `df` stands in for iris, `a` for Species, and `b` and `c` for the other variables, I get these errors, in the order of your examples: – Bakaburg Jun 07 '21 at 19:36
  • 1. ArgumentError: Unrecognized column selector: :b => (var"#5#6"() => [:q25, :q75]) 2. ArgumentError: Unrecognized column selector: :b => (var"#9#10"() => [:q25, :q75]) 3. ok 4. ArgumentError: Unrecognized column selector: :b => (var"#15#17"() => AsTable) 5. ArgumentError: Unrecognized column selector: "b" => (var"#24#26"() => ["b_q25", "b_q75"]) – Bakaburg Jun 07 '21 at 19:36
  • Are you using the current DataFrames.jl 1.1 release? It seems that you are on some older version. – Bogumił Kamiński Jun 07 '21 at 20:14
  • Ok, I'm new to Julia and I'm confused. My version of DataFrames is v0.21.8 and I can't update beyond that, either with Pkg.update() or by removing and reinstalling the package... – Bakaburg Jun 08 '21 at 09:35
  • You have some package installed that puts a restriction on the DataFrames.jl version. See https://bkamins.github.io/julialang/2020/05/11/package-version-restrictions.html for an explanation of how to find out which package causes that. Please let me know when you identify the issue, so that I can tell the package maintainer that the package requires an update. Having said that, the normal practice in Julia is to use a separate environment for each project, see https://pkgdocs.julialang.org/dev/getting-started/#Getting-Started-with-Environments. This should solve your issue. – Bogumił Kamiński Jun 08 '21 at 13:43
  • I tried to force install v1.1.1 and this is what I get: https://gist.github.com/bakaburg1/3cd096b363418e7278a5d6ef65c4cb26 It seems to be related to PooledArrays, of which I have v0.5.3, and JuliaDB, of which I have v0.13.0. – Bakaburg Jun 08 '21 at 17:48
  • You need to install `]add JuliaDB#main`, as the version of JuliaDB.jl that allows the latest version of PooledArrays.jl is not released yet. Also, in general, as commented above: it is probably better not to keep JuliaDB in the global package environment, but to have it only in a project-specific one where you need it (then you will not need the `#main` workaround). – Bogumił Kamiński Jun 08 '21 at 17:49
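
The per-project environment suggested in the comments could be set up roughly as follows. This is a minimal sketch, not a prescribed workflow: `project_dir` is a hypothetical location (a temporary folder here), and `Pkg.add` needs network access to the package registry.

```julia
using Pkg

# Hypothetical project directory; any empty folder of your own works too.
project_dir = mktempdir()

# Activate a fresh environment there. Packages in the global environment
# (e.g. JuliaDB) then no longer constrain which versions the resolver picks.
Pkg.activate(project_dir)

# Install DataFrames.jl into this environment only, then list what was resolved.
Pkg.add("DataFrames")
Pkg.status()
```

`Pkg.status()` should then report a DataFrames version chosen without the restrictions imposed by packages installed globally.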