4

I want to select all columns of a DataFrame in which the datatype is a subtype of Number. However, since there are columns with missing values, the numerical column datatypes can be something like Union{Missing, Int64}.

So far, I came up with:

using DataFrames

df = DataFrame([["a", "b"], [1, missing], [2, 5]])

df_numerical = df[typeintersect.(colwise(eltype, df), Number) .!= Union{}]

This yields the expected result.
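
For readers wondering why the `typeintersect` test works, here is a minimal sketch on bare element types, using only Base and no DataFrame. The names `col_types` and `has_number_part` are made up for illustration; the idea is that intersecting an element type with `Number` leaves `Union{}` exactly when the type admits no numbers.

```julia
# Hypothetical element types standing in for the columns of a data frame.
col_types = [String, Union{Missing, Int64}, Int64, Missing]

# A type "has a numeric part" if its intersection with Number is not empty.
# typeintersect(Union{Missing, Int64}, Number) is Int64, while
# typeintersect(String, Number) and typeintersect(Missing, Number) are Union{}.
has_number_part(T) = typeintersect(T, Number) !== Union{}

mask = map(has_number_part, col_types)
# mask == [false, true, true, false]
```

Note that this test rejects an all-`missing` column (element type `Missing`), but it would accept a column with element type `Any`, since `typeintersect(Any, Number)` is `Number`.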

Question

Is there a simpler, more idiomatic way of doing this? Possibly similar to:

df.select_dtypes(include=[np.number])

in pandas, as taken from an answer to this question?

TimD

2 Answers

5
julia> df[(<:).(eltypes(df),Union{Number,Missing})]
2×2 DataFrame
│ Row │ x2      │ x3 │
├─────┼─────────┼────┤
│ 1   │ 1       │ 2  │
│ 2   │ missing │ 5  │

Please note that the `.` is the broadcasting operator, so I had to use the `<:` operator in its functional form, `(<:)`, in order to apply it to each column's element type.
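
As a rough illustration of the broadcasting described above, the same predicate can be tried on a plain vector of types, without a DataFrame. The vector `col_types` is a hypothetical stand-in for what `eltypes(df)` returns.

```julia
# Hypothetical element types, standing in for eltypes(df).
col_types = [String, Union{Missing, Int64}, Int64]

# (<:) is the subtype operator used as an ordinary function; the dot
# broadcasts it over col_types while the Union type on the right is
# treated as a scalar.
mask = (<:).(col_types, Union{Number, Missing})

# The same mask written without broadcasting.
mask_map = map(T -> T <: Union{Number, Missing}, col_types)
# mask == mask_map == [false, true, true]
```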

Przemyslaw Szufel
  • This is a very nice solution, but it will also select a column such as `[missing, missing]`, even though the element type of this vector (`Missing`) does not allow numbers. – Bogumił Kamiński Sep 06 '18 at 17:58
  • I assumed that a data frame would not normally have columns that contain nothing but `missing` values. However, if you do, you can filter them out: `df[(<:).(eltypes(df),Union{Number,Missing}) .& (eltypes(df) .!= Missing)]` – Przemyslaw Szufel Sep 06 '18 at 19:34
3

Another way to do it could be:

df_numerical = df[[i for i in names(df) if Base.nonmissingtype(eltype(df[i])) <: Number]]

This retrieves all columns whose element type is a subtype of Number, regardless of whether they contain missing data.
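
A small sketch of how the `nonmissingtype` test behaves on bare types, using only Base. The helper names `is_numeric` and `is_numeric_strict` are hypothetical. One caveat worth checking in your own data: an all-`missing` column has element type `Missing`, `Base.nonmissingtype(Missing)` is `Union{}`, and `Union{}` is a subtype of every type, so such a column would pass this test too.

```julia
# The predicate from the answer above, applied to a type instead of a column.
is_numeric(T) = Base.nonmissingtype(T) <: Number

is_numeric(Union{Missing, Float64})  # true: Float64 <: Number
is_numeric(String)                   # false
is_numeric(Missing)                  # true: Union{} <: Number

# A stricter variant (an assumption, not part of the answer) that
# also rejects columns holding nothing but missing values.
is_numeric_strict(T) = T !== Missing && Base.nonmissingtype(T) <: Number

is_numeric_strict(Missing)                # false
is_numeric_strict(Union{Missing, Int64})  # true
```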

Antonello