Updating or adding multiple columns with pydatatable in the style of R data.table's .SDcols

Question

Given the iris data, I'd like to add a new column for every numeric column found. I can do this by explicitly listing each numeric column:

import datatable as dt
from datatable import fread, f, mean, update

iris_dt = fread("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
# Add one "distance from the column mean" column per numeric column
iris_dt[:, update(C0_dist_from_mean = dt.abs(f.C0 - mean(f.C0)),
                  C1_dist_from_mean = dt.abs(f.C1 - mean(f.C1)),
                  C2_dist_from_mean = dt.abs(f.C2 - mean(f.C2)),
                  C3_dist_from_mean = dt.abs(f.C3 - mean(f.C3)))]

But that way the column names are hard-coded. A more robust way is readily available in R's data.table using .SDcols:

library(data.table)
iris = fread("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
cols = names(sapply(iris, class)[sapply(iris, class)=='numeric'])
iris[, paste0(cols,"_dist_from_mean") := lapply(.SD, function(x) {abs(x-mean(x))}),
     .SDcols=cols]

Is there a way to take a similar approach with pydatatable today?

I know how to select all numeric columns in pydatatable, e.g. like this:

iris_dt[:, f[float]]

but it's the last part, the one that uses .SDcols in R, that eludes me.

sammywemmy · Accepted Answer · 2021-08-05T01:24:17.387

Build a dict with a comprehension that maps each new column name to its f-expression, then unpack it into update:

from datatable import f, update, abs, mean

aggs = {f"{col}_dist_from_mean" : abs(f[col] - mean(f[col])) 
        for col in iris_dt[:, f[float]].names}

iris_dt[:, update(**aggs)]
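
To see why unpacking the dict is equivalent to the hard-coded version in the question, here is a minimal pure-Python sketch. It does not use datatable at all: `fake_update` is a hypothetical stand-in that only records the keyword arguments it receives, the column names C0 to C3 are assumed to be the float columns of the iris file, and plain strings stand in for the f-expressions.

```python
def fake_update(**new_columns):
    """Hypothetical stand-in for datatable's update(): returns its kwargs as a dict."""
    return new_columns

# Assumed result of iris_dt[:, f[float]].names for the iris file
numeric_cols = ["C0", "C1", "C2", "C3"]

# Strings stand in for the real f-expressions, so the sketch runs without datatable
aggs = {f"{col}_dist_from_mean": f"abs(f.{col} - mean(f.{col}))"
        for col in numeric_cols}

# Unpacking the dict passes one keyword argument per new column ...
unpacked = fake_update(**aggs)

# ... which matches spelling every keyword argument out by hand
explicit = fake_update(
    C0_dist_from_mean="abs(f.C0 - mean(f.C0))",
    C1_dist_from_mean="abs(f.C1 - mean(f.C1))",
    C2_dist_from_mean="abs(f.C2 - mean(f.C2))",
    C3_dist_from_mean="abs(f.C3 - mean(f.C3))",
)

assert unpacked == explicit
```

Because the keys become the keyword names, the new column names only need to be valid Python identifiers for the explicit form; the unpacked form has no such restriction in plain Python.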

UPDATE:

Using the Type properties available in v1.1, an alternative approach is:

import datatable as dt

# Keep only the columns whose Type is a float type
aggs = {f"{col}_dist_from_mean" : dt.math.abs(f[col] - f[col].mean())
        for col, col_type
        in zip(iris_dt.names, iris_dt.types)
        if col_type.is_float}

You could also break this into separate steps:

Create a Frame with the calculated values:

expression = f[float]-f[float].mean()
expression = dt.math.abs(expression)

compute = iris_dt[:, expression]

Rename the column labels for compute:

compute.names = [f"{name}_dist_from_mean" for name in compute.names]

Update iris_dt with compute (you could also use cbind here):

iris_dt[:, update(**compute)]
