How to apply a function to multiple columns of a polars DataFrame in Rust

Question

I'd like to apply a user-define function which takes a few inputs (corresponding some columns in a polars DataFrame) to some columns of a polars DataFrame in Rust. The pattern that I'm using is as below. I wonder is this the best practice?

fn my_filter_func(col1: &Series, col2: &Series, col2: &Series) -> ReturnType {
    let it = (0..n).map(|i| {
        let col1 = match col.get(i) {
            AnyValue::UInt64(val) => val,
            _ => panic!("Wrong type of col1!"),
        };
        // similar for col2 and col3
        // apply user-defined function to col1, col2 and col3
    }
    // convert it to a collection of the required type
}

If your code is working most probably this is better suited for : https://codereview.stackexchange.com/ — Netwave, May 25 '22 at 07:31
this might help https://stackoverflow.com/questions/70386839/writing-expression-in-polars-lazy-in-rust — Anatoly Bugakov, May 26 '22 at 17:59

ritchie46 · Answer 1 · 2022-05-26T06:28:18.187

5

You can downcast the Series to the proper type you want to iterate over, and then use rust iterators to apply your logic.

fn my_black_box_function(a: f32, b: f32) -> f32 {
    // do something
    a
}

fn apply_multiples(col_a: &Series, col_b: &Series) -> Float32Chunked {
    match (col_a.dtype(), col_b.dtype()) {
        (DataType::Float32, DataType::Float32) => {
            let a = col_a.f32().unwrap();
            let b = col_b.f32().unwrap();

            a.into_iter()
                .zip(b.into_iter())
                .map(|(opt_a, opt_b)| match (opt_a, opt_b) {
                    (Some(a), Some(b)) => Some(my_black_box_function(a, b)),
                    _ => None,
                })
                .collect()
        }
        _ => panic!("unpexptected dtypes"),
    }
}

Lazy API

You don't have to leave the lazy API to be able to access my_black_box_function.

We can collect the columns we want to apply in a Struct data type and then apply a closure over that Series.

fn apply_multiples(lf: LazyFrame) -> Result<DataFrame> {
    df![
        "a" => [1.0, 2.0, 3.0],
        "b" => [3.0, 5.1, 0.3]
    ]?
    .lazy()
    .select([concat_lst(["col_a", "col_b"]).map(
        |s| {
            let ca = s.struct_()?;

            let b = ca.field_by_name("col_a")?;
            let a = ca.field_by_name("col_b")?;
            let a = a.f32()?;
            let b = b.f32()?;

            let out: Float32Chunked = a
                .into_iter()
                .zip(b.into_iter())
                .map(|(opt_a, opt_b)| match (opt_a, opt_b) {
                    (Some(a), Some(b)) => Some(my_black_box_function(a, b)),
                    _ => None,
                })
                .collect();

            Ok(out.into_series())
        },
        GetOutput::from_type(DataType::Float32),
    )])
    .collect()
}

edited May 26 '22 at 06:28

answered May 26 '22 at 06:08

ritchie46

10,405
1
24
43

Does the lazy API loads rows when needed or does it try to load the whole file at once? It seems to me that columns are loaded and converted to ChunkedArrays, which means that all rows are loaded into memory? – Benjamin Du May 26 '22 at 15:41
I don't think concat_lst is defined anywhere – Anatoly Bugakov May 26 '22 at 17:58
1

Not all feature flags are activated by default. It does exist, but you need to activate the `list` feature. – ritchie46 May 26 '22 at 18:19
If the whole file is loaded into memory any way, there's really no advantage of using LazyFrame APIs in this case, right? – Benjamin Du May 27 '22 at 01:51
Your question does not mention a file. That would be another question. – ritchie46 May 27 '22 at 08:11
Second example returns an error Err(SchemaMisMatch(Owned("Series of dtype: List(Float64) != Struct",),),) – Ilya Dyachenko Jan 11 '23 at 03:30
1

Instead of concat_lst need to use as_struct – Ilya Dyachenko Jan 11 '23 at 04:29
Yep, need to use `as_struct(["col_a", "col_b"])` and also enable feature `dtype-struct` – BallpointBen Mar 06 '23 at 18:24

Anatoly Bugakov · Answer 2 · 2022-05-27T09:43:27.493

The solution I found working for me is with map_multiple(my understanding - this to be used if no groupby/agg) or apply_multiple(my understanding - whenerver you have groupby/agg). Alternatively, you could also use map_many or apply_many. See below.

use polars::prelude::*;
use polars::df;

fn main() {
    let df = df! [
        "names" => ["a", "b", "a"],
        "values" => [1, 2, 3],
        "values_nulls" => [Some(1), None, Some(3)],
        "new_vals" => [Some(1.0), None, Some(3.0)]
    ].unwrap();

    println!("{:?}", df);

    //df.try_apply("values_nulls", |s: &Series| s.cast(&DataType::Float64)).unwrap();

    let df = df.lazy()
        .groupby([col("names")])
        .agg( [
            total_delta_sens().sum()
        ]
        );

    println!("{:?}", df.collect());
}

pub fn total_delta_sens () -> Expr {
    let s: &mut [Expr] = &mut [col("values"), col("values_nulls"),  col("new_vals")];

    fn sum_fa(s: &mut [Series])->Result<Series>{
        let mut ss = s[0].cast(&DataType::Float64).unwrap().fill_null(FillNullStrategy::Zero).unwrap().clone();
        for i in 1..s.len(){
            ss = ss.add_to(&s[i].cast(&DataType::Float64).unwrap().fill_null(FillNullStrategy::Zero).unwrap()).unwrap();
        }
        Ok(ss) 
    }

    let o = GetOutput::from_type(DataType::Float64);
    map_multiple(sum_fa, s, o)
}

Here total_delta_sens is just a wrapper function for convenience. You don't have to use it.You can do directly this within your .agg([]) or .with_columns([]) : lit::<f64>(0.0).map_many(sum_fa, &[col("norm"), col("uniform")], o)

Inside sum_fa you can as Richie already mentioned downcast to ChunkedArray and .iter() or even .par_iter() Hope that helps

Claudio Fsr · Answer 3 · 2023-06-26T13:36:09.210

This example uses Rust Polars version = "0.30"

Apply the same function on the selected columns:

lazyframe
.with_columns([
    cols(col_name1, col_name2, ..., col_nameN)
   .apply(|series| 
       some_function(series), 
       GetOutput::from_type(DataType::Float64)
   )
]);

Or apply many functions to many columns with a much more flexible and powerful method:

lazyframe
.with_columns([
    some_function1(col_name1, col_name2, ..., col_nameN), 
    some_function2(col_name1, col_name2, ..., col_nameM),
    some_functionN(col_name1, col_name2, ..., col_nameZ),
]);

The Cargo.toml:

[dependencies]
polars = { version = "0.30", features = [
    "lazy", # Lazy API
    "round_series", # round underlying float types of Series
] }

And the main() function:

use std::error::Error;

use polars::{
    prelude::*,
    datatypes::DataType,
};

fn main()-> Result<(), Box<dyn Error>> {

    let dataframe01: DataFrame = df!(
        "integers"  => &[1, 2, 3, 4, 5, 6],
        "float64 A" => [23.654, 0.319, 10.0049, 89.01999, -3.41501, 52.0766],
        "options"   => [Some(28), Some(300), None, Some(2), Some(-30), None],
        "float64 B" => [9.9999, 0.399, 10.0061, 89.0105, -3.4331, 52.099999],
    )?;

    println!("dataframe01: {dataframe01}\n");

    // let selected: Vec<&str> = vec!["float64 A", "float64 B"];

    // Example 1:
    // Format only the columns with float64
    // input: two columns --> output: two columns

    let lazyframe: LazyFrame = dataframe01
        .lazy()
        .with_columns([
            //cols(selected)
            all()
            .apply(|series| 
                round_float64_columns(series, 2),
                GetOutput::same_type()
             )
         ]);

    let dataframe02: DataFrame = lazyframe.clone().collect()?;

    println!("dataframe02: {dataframe02}\n");

    let series_a: Series = Series::new("float64 A", &[23.65, 0.32, 10.00, 89.02, -3.42, 52.08]);
    let series_b: Series = Series::new("float64 B", &[10.00,  0.4, 10.01, 89.01, -3.43, 52.1]);

    assert_eq!(dataframe02.column("float64 A")?, &series_a);
    assert_eq!(dataframe02.column("float64 B")?, &series_b);

    // Example 2:
    // input1: two columns --> output: one new column
    // input2: one column  --> output: one new column
    // input3: two columns --> output: one new column

    let lazyframe: LazyFrame = lazyframe
        .with_columns([
            apuracao1("float64 A", "float64 B", "New Column 1"),
            apuracao2("float64 A", "New Column 2"),
            (col("integers") * lit(10) + col("options")).alias("New Column 3"),
         ]);

    println!("dataframe03: {}\n", lazyframe.collect()?);

    Ok(())
}

pub fn round_float64_columns(series: Series, decimals: u32) -> Result<Option<Series>, PolarsError> {
    match series.dtype() {
        DataType::Float64 => Ok(Some(series.round(decimals)?)),
        _ => Ok(Some(series))
    }
}

fn apuracao1(name_a: &str, name_b: &str, new: &str) -> Expr {
    (col(name_a) * col(name_b) / lit(100))
    //.over("some_group")
    .alias(new)
}

fn apuracao2(name_a: &str, new: &str) -> Expr {
    (lit(10) * col(name_a) - lit(2))
    //.over("some_group")
    .alias(new)
}

The output:

dataframe01: shape: (6, 4)
┌──────────┬───────────┬─────────┬───────────┐
│ integers ┆ float64 A ┆ options ┆ float64 B │
│ ---      ┆ ---       ┆ ---     ┆ ---       │
│ i32      ┆ f64       ┆ i32     ┆ f64       │
╞══════════╪═══════════╪═════════╪═══════════╡
│ 1        ┆ 23.654    ┆ 28      ┆ 9.9999    │
│ 2        ┆ 0.319     ┆ 300     ┆ 0.399     │
│ 3        ┆ 10.0049   ┆ null    ┆ 10.0061   │
│ 4        ┆ 89.01999  ┆ 2       ┆ 89.0105   │
│ 5        ┆ -3.41501  ┆ -30     ┆ -3.4331   │
│ 6        ┆ 52.0766   ┆ null    ┆ 52.099999 │
└──────────┴───────────┴─────────┴───────────┘

dataframe02: shape: (6, 4)
┌──────────┬───────────┬─────────┬───────────┐
│ integers ┆ float64 A ┆ options ┆ float64 B │
│ ---      ┆ ---       ┆ ---     ┆ ---       │
│ i32      ┆ f64       ┆ i32     ┆ f64       │
╞══════════╪═══════════╪═════════╪═══════════╡
│ 1        ┆ 23.65     ┆ 28      ┆ 10.0      │
│ 2        ┆ 0.32      ┆ 300     ┆ 0.4       │
│ 3        ┆ 10.0      ┆ null    ┆ 10.01     │
│ 4        ┆ 89.02     ┆ 2       ┆ 89.01     │
│ 5        ┆ -3.42     ┆ -30     ┆ -3.43     │
│ 6        ┆ 52.08     ┆ null    ┆ 52.1      │
└──────────┴───────────┴─────────┴───────────┘

dataframe03: shape: (6, 7)
┌──────────┬───────────┬─────────┬───────────┬──────────────┬──────────────┬──────────────┐
│ integers ┆ float64 A ┆ options ┆ float64 B ┆ New Column 1 ┆ New Column 2 ┆ New Column 3 │
│ ---      ┆ ---       ┆ ---     ┆ ---       ┆ ---          ┆ ---          ┆ ---          │
│ i32      ┆ f64       ┆ i32     ┆ f64       ┆ f64          ┆ f64          ┆ i32          │
╞══════════╪═══════════╪═════════╪═══════════╪══════════════╪══════════════╪══════════════╡
│ 1        ┆ 23.65     ┆ 28      ┆ 10.0      ┆ 2.365        ┆ 234.5        ┆ 38           │
│ 2        ┆ 0.32      ┆ 300     ┆ 0.4       ┆ 0.00128      ┆ 1.2          ┆ 320          │
│ 3        ┆ 10.0      ┆ null    ┆ 10.01     ┆ 1.001        ┆ 98.0         ┆ null         │
│ 4        ┆ 89.02     ┆ 2       ┆ 89.01     ┆ 79.236702    ┆ 888.2        ┆ 42           │
│ 5        ┆ -3.42     ┆ -30     ┆ -3.43     ┆ 0.117306     ┆ -36.2        ┆ 20           │
│ 6        ┆ 52.08     ┆ null    ┆ 52.1      ┆ 27.13368     ┆ 518.8        ┆ null         │
└──────────┴───────────┴─────────┴───────────┴──────────────┴──────────────┴──────────────┘

How to apply a function to multiple columns of a polars DataFrame in Rust

3 Answers3

Lazy API

Linked