1

In Rust Polars(might apply to python pandas as well) assigning values in a new column with a complex logic involving values of other columns can be achieved in two ways. The default way is using a nested WhenThen expression. Another way to achieve same thing is with LeftJoin. Naturally I would expect When Then to be much faster than Join, but it is not the case. In this example, When Then is 6 times slower than Join. Is that actually expected? Am I using When Then wrong?

In this example the goal is to assign weights/multipliers column based on three other columns: country, city and bucket.

use std::collections::HashMap;

use polars::prelude::*;
use rand::{distributions::Uniform, Rng}; // 0.6.5

pub fn bench() {
    // PREPARATION
    // This MAP is to be used for Left Join
    let mut weights = df![
        "country"=>vec!["UK"; 5],
        "city"=>vec!["London"; 5],
        "bucket" => ["1","2","3","4","5"],
        "weights" => [0.1, 0.2, 0.3, 0.4, 0.5]
    ].unwrap().lazy();
    weights = weights.with_column(concat_lst([col("weights")]).alias("weihts"));

    // This MAP to be used in When.Then
    let weight_map = bucket_weight_map(&[0.1, 0.2, 0.3, 0.4, 0.5], 1);


    // Generate the DataSet itself
    let mut rng = rand::thread_rng();
    let range = Uniform::new(1, 5);
    let b: Vec<String> = (0..10_000_000).map(|_| rng.sample(&range).to_string()).collect();
    let rc = vec!["UK"; 10_000_000];
    let rf = vec!["London"; 10_000_000];
    let val = vec![1; 10_000_000];
    let frame = df!(
        "country" => rc,
        "city" => rf,
        "bucket" => b,
        "val" => val,
    ).unwrap().lazy();

    // Test with Left Join
    use std::time::Instant;
    let now = Instant::now();
    let r = frame.clone()
        .join(weights, [col("country"), col("city"), col("bucket")], [col("country"), col("city"), col("bucket")], JoinType::Left)
        .collect().unwrap();
    let elapsed = now.elapsed();
    println!("Left Join took: {:.2?}", elapsed);

    // Test with nested When Then
    let now = Instant::now();
    let r1 = frame.clone().with_column(
        when(col("country").eq(lit("UK")))
            .then(
                when(col("city").eq(lit("London")))
                .then(rf_rw_map(col("bucket"),weight_map,NULL.lit()))
                .otherwise(NULL.lit())
            )
            .otherwise(NULL.lit())
        )
        .collect().unwrap();
    let elapsed = now.elapsed();
    println!("Chained When Then: {:.2?}", elapsed);

    // Check results are identical
    dbg!(r.tail(Some(10)));
    dbg!(r1.tail(Some(10)));
}

/// All this does is building a chained When().Then().Otherwise()
fn rf_rw_map(col: Expr, map: HashMap<String, Expr>, other: Expr) -> Expr {
    // buf is a placeholder
    let mut it = map.into_iter();
    let (k, v) = it.next().unwrap(); //The map will have at least one value

    let mut buf = when(lit::<bool>(false)) // buffer WhenThen
        .then(lit::<f64>(0.).list()) // buffer WhenThen, needed to "chain on to"
        .when(col.clone().eq(lit(k)))
        .then(v);

    for (k, v) in it {
        buf = buf
            .when(col.clone().eq(lit(k)))
            .then(v);
    }
    buf.otherwise(other)
}

fn bucket_weight_map(arr: &[f64], ntenors: u8) -> HashMap<String, Expr> {
    let mut bucket_weights: HashMap<String, Expr> = HashMap::default();
    for (i, n) in arr.iter().enumerate() {
        let j = i + 1;
        bucket_weights.insert(
            format!["{j}"],
            Series::from_vec("weight", vec![*n; ntenors as usize])
                .lit()
                .list(),
        );
    }
    bucket_weights
}

The result is surprising to me: Left Join took: 561.26ms vs Chained When Then: 3.22s

Thoughts?

UPDATE

This does not make much difference. Nested WhenThen is still over 3s

// Test with nested When Then
    let now = Instant::now();
    let r1 = frame.clone().with_column(
        when(col("country").eq(lit("UK")).and(col("city").eq(lit("London"))))
            .then(rf_rw_map(col("bucket"),weight_map,NULL.lit()))
            .otherwise(NULL.lit())
        )
        .collect().unwrap();
    let elapsed = now.elapsed();
    println!("Chained When Then: {:.2?}", elapsed);
Anatoly Bugakov
  • 772
  • 1
  • 7
  • 18

1 Answers1

1

The joins are one of the most optimized algorithms in polars. A left join will be executed fully in parallel and has many performance related fast paths. If you want to combine data based on equality, you should almost always choose a join.

ritchie46
  • 10,405
  • 1
  • 24
  • 43