In Rust Polars(might apply to python pandas as well) assigning values in a new column with a complex logic involving values of other columns can be achieved in two ways. The default way is using a nested WhenThen expression. Another way to achieve same thing is with LeftJoin. Naturally I would expect When Then to be much faster than Join, but it is not the case. In this example, When Then is 6 times slower than Join. Is that actually expected? Am I using When Then wrong?
In this example the goal is to assign weights/multipliers column based on three other columns: country, city and bucket.
use std::collections::HashMap;
use polars::prelude::*;
use rand::{distributions::Uniform, Rng}; // 0.6.5
pub fn bench() {
// PREPARATION
// This MAP is to be used for Left Join
let mut weights = df![
"country"=>vec!["UK"; 5],
"city"=>vec!["London"; 5],
"bucket" => ["1","2","3","4","5"],
"weights" => [0.1, 0.2, 0.3, 0.4, 0.5]
].unwrap().lazy();
weights = weights.with_column(concat_lst([col("weights")]).alias("weihts"));
// This MAP to be used in When.Then
let weight_map = bucket_weight_map(&[0.1, 0.2, 0.3, 0.4, 0.5], 1);
// Generate the DataSet itself
let mut rng = rand::thread_rng();
let range = Uniform::new(1, 5);
let b: Vec<String> = (0..10_000_000).map(|_| rng.sample(&range).to_string()).collect();
let rc = vec!["UK"; 10_000_000];
let rf = vec!["London"; 10_000_000];
let val = vec![1; 10_000_000];
let frame = df!(
"country" => rc,
"city" => rf,
"bucket" => b,
"val" => val,
).unwrap().lazy();
// Test with Left Join
use std::time::Instant;
let now = Instant::now();
let r = frame.clone()
.join(weights, [col("country"), col("city"), col("bucket")], [col("country"), col("city"), col("bucket")], JoinType::Left)
.collect().unwrap();
let elapsed = now.elapsed();
println!("Left Join took: {:.2?}", elapsed);
// Test with nested When Then
let now = Instant::now();
let r1 = frame.clone().with_column(
when(col("country").eq(lit("UK")))
.then(
when(col("city").eq(lit("London")))
.then(rf_rw_map(col("bucket"),weight_map,NULL.lit()))
.otherwise(NULL.lit())
)
.otherwise(NULL.lit())
)
.collect().unwrap();
let elapsed = now.elapsed();
println!("Chained When Then: {:.2?}", elapsed);
// Check results are identical
dbg!(r.tail(Some(10)));
dbg!(r1.tail(Some(10)));
}
/// All this does is building a chained When().Then().Otherwise()
fn rf_rw_map(col: Expr, map: HashMap<String, Expr>, other: Expr) -> Expr {
// buf is a placeholder
let mut it = map.into_iter();
let (k, v) = it.next().unwrap(); //The map will have at least one value
let mut buf = when(lit::<bool>(false)) // buffer WhenThen
.then(lit::<f64>(0.).list()) // buffer WhenThen, needed to "chain on to"
.when(col.clone().eq(lit(k)))
.then(v);
for (k, v) in it {
buf = buf
.when(col.clone().eq(lit(k)))
.then(v);
}
buf.otherwise(other)
}
fn bucket_weight_map(arr: &[f64], ntenors: u8) -> HashMap<String, Expr> {
let mut bucket_weights: HashMap<String, Expr> = HashMap::default();
for (i, n) in arr.iter().enumerate() {
let j = i + 1;
bucket_weights.insert(
format!["{j}"],
Series::from_vec("weight", vec![*n; ntenors as usize])
.lit()
.list(),
);
}
bucket_weights
}
The result is surprising to me: Left Join took: 561.26ms
vs Chained When Then: 3.22s
Thoughts?
UPDATE
This does not make much difference. Nested WhenThen is still over 3s
// Test with nested When Then
let now = Instant::now();
let r1 = frame.clone().with_column(
when(col("country").eq(lit("UK")).and(col("city").eq(lit("London"))))
.then(rf_rw_map(col("bucket"),weight_map,NULL.lit()))
.otherwise(NULL.lit())
)
.collect().unwrap();
let elapsed = now.elapsed();
println!("Chained When Then: {:.2?}", elapsed);