
I am trying to replace the null values in a Polars categorical series with a literal string. My solution worked in an older version of Polars, before categoricals started using the global string cache by default.

let data = "Name
ONE
TWO
THREE
FOUR

FIVE";

    let schema = Schema::new()
        .insert_index(0, "Name".to_string(), DataType::Categorical(None))
        .unwrap();

    let buff = std::io::Cursor::new(data);

    // toggle_string_cache(true); // this is the line that fixes the error

    let frame = CsvReader::new(buff)
        .infer_schema(Some(0))
        .with_dtypes(Some(&schema))
        .finish()
        .unwrap();

    println!("{:?}", frame);
    // ┌───────┐
    // │ Name  │
    // │ ---   │
    // │ cat   │
    // ╞═══════╡
    // │ ONE   │
    // ├╌╌╌╌╌╌╌┤
    // │ TWO   │
    // ├╌╌╌╌╌╌╌┤
    // │ THREE │
    // ├╌╌╌╌╌╌╌┤
    // │ FOUR  │
    // ├╌╌╌╌╌╌╌┤
    // │ null  │
    // ├╌╌╌╌╌╌╌┤
    // │ FIVE  │
    // └───────┘

    // toggle_string_cache(true); // enabling the cache here instead does not work as expected

    let null_filled_frame = frame
        .clone()
        .lazy()
        .with_column(
            when(col("Name").is_null())
                .then(lit("Missing"))
                .otherwise(col("Name"))
                .alias("Name"),
        )
        .collect()
        .unwrap();

    println!("{:?}", null_filled_frame);

    // ┌─────────┐ // the expected result
    // │ Name    │
    // │ ---     │
    // │ cat     │
    // ╞═════════╡
    // │ ONE     │
    // ├╌╌╌╌╌╌╌╌╌┤
    // │ TWO     │
    // ├╌╌╌╌╌╌╌╌╌┤
    // │ THREE   │
    // ├╌╌╌╌╌╌╌╌╌┤
    // │ FOUR    │
    // ├╌╌╌╌╌╌╌╌╌┤
    // │ Missing │
    // ├╌╌╌╌╌╌╌╌╌┤
    // │ FIVE    │
    // └─────────┘

The Polars CSV reader enables the global string cache while it reads categorical columns and disables it afterwards. So it makes sense that I run into problems when I later try to change the categorical data: I get an error saying that I can't mix categorical data created under a global cache with non-cached categorical data. Enabling the global string cache before reading the CSV lets me modify the data as expected.

My central question is why I cannot instead re-enable the global cache after the CSV has been read but before performing the replacement.

When I do this, I get the error message "The two categorical arrays are not created under the same global string cache. They cannot be merged."

This is strange, because it implies that more than one global string cache is in use when I do it this way, which does not make sense to me. Could someone explain?
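To make the puzzle concrete, here is a toy model of one possible explanation. It is not Polars' implementation. It assumes that turning the cache off discards it and that turning it back on starts a fresh cache with a new identity, so arrays built before and after the restart no longer match. `ToyCache`, `ToyCategorical`, and `merge` are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a global string cache (not a Polars type).
struct ToyCache {
    id: u64,                   // identity of the current cache "generation"
    map: HashMap<String, u32>, // string -> category code
    enabled: bool,
}

/// Hypothetical categorical array that remembers which cache built it.
#[derive(Debug)]
struct ToyCategorical {
    cache_id: u64,
    codes: Vec<u32>,
}

impl ToyCache {
    fn new() -> Self {
        ToyCache { id: 0, map: HashMap::new(), enabled: false }
    }

    /// Assumption: switching the cache back on starts a new, empty generation.
    fn toggle(&mut self, on: bool) {
        if on && !self.enabled {
            self.id += 1;
            self.map.clear();
        }
        self.enabled = on;
    }

    /// Encode strings as category codes under the current generation.
    fn build(&mut self, values: &[&str]) -> ToyCategorical {
        let mut codes = Vec::with_capacity(values.len());
        for v in values {
            let next = self.map.len() as u32;
            let code = *self.map.entry(v.to_string()).or_insert(next);
            codes.push(code);
        }
        ToyCategorical { cache_id: self.id, codes }
    }
}

/// Codes are only comparable when both arrays came from the same generation.
fn merge(a: &ToyCategorical, b: &ToyCategorical) -> Result<Vec<u32>, String> {
    if a.cache_id != b.cache_id {
        return Err("not created under the same global string cache".to_string());
    }
    Ok(a.codes.iter().chain(b.codes.iter()).copied().collect())
}

fn main() {
    let mut cache = ToyCache::new();

    // Failing order: the "reader" turns the cache on and off, then it is re-enabled.
    cache.toggle(true);
    let names = cache.build(&["ONE", "TWO"]);
    cache.toggle(false);
    cache.toggle(true);
    let fill = cache.build(&["Missing"]);
    println!("re-enabled after read: {:?}", merge(&names, &fill));

    // Working order: one generation covers both the read and the replacement.
    let mut cache = ToyCache::new();
    cache.toggle(true);
    let names = cache.build(&["ONE", "TWO"]);
    let fill = cache.build(&["Missing"]);
    println!("enabled before read: {:?}", merge(&names, &fill));
}
```

In this model there is only ever one cache alive at a time, yet two arrays can still disagree about which cache they belong to. That would be consistent with the error wording, but whether Polars behaves this way should be checked against its source or documentation.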

Kival M
