Selecting with Indexing is an anti-pattern in Polars: How to parse and transform (select/filter?) a CSV that seems to require so?

Question

I would like to read the following (quite broken, IMO) CSV with Pola-rs, coming from a Rigol MSO5000 oscilloscope:

D7-D0,D15-D8,t0 = -25.01s, tInc = 2e-06,
+2.470000E+02,2.000000E+00,,
+1.590000E+02,1.600000E+01,,
+2.400000E+02,2.000000E+00,,
+2.470000E+02,+1.300000E+02,,
+1.590000E+02,1.800000E+01,,
+2.470000E+02,+1.300000E+02,,
9.500000E+01,1.800000E+01,,
9.500000E+01,1.800000E+01,,
+2.400000E+02,0.000000E+00,,
(...)

Here's my current Jupyter Notebook iteration/attempt before finding out that Selecting with indexing is discouraged in Pola-rs:

import polars as pl

df = pl.read_csv("normal0.csv")

# Grab initial condition and increments
t0 = df.columns[2]; assert "t0" in t0; t0 = float(t0.split('=')[1].replace('s', '').strip())
tinc = df.columns[3]; assert "tInc" in tinc; tinc = float(tinc.split('=')[1].strip())

# TODO: Generate Series() from t0+tinc and include it in the final DataFrame

# Reshape and cleanup
probes = df.with_column(df["D7-D0"].cast(pl.Float32).alias("D0-D7 floats")) \
             .with_column(df["D15-D8"].cast(pl.Float32).alias("D15-D8 floats")) \
             .drop(df.columns[2]) \
             .drop(df.columns[3])
(...)

def split_probes(probes: pl.Series):
    ''' Splits bundles of probe cables such as D0-D7 or D15-D8 into individual dataframe columns
    '''
    out = pl.DataFrame(columns=["D"+str(line) for line in range(0,16)])
#    for row in range(probes.height):
#        for probe in range(0, 7):
#            out["D"+str(probe)].with_columns(probes["D0-D7 floats"][row % (probe + 1)])
#         for probe in reversed(range(9, 16)): # TODO: Fix future captures or parametrise this
#             outprobes["D15-D8 floats"][row % probe]

And here's my lower-level CSV parsing Rust pseudocode approach when I was told on Polars Discord that this problem might not be optimally solvable with dataframe libraries:

use csv;
use std::error::Error;
use std::io;
use std::process;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct OrigOscilloscope {
    #[serde(rename = "D7-D0")]
    d7_d0: String, // TODO: Unfortunately those fields are "user-flippable" in order from the scope, i.e: d0_d7 vs d7_d0
    #[serde(rename = "D15-D8")]
    d15_d8: String,
    // Do not even register those on Serde as they are empty rows anyway
    // t0: Option<String>,
    // t_inc: Option<String>
}

#[derive(Debug, Deserialize)]
struct LAProbesOscilloscopeState {
    //Vec<d0...d15>: Vec<Vec<16*f32>>, // TODO: More appropriate struct representation for the target dataframe
    d0: f32,
    d1: f32,
    d2: f32,
    d3: f32,
    d4: f32,
    d5: f32,
    d6: f32,
    d7: f32,
    d8: f32,
    d9: f32,
    d10: f32,
    d11: f32,
    d12: f32,
    d13: f32,
    d14: f32,
    d15: f32,
    timestamp: f32
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true) // ignore broken header
        .from_reader(io::stdin());

    // Get timeseries information from header... :facepalm: rigol
    // Initial timestamp...
    let header = rdr.headers()?.clone();
    let t0_header: Vec<&str> = header[2].split('=').collect();
    let t0 = t0_header[1].trim_start().replace('s', "").parse::<f32>()?;
    // ...and increments
    let tinc_header: Vec<&str> = header[3].split('=').collect();
    let tinc = tinc_header[1].trim_start().parse::<f32>()?;
    println!("Initial timestamp {t0} with increments of {tinc} seconds");

    // Now do the splitting of wires from its Dx-Dy "bundles"
    let mut timestamp = t0;
    for result in rdr.deserialize().skip(1) {
        let row: OrigOscilloscope = result?;

        // if rdr.position().line().rem_euclid(8) {
        // // Read D0-D15 field(s), expanding them into the current row, matching its column
        // }
        // println!("{:#?}", row.d7_d0.parse::<f32>()?);

        // update timestamp for this row
        timestamp = timestamp + tinc;
    }
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

I hope that it's clear that the target dataframe desired would look like this:

d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15, timestamp
95, 95, 247, 159, 247, 240, 159, 247 (...)                            -25.01+000000.2
(...)

And the resulting code should (ideally?) be using Polars efficiently. Also, as mentioned above, avoiding "Selecting with Indexing" since it might be completely deprecated in the future in Polars.

You can use [`pl.col("D7-D0")`](https://pola-rs.github.io/polars-book/user-guide/notebooks/introduction_polars-py.html?highlight=col()#selection-context) to select columns. Does the Python code generate the correct output? — jqurious, Dec 18 '22 at 12:37
Ok, that'd select the entire column... and then? Can you elaborate on how to splice D0-D7 and D8-D15 into rows of 16 columns (plus timestamp on each of those new rows) without using Select with Indexing? — brainstorm, Dec 18 '22 at 12:40
Does the Python code currently do what you want? Or do you need help finishing the `split_probes()` function? The deprecated syntax is just the square brackets:`df["D7-D0"]` - you can replace those with `pl.col("...")` - the code will still work the same. — jqurious, Dec 18 '22 at 12:50
No, it doesn't do what I want, I'd need help finishing the split_probes() function and also attaching a (generated) timeseries to it all. — brainstorm, Dec 18 '22 at 13:08
@ritchie46 There's the desired output at the end of the post, let me know if it's not clear, happy to edit/elaborate. — brainstorm, Dec 18 '22 at 20:43

score 8 · Accepted Answer · answered Dec 18 '22 at 19:46

Perhaps you are looking for unstack (which is incredibly performant).

Let's start with this data (which simply replicates values that you provided). I've also changed the column names, just to make things easier to inspect:

import polars as pl
from io import StringIO

normal_csv = """D7-D0,D15-D8,t0 = -25.01s, tInc = 2e-06,
+2.470000E+02,2.000000E+00,,
+1.590000E+02,1.600000E+01,,
+2.400000E+02,2.000000E+00,,
+2.470000E+02,+1.300000E+02,,
+1.590000E+02,1.800000E+01,,
+2.470000E+02,+1.300000E+02,,
9.500000E+01,1.800000E+01,,
9.500000E+01,1.800000E+01,,
+2.470000E+02,2.000000E+00,,
+1.590000E+02,1.600000E+01,,
+2.400000E+02,2.000000E+00,,
+2.470000E+02,+1.300000E+02,,
+1.590000E+02,1.800000E+01,,
+2.470000E+02,+1.300000E+02,,
9.500000E+01,1.800000E+01,,
9.500000E+01,1.800000E+01,,"""


df = pl.read_csv(file=StringIO(normal_csv),
                 new_columns=('D', 'E', 't0', 'tInc'))
df

shape: (16, 4)
┌───────┬───────┬──────┬──────┐
│ D     ┆ E     ┆ t0   ┆ tInc │
│ ---   ┆ ---   ┆ ---  ┆ ---  │
│ f64   ┆ f64   ┆ str  ┆ str  │
╞═══════╪═══════╪══════╪══════╡
│ 247.0 ┆ 2.0   ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 16.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 240.0 ┆ 2.0   ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 18.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0  ┆ 18.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0  ┆ 18.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 2.0   ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 16.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 240.0 ┆ 2.0   ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 18.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0  ┆ 18.0  ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0  ┆ 18.0  ┆ null ┆ null │
└───────┴───────┴──────┴──────┘

Since I'm not familiar with your data source, I'm going to assume that your input data has a regularized pattern -- specifically that D7-D0 are always provided (i.e., no skipped lines in the csv file).

If so, here's some highly performant code that should get the ball rolling...

(
    df
    .reverse()
    .unstack(step=8,
             how='horizontal',
             columns=('D', 'E'),
             )
)

shape: (2, 16)
┌──────┬──────┬───────┬───────┬───────┬───────┬───────┬───────┬──────┬──────┬───────┬──────┬───────┬─────┬──────┬─────┐
│ D_0  ┆ D_1  ┆ D_2   ┆ D_3   ┆ D_4   ┆ D_5   ┆ D_6   ┆ D_7   ┆ E_0  ┆ E_1  ┆ E_2   ┆ E_3  ┆ E_4   ┆ E_5 ┆ E_6  ┆ E_7 │
│ ---  ┆ ---  ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---  ┆ ---  ┆ ---   ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ --- │
│ f64  ┆ f64  ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64  ┆ f64  ┆ f64   ┆ f64  ┆ f64   ┆ f64 ┆ f64  ┆ f64 │
╞══════╪══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪══════╪══════╪═══════╪══════╪═══════╪═════╪══════╪═════╡
│ 95.0 ┆ 95.0 ┆ 247.0 ┆ 159.0 ┆ 247.0 ┆ 240.0 ┆ 159.0 ┆ 247.0 ┆ 18.0 ┆ 18.0 ┆ 130.0 ┆ 18.0 ┆ 130.0 ┆ 2.0 ┆ 16.0 ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 95.0 ┆ 95.0 ┆ 247.0 ┆ 159.0 ┆ 247.0 ┆ 240.0 ┆ 159.0 ┆ 247.0 ┆ 18.0 ┆ 18.0 ┆ 130.0 ┆ 18.0 ┆ 130.0 ┆ 2.0 ┆ 16.0 ┆ 2.0 │
└──────┴──────┴───────┴───────┴───────┴───────┴───────┴───────┴──────┴──────┴───────┴──────┴───────┴─────┴──────┴─────┘

This doesn't have your desired column names (which you can change). Nor does it contain the timing information. But it may help you get started in the right direction.

Selecting with Indexing is an anti-pattern in Polars: How to parse and transform (select/filter?) a CSV that seems to require so?

1 Answers1