
I have a polars DataFrame with multiple numeric (float dtype) columns. I want to write some of them to a csv with a certain number of decimal places. The number of decimal places I want is column-specific.

polars offers pl.format:

import polars as pl

df = pl.DataFrame({"a": [1/3, 1/4, 1/7]})

df.select(
    [
        pl.format("as string {}", pl.col("a")),
        ]
    )

shape: (3, 1)
┌───────────────────────────────┐
│ literal                       │
│ ---                           │
│ str                           │
╞═══════════════════════════════╡
│ as string 0.3333333333333333  │
│ as string 0.25                │
│ as string 0.14285714285714285 │
└───────────────────────────────┘

However, if I add a format specifier to set the number of decimal places, it fails:

df.select(
    [
        pl.format("{:.3f}", pl.col("a")),
        ]
)

ValueError: number of placeholders should equal the number of arguments

Is there an option to have "real" f-string functionality without using an apply?

FObersteiner

2 Answers


What about using round?

Example:

df.select(
    [
        pl.format("as string {}", pl.col("a").round(3)),
        ]
    )

shape: (3, 1)
┌─────────────────┐
│ literal         │
│ ---             │
│ str             │
╞═════════════════╡
│ as string 0.333 │
│ as string 0.25  │
│ as string 0.143 │
└─────────────────┘
Luca

If the number of decimals were the same for all columns, float_precision on the write_csv method would be sufficient:

df = pl.DataFrame( {"colx": [1/3, 1/4, 1/7, 2]} )
print( df.write_csv( None,float_precision=3 ) )

# colx
# 0.333
# 0.250
# 0.143
# 2.000
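
The output above keeps trailing zeros, which is what a fixed-point format spec such as `.3f` produces, whereas rounding and then converting to a string drops them. A small plain-Python illustration (no polars involved, purely to show the difference between the two behaviours):

```python
values = [1/3, 1/4, 1/7, 2]

# Fixed-point format spec: always exactly 3 decimals, trailing zeros kept.
fixed = [format(v, ".3f") for v in values]

# Round first, then convert to string: trailing zeros are dropped.
rounded = [str(round(float(v), 3)) for v in values]

print(fixed)    # ['0.333', '0.250', '0.143', '2.000']
print(rounded)  # ['0.333', '0.25', '0.143', '2.0']
```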

Otherwise, you can use this (slightly ungainly) utility function to get per-column "float → string" rounding that keeps trailing zeros, and then export to CSV. If you don't need the trailing zeros, stick with @Luca's "round" approach, as it will be more performant:

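def round_str( col:str, n:int ):
    # Round, cast to str, pad with n zeros, then trim to exactly n decimals.
    # The optional "-?" lets negative values match the trim pattern too.
    return ( 
        pl.col( col ).round( n ).cast( str ) + pl.lit( "0"*n ) 
    ).str.replace( rf"^(-?\d+\.\d{{{n}}}).*$","$1" ).alias( col )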

Example:

df = pl.DataFrame(
    {
        "colx": [1/3, 1/4, 1/7, 2.00],
        "coly": [1/4, 1/5, 1/6, 1.00],
        "colz": [3/4, 7/8, 9/5, 0.09],
    }
).with_columns(
    round_str( "colx",5 ),
    round_str( "coly",3 ),
    round_str( "colz",1 ),
)
# ┌─────────┬───────┬──────┐
# │ colx    ┆ coly  ┆ colz │
# │ ---     ┆ ---   ┆ ---  │
# │ str     ┆ str   ┆ str  │
# ╞═════════╪═══════╪══════╡
# │ 0.33333 ┆ 0.250 ┆ 0.8  │
# │ 0.25000 ┆ 0.200 ┆ 0.9  │
# │ 0.14286 ┆ 0.167 ┆ 1.8  │
# │ 2.00000 ┆ 1.000 ┆ 0.1  │
# └─────────┴───────┴──────┘

print( df.write_csv(None) )

# colx,coly,colz
# 0.33333,0.250,0.8
# 0.25000,0.200,0.9
# 0.14286,0.167,1.8
# 2.00000,1.000,0.1

(Ideally the float_precision param on write_csv would allow a dict; something for the TODO list ;)

  • Thanks for looking into this! I've been using `float_precision` so far, but now the requirement arose to have that different by column, essentially to provide precision information for the data. I know, there are better ways to do that... Anyhow, I'd imagine that having a configurable float_precision kwarg could be helpful for others as well. – FObersteiner Mar 31 '23 at 06:17