11

I was trying to search whether there would be a way to change the dtypes for the strings with numbers easily. For example, the problem I face is as follows:

df = pl.Dataframe({"foo": 
    ["100CT pen", "pencils 250CT", "what 125CT soever", "this is a thing"]}
)

I could extract and create a new column named {"bar": ["100", "250", "125", ""]}. But then I couldn't find a handy function that converts this column to Int64 or float dtypes so that the result is [100, 250, 125, null].

Also, vice versa. Sometimes it would be useful to have a handy function that converts the column of [100, 250, 125, 0] to ["100", "250", "125", "0"]. Is it something that already exists?

pppery
  • 3,731
  • 22
  • 33
  • 46
momentlost
  • 111
  • 1
  • 1
  • 3

3 Answers3

18

The easiest way to accomplish this is with the cast expression.

String to Int/Float

To cast from a string to an integer (or float):

import polars as pl

df = pl.DataFrame({"bar": ["100", "250", "125", ""]})
df.with_column(pl.col('bar').cast(pl.Int64, strict=False).alias('bar_int'))
shape: (4, 2)
┌─────┬─────────┐
│ bar ┆ bar_int │
│ --- ┆ ---     │
│ str ┆ i64     │
╞═════╪═════════╡
│ 100 ┆ 100     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 250 ┆ 250     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 125 ┆ 125     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│     ┆ null    │
└─────┴─────────┘

A handy list of available datatypes is here. These are all aliased under polars, so you can refer to them easily (e.g., pl.UInt64).

For the data you describe, I recommend using strict=False to avoid having one mangled number among millions of records result in an exception that halts everything.

Int/Float to String

The same process can be used to convert numbers to strings - in this case, the utf8 datatype.

Let me modify your dataset slightly:

df = pl.DataFrame({"bar": [100.5, 250.25, 1250000, None]})
df.with_column(pl.col("bar").cast(pl.Utf8, strict=False).alias("bar_string"))
shape: (4, 2)
┌────────┬────────────┐
│ bar    ┆ bar_string │
│ ---    ┆ ---        │
│ f64    ┆ str        │
╞════════╪════════════╡
│ 100.5  ┆ 100.5      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 250.25 ┆ 250.25     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.25e6 ┆ 1250000.0  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null   ┆ null       │
└────────┴────────────┘

If you need more control over the formatting, you can use the apply method and Python's new f-string formatting.

df.with_column(
    pl.col("bar").apply(lambda x: f"This is ${x:,.2f}!").alias("bar_fstring")
)
shape: (4, 2)
┌────────┬────────────────────────┐
│ bar    ┆ bar_fstring            │
│ ---    ┆ ---                    │
│ f64    ┆ str                    │
╞════════╪════════════════════════╡
│ 100.5  ┆ This is $100.50!       │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 250.25 ┆ This is $250.25!       │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.25e6 ┆ This is $1,250,000.00! │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null   ┆ null                   │
└────────┴────────────────────────┘

I found this web page to be a handy reference for those unfamiliar with f-string formatting.

Community
  • 1
  • 1
9

As an addition to @cbilot 's answer.

You don't need to use slow python lambda functions to use special string formatting of expressions. Polars has a format function for this purpose:


df = pl.DataFrame({"bar": ["100", "250", "125", ""]})

df.with_columns([
    pl.format("This is {}!", pl.col("bar"))
])
shape: (4, 2)
┌─────┬──────────────┐
│ bar ┆ literal      │
│ --- ┆ ---          │
│ str ┆ str          │
╞═════╪══════════════╡
│ 100 ┆ This is 100! │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 250 ┆ This is 250! │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 125 ┆ This is 125! │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│     ┆ This is !    │
└─────┴──────────────┘

ritchie46
  • 10,405
  • 1
  • 24
  • 43
3

For other data manipulation in polars, like string to datetime, use strptime().

import polars as pl
df = pl.DataFrame(df_pandas)

df

shape: (100, 2)
┌────────────┬────────┐
│ dates_col  ┆ ticker │
│ ---        ┆ ---    │
│ str        ┆ str    │
╞════════════╪════════╡
│ 2022-02-25 ┆ RDW    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2008-05-28 ┆ ARTX   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2015-05-21 ┆ CBAT   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2009-02-09 ┆ ANNB   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤

Use it like this, converting the column to string:

df.with_column(pl.col("dates_col").str.strptime(pl.Datetime, fmt="%Y-%m-%d").cast(pl.Datetime))

shape: (100, 2)
┌─────────────────────┬────────┐
│ dates_col           ┆ ticker │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ str    │
╞═════════════════════╪════════╡
│ 2022-02-25 00:00:00 ┆ RDW    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2008-05-28 00:00:00 ┆ ARTX   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2015-05-21 00:00:00 ┆ CBAT   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2009-02-09 00:00:00 ┆ ANNB   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
Artur Dutra
  • 515
  • 3
  • 18