How to use numpy function to add polars dataframe column

Question

This is a continuation of my previous question. User glebcom helped me convert coordinates from a string to a list of float64 values. In the answer I found two methods for calculating the distance between coordinates:

using the formula numpy.linalg.norm(a-b)
using from scipy.spatial import distance: dst = distance.euclidean(a, b)

How can I apply one of these formulas to calculate the distance between the coordinates in columns c and d of a polars data frame?

import polars as pl
from scipy.spatial import distance
import numpy as np
pl.Config.set_fmt_str_lengths(2000)
data={"a": ["782.83    7363.51    6293    40   PD","850.68    7513.1    6262.17    40   PD"], "b": ["795.88    7462.65    6293    40   PD","1061.64    7486.08    6124.85    40   PD"]}
df=pl.DataFrame(data)
df=df.with_columns([
    pl.col("a").str.replace_all(r" +", " ")\
        .str.split(" ").arr.slice(0,3)\
        .cast(pl.List(pl.Float64)).alias("c"),\
    pl.col("b").str.replace_all(r" +", " ")\
        .str.split(" ").arr.slice(0,3)\
        .cast(pl.List(pl.Float64)).alias("d")\
])
print(df)

My attempts were:

df=df.with_columns(np.linalg.norm(pl.col("C")-pl.col("d")).alias("distance"))
or
df=df.with_columns(distance(pl.col("C"),pl.col("d")).alias("distance"))

but neither of them works. Thanks in advance for your assistance.

Artur

score 3 · Accepted Answer · answered Feb 06 '23 at 19:11

You won't be able to call numpy.linalg.norm directly on your polars data frame. It expects a numpy array of shape (N, n), where N is the number of points and n is the number of dimensions (here, 3).
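To make the shape requirement concrete, here is a minimal numpy-only sketch using the coordinates from the question. The arrays c and d are typed in by hand for illustration; they are not taken from the data frame. Note that without axis=1, norm would return a single number for the whole 2-D array instead of one distance per row.

```python
import numpy as np

# Coordinates from the question: one row per point, one column per axis.
c = np.array([[782.83, 7363.51, 6293.0], [850.68, 7513.1, 6262.17]])
d = np.array([[795.88, 7462.65, 6293.0], [1061.64, 7486.08, 6124.85]])

diffs = c - d                              # shape (2, 3), i.e. (N, n)
distances = np.linalg.norm(diffs, axis=1)  # one norm per row, shape (2,)

print(diffs.shape, distances.shape)
print(distances)
```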

You can prepare the data yourself, pass it to numpy, and put the results back into polars.

First, calculate the difference between the coordinates of your two points, across all 3 dimensions:

diffs = df.select(
    [
        (pl.col("c").arr.get(i) - pl.col("d").arr.get(i)).alias(f"diff_{i}")
        for i in range(3)
    ]
)

┌─────────┬────────┬────────┐
│ diff_0  ┆ diff_1 ┆ diff_2 │
│ ---     ┆ ---    ┆ ---    │
│ f64     ┆ f64    ┆ f64    │
╞═════════╪════════╪════════╡
│ -13.05  ┆ -99.14 ┆ 0.0    │
│ -210.96 ┆ 27.02  ┆ 137.32 │
└─────────┴────────┴────────┘

Then convert it to numpy and call the function:

import numpy.linalg
distance=numpy.linalg.norm(diffs.to_numpy(), axis=1)
pl.Series(distance).alias("distance")

┌────────────┐
│ distance   │
│ ---        │
│ f64        │
╞════════════╡
│ 99.99521   │
│ 253.161973 │
└────────────┘
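As a sanity check, with axis=1 the norm is just the ordinary Euclidean length of each row. Worked by hand for the first row of the diffs table:

```latex
\sqrt{(-13.05)^2 + (-99.14)^2 + 0^2}
  = \sqrt{170.3025 + 9828.7396}
  = \sqrt{9999.0421}
  \approx 99.99521
```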

Alternatively, you can calculate the Euclidean distance yourself:

df.select(
    [
        # squared difference along each of the 3 axes
        ((pl.col("c").arr.get(i) - pl.col("d").arr.get(i)) ** 2).alias(f"diff_{i}")
        for i in range(3)
    ]
).sum(axis=1).sqrt().alias("distance")

┌────────────┐
│ distance   │
│ ---        │
│ f64        │
╞════════════╡
│ 99.99521   │
│ 253.161973 │
└────────────┘

PS: scipy.spatial.distance.euclidean isn't a good fit here because it only handles one pair of points at a time, which would make it very slow in polars.

glebcom · Answer 2 · 2023-02-06T19:21:24.300

1

Solution with np.linalg.norm inside map

def l2_norm(s: pl.Series) -> pl.Series:
    # 1) difference: c-d
    diff = s.struct.field("c").to_numpy() - s.struct.field("d").to_numpy()
    # 2) apply np.linalg.norm()
    return pl.Series(diff).apply(
        lambda x: np.linalg.norm(np.array(x))
    )

df.with_columns([
    pl.struct(["c", "d"]).map(l2_norm).alias("distance")
])

┌────────────┐
│ distance   │
│ ---        │
│ f64        │
╞════════════╡
│ 99.99521   │
│ 253.161973 │
└────────────┘

edited Feb 06 '23 at 19:21

answered Feb 06 '23 at 19:16

glebcom


Thanks for the answer. That's a pity one can't mark two answers as solutions. – Artup Feb 06 '23 at 23:34
