
This is a continuation of my previous question, where user glebcom helped me convert coordinates from a string to a list of float64 values. In that answer I found two methods for calculating the distance between coordinates:

  1. using the formula numpy.linalg.norm(a-b)
  2. using scipy: from scipy.spatial import distance, then dst = distance.euclidean(a, b)

How can I apply one of these formulas to calculate the distance between the coordinates in columns c and d of a polars data frame?

import polars as pl
from scipy.spatial import distance
import numpy as np
pl.Config.set_fmt_str_lengths(2000)
data={"a": ["782.83    7363.51    6293    40   PD","850.68    7513.1    6262.17    40   PD"], "b": ["795.88    7462.65    6293    40   PD","1061.64    7486.08    6124.85    40   PD"]}
df=pl.DataFrame(data)
df=df.with_columns([
    pl.col("a").str.replace_all(r" +", " ")
        .str.split(" ").arr.slice(0,3)
        .cast(pl.List(pl.Float64)).alias("c"),
    pl.col("b").str.replace_all(r" +", " ")
        .str.split(" ").arr.slice(0,3)
        .cast(pl.List(pl.Float64)).alias("d")
])
print(df)

My attempts were:

df=df.with_columns(np.linalg.norm(pl.col("C")-pl.col("d")).alias("distance"))
or
df=df.with_columns(distance(pl.col("C"),pl.col("d")).alias("distance"))

but none of the above works. Thanks in advance for your assistance.

Artur

glebcom
Artup

2 Answers


You won't be able to call numpy.linalg.norm directly on your polars data frame. To get one distance per row, it needs a numpy array of shape (N, n), where N is the number of points and n is the number of dimensions (3 here).
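To make the shape requirement concrete, here is a minimal numpy-only sketch. It assumes the coordinates from the question have already been parsed into two (N, 3) arrays; the names c, d, diffs, distances and overall are just illustrative.

```python
import numpy as np

# Coordinates from the question's sample data:
# one row per point, one column per axis (x, y, z).
c = np.array([[782.83, 7363.51, 6293.0],
              [850.68, 7513.1, 6262.17]])
d = np.array([[795.88, 7462.65, 6293.0],
              [1061.64, 7486.08, 6124.85]])

diffs = c - d                               # shape (2, 3)

# axis=1 computes one norm per row, i.e. one distance per pair of points.
distances = np.linalg.norm(diffs, axis=1)   # shape (2,)

# Without axis, a 2-D input yields a single number for the whole matrix.
overall = np.linalg.norm(diffs)

print(diffs.shape, distances.shape)
print(distances)
```

Forgetting axis=1 is an easy mistake here: the call still succeeds, but it returns one scalar instead of one distance per row.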

You can prepare the data yourself, pass it to numpy, and put the results back into polars.

First, calculate the difference between the coordinates of your two points, across all 3 dimensions:

diffs = df.select(
    [
        (pl.col("c").arr.get(i) - pl.col("d").arr.get(i)).alias(f"diff_{i}")
        for i in range(3)
    ]
)
┌─────────┬────────┬────────┐
│ diff_0  ┆ diff_1 ┆ diff_2 │
│ ---     ┆ ---    ┆ ---    │
│ f64     ┆ f64    ┆ f64    │
╞═════════╪════════╪════════╡
│ -13.05  ┆ -99.14 ┆ 0.0    │
│ -210.96 ┆ 27.02  ┆ 137.32 │
└─────────┴────────┴────────┘

Then convert it to numpy and call the function:

import numpy.linalg
distance=numpy.linalg.norm(diffs.to_numpy(), axis=1)
pl.Series(distance).alias("distance")
┌────────────┐
│ distance   │
│ ---        │
│ f64        │
╞════════════╡
│ 99.99521   │
│ 253.161973 │
└────────────┘

Alternatively, you can calculate the Euclidean distance yourself, as the square root of the sum of the squared differences:

df.select(
    [
        (pl.col("c").arr.get(i) - pl.col("d").arr.get(i)).alias(f"diff_{i}") ** 2
        for i in range(3)
    ]
).sum(axis=1).sqrt()
┌────────────┐
│ distance   │
│ ---        │
│ f64        │
╞════════════╡
│ 99.99521   │
│ 253.161973 │
└────────────┘
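As a quick sanity check of the formula itself, independent of polars, the first row can be worked through in plain Python. This is only a sketch; the tuples c and d hold the first pair of coordinates from the question, and by_hand and builtin are illustrative names.

```python
import math

# First pair of points from the question's sample data.
c = (782.83, 7363.51, 6293.0)
d = (795.88, 7462.65, 6293.0)

# Square the per-axis differences, sum them, then take the square root.
squared = [(ci - di) ** 2 for ci, di in zip(c, d)]
by_hand = math.sqrt(sum(squared))

# math.dist (Python 3.8+) computes the same Euclidean distance.
builtin = math.dist(c, d)

print(by_hand, builtin)
```

Both values should agree with the first entry of the distance column above.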

P.S. scipy.spatial.distance.euclidean is a poor fit here because it only handles one pair of points at a time, so it would have to be called row by row, which would make it very slow in polars.

0x26res

Solution with np.linalg.norm inside map

def l2_norm(s: pl.Series) -> pl.Series:
    # 1) difference: c-d
    diff = s.struct.field("c").to_numpy() - s.struct.field("d").to_numpy()
    # 2) apply np.linalg.norm()
    return pl.Series(diff).apply(
        lambda x: np.linalg.norm(np.array(x))
    )

df.with_columns([
    pl.struct(["c", "d"]).map(l2_norm).alias("distance")
])
┌────────────┐
│ distance   │
│ ---        │
│ f64        │
╞════════════╡
│ 99.99521   │
│ 253.161973 │
└────────────┘
glebcom