
Based on the Polars documentation, one can use json_path_match to extract JSON fields into string series.

But can we do something like pandas.Series.map(json.loads) to convert the whole JSON string at once? One can then further convert the loaded JSON series into another dataframe with sane dtypes.

I know I can do it first in pandas, but I'm looking for a way in polars.

Saddle Point

1 Answer


I should first point out that there is a polars.read_json method. For example:

import polars as pl
import io

json_file = """[{"a":"1", "b":10, "c":[1,2,3]},
{"a":"2", "b":20, "c":[3,4,5]},
{"a":"3.1", "b":30.2, "c":[8,8,8]},
{"a":"4", "b":40.0, "c":[9,9,90]}]
"""

pl.read_json(io.StringIO(json_file))
shape: (4, 3)
┌─────┬──────┬────────────┐
│ a   ┆ b    ┆ c          │
│ --- ┆ ---  ┆ ---        │
│ str ┆ f64  ┆ list [i64] │
╞═════╪══════╪════════════╡
│ 1   ┆ 10.0 ┆ [1, 2, 3]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 20.0 ┆ [3, 4, 5]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ [8, 8, 8]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ 40.0 ┆ [9, 9, 90] │
└─────┴──────┴────────────┘

But to answer your specific question about JSON data already loaded into a Series, I think what you're looking for is the polars.Series.apply method, which applies a callable to each cell of a Polars Series.

For example, let's say we have the following JSON fields already loaded into a Series in a Polars DataFrame:

import json
import polars as pl

df = pl.DataFrame(
    {
        "json_val": [
            '{"a":"1", "b":10, "c":[1,2,3]}',
            '{"a":"2", "b":20, "c":[3,4,5]}',
            '{"a":"3.1", "b":30.2, "c":[8,8,8]}',
            '{"a":"4", "b":40.0, "c":[9,9,90]}',
        ]
    }
)
print(df)
shape: (4, 1)
┌─────────────────────────────────────┐
│ json_val                            │
│ ---                                 │
│ str                                 │
╞═════════════════════════════════════╡
│ {"a":"1", "b":10, "c":[1,2,3]}      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"2", "b":20, "c":[3,4,5]}      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"3.1", "b":30.2, "c":[8,8,8... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"4", "b":40.0, "c":[9,9,90]... │
└─────────────────────────────────────┘

We can use apply and the json.loads function. In this example, that will yield a Series of type struct:

df.select(pl.col("json_val").apply(json.loads))
shape: (4, 1)
┌──────────────────────────┐
│ json_val                 │
│ ---                      │
│ struct[3]{'a', 'b', 'c'} │
╞══════════════════════════╡
│ {"1",10,[1, 2, 3]}       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2",20,[3, 4, 5]}       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1",30,[8, 8, 8]}     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4",40,[9, 9, 90]}      │
└──────────────────────────┘

(One caution: notice how column b has been truncated to an integer.)

Depending on the structure of your JSON, you may be able to also use the polars.DataFrame.unnest function to split the json_val struct column into separate columns.

df.select(pl.col("json_val").apply(json.loads)).unnest("json_val")
shape: (4, 3)
┌─────┬─────┬────────────┐
│ a   ┆ b   ┆ c          │
│ --- ┆ --- ┆ ---        │
│ str ┆ i64 ┆ list [i64] │
╞═════╪═════╪════════════╡
│ 1   ┆ 10  ┆ [1, 2, 3]  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 20  ┆ [3, 4, 5]  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30  ┆ [8, 8, 8]  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ 40  ┆ [9, 9, 90] │
└─────┴─────┴────────────┘

Does this help get you started?

Edit: handling type-conversion issues

One general strategy that I use with any un-typed input file (especially csv files) is to return all values as a string/polars.Utf8 type. That way, I can explicitly convert types later, after I've had a chance to visually inspect the results. (I've been burned too often by "automatic" type conversions.)

The json.loads method has two helpful keyword arguments, parse_float and parse_int, that will help in this case. We can pass a simple lambda function to tell the JSON parser to leave integer and float values as strings.

# define our own translate function to keep floats/ints as strings
def json_translate(json_str: str):
    return json.loads(json_str, parse_float=lambda x: x, parse_int=lambda x: x)

df.select(pl.col("json_val").apply(f=json_translate))
shape: (4, 1)
┌────────────────────────────────┐
│ json_val                       │
│ ---                            │
│ struct[3]{'a', 'b', 'c'}       │
╞════════════════════════════════╡
│ {"1","10",["1", "2", "3"]}     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2","20",["3", "4", "5"]}     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1","30.2",["8", "8", "8"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4","40.0",["9", "9", "90"]}  │
└────────────────────────────────┘

Notice that all the integer and float values are left as strings, and remain so when we use the unnest function (the column headers below show "str").

df.select(pl.col("json_val").apply(f=json_translate)).unnest('json_val')
shape: (4, 3)
┌─────┬──────┬──────────────────┐
│ a   ┆ b    ┆ c                │
│ --- ┆ ---  ┆ ---              │
│ str ┆ str  ┆ list [str]       │
╞═════╪══════╪══════════════════╡
│ 1   ┆ 10   ┆ ["1", "2", "3"]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 20   ┆ ["3", "4", "5"]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ ["8", "8", "8"]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ 40.0 ┆ ["9", "9", "90"] │
└─────┴──────┴──────────────────┘

From this point, you can use Polars' cast expression to convert the strings to the specific numeric types that you want. Here's a Stack Overflow question that can help with the cast.

  • Thanks for the detailed explanation! I did search `map` but missed `apply`. Why does the mentioned caution happen in polars but not in pandas? – Saddle Point Apr 29 '22 at 06:18
  • Or can I specify `b` a float type? – Saddle Point Apr 29 '22 at 06:25
  • I'll edit my answer to include a strategy for dealing with type conversion issues. –  Apr 29 '22 at 16:21
  • 1
    I believe automatic casting returns `b` as an integer because in the first row, the value of `b` is `10`, rather than `10.0` or something that looks like a float. But we can get past this by importing all values as strings, and converting in a later step, once the data is in a Polars DataFrame. –  Apr 29 '22 at 16:44
  • I found that polars failed to load JSON array strings like `pl.DataFrame({"json_val": ['[{"x": 1, "y": 2}, {"x": 3, "y": 4}]']})` (but works for `'[]'`) using `apply`, do you have any idea of this? Thanks. – Saddle Point May 13 '22 at 07:37
  • 1
    For `[{"x": 1, "y": 2}, {"x": 3, "y": 4}]`, `json.loads` returns a `list of dictionaries` . The Polars `apply` method, however, cannot convert this list of dictionaries to a column in the Polars DataFrame. You may want to log a GitHub issue to see if/how the `apply` method can support this. –  May 14 '22 at 01:02