I should first point out that there is a polars.read_json
method. For example:
import polars as pl
import io
json_file = """[{"a":"1", "b":10, "c":[1,2,3]},
{"a":"2", "b":20, "c":[3,4,5]},
{"a":"3.1", "b":30.2, "c":[8,8,8]},
{"a":"4", "b":40.0, "c":[9,9,90]}]
"""
pl.read_json(io.StringIO(json_file))
shape: (4, 3)
┌─────┬──────┬────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list [i64] │
╞═════╪══════╪════════════╡
│ 1 ┆ 10.0 ┆ [1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ [3, 4, 5] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ [8, 8, 8] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ [9, 9, 90] │
└─────┴──────┴────────────┘
But to answer your specific question about JSON data already loaded into a Series, I think what you're looking for is the polars.Series.apply
method, which will apply a callable function to each cell of a Polars Series.
For example, let's say we have the following JSON fields already loaded into a Series in a Polars DataFrame:
import json
import polars as pl
df = pl.DataFrame(
{
"json_val": [
'{"a":"1", "b":10, "c":[1,2,3]}',
'{"a":"2", "b":20, "c":[3,4,5]}',
'{"a":"3.1", "b":30.2, "c":[8,8,8]}',
'{"a":"4", "b":40.0, "c":[9,9,90]}',
]
}
)
print(df)
shape: (4, 1)
┌─────────────────────────────────────┐
│ json_val │
│ --- │
│ str │
╞═════════════════════════════════════╡
│ {"a":"1", "b":10, "c":[1,2,3]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"2", "b":20, "c":[3,4,5]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"3.1", "b":30.2, "c":[8,8,8... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"4", "b":40.0, "c":[9,9,90]... │
└─────────────────────────────────────┘
We can use apply
and the json.loads
function. In this example, that will yield a Series of type struct
:
df.select(pl.col("json_val").apply(json.loads))
shape: (4, 1)
┌──────────────────────────┐
│ json_val │
│ --- │
│ struct[3]{'a', 'b', 'c'} │
╞══════════════════════════╡
│ {"1",10,[1, 2, 3]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2",20,[3, 4, 5]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1",30,[8, 8, 8]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4",40,[9, 9, 90]} │
└──────────────────────────┘
(One caution, notice how column b
has been truncated to an integer.)
Depending on the structure of your JSON, you may be able to also use the polars.DataFrame.unnest
function to split the json_val
struct column into separate columns.
df.select(pl.col("json_val").apply(json.loads)).unnest("json_val")
shape: (4, 3)
┌─────┬─────┬────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ list [i64] │
╞═════╪═════╪════════════╡
│ 1 ┆ 10 ┆ [1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ [3, 4, 5] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30 ┆ [8, 8, 8] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40 ┆ [9, 9, 90] │
└─────┴─────┴────────────┘
Does this help get you started?
Edit: handling type-conversion issues
One general strategy that I use with any un-typed input file (especially csv files) is to return all values as a string/polars.Utf8
type. That way, I can explicitly convert types later, after I've had a chance to visually inspect the results. (I've been burned too often by "automatic" type conversions.)
The json.loads
method has two helpful keyword options parse_float
and parse_int
that will help in this case. We can use a simple lambda function to tell the json parser to leave integer and float columns as strings.
# define our own translate function to keep floats/ints as strings
def json_translate(json_str: str):
return json.loads(json_str, parse_float=lambda x: x, parse_int=lambda x: x)
df.select(pl.col("json_val").apply(f=json_translate))
shape: (4, 1)
┌────────────────────────────────┐
│ json_val │
│ --- │
│ struct[3]{'a', 'b', 'c'} │
╞════════════════════════════════╡
│ {"1","10",["1", "2", "3"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2","20",["3", "4", "5"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1","30.2",["8", "8", "8"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4","40.0",["9", "9", "90"]} │
└────────────────────────────────┘
Notice that all the integer and float values are left as strings, and remain so when we use the unnest
function (the column headers below show "str").
df.select(pl.col("json_val").apply(f=json_translate)).unnest('json_val')
shape: (4, 3)
┌─────┬──────┬──────────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list [str] │
╞═════╪══════╪══════════════════╡
│ 1 ┆ 10 ┆ ["1", "2", "3"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ ["3", "4", "5"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ ["8", "8", "8"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ ["9", "9", "90"] │
└─────┴──────┴──────────────────┘
From this point, you can use Polars' cast
expression to convert the strings to the specific numeric types that you want. Here's a Stack Overflow question that can help with the cast
.