
I have a simple pydantic model with nested data structures. I want to be able to save and load instances of this model as a .json file.

All models inherit from a Base class with simple configuration.

class Base(pydantic.BaseModel):
    class Config:
        extra = 'forbid'   # forbid use of extra kwargs

There are some simple data models with inheritance

class Thing(Base):
    thing_id: int

class SubThing(Thing):
    name: str

And a Container class, which holds a Thing

class Container(Base):
    thing: Thing

I can create a Container instance and save it as .json

# make instance of container
c = Container(
    thing = SubThing(
        thing_id=1,
        name='my_thing')
)

json_string = c.json(indent=2)
print(json_string)

"""
{
  "thing": {
    "thing_id": 1,
    "name": "my_thing"
  }
}
"""

but the json string does not specify that the `thing` field was constructed from a `SubThing`, so when I try to load this string into a new `Container` instance, I get an error.

c = Container.parse_raw(json_string)
"""
Traceback (most recent call last):
  File "...", line 36, in <module>
    c = Container.parse_raw(json_string)
  File "pydantic/main.py", line 601, in pydantic.main.BaseModel.parse_raw
  File "pydantic/main.py", line 578, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 406, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Container
thing -> name
  extra fields not permitted (type=value_error.extra)
"""

Is there a simple way to save the Container instance while retaining information about the thing class type such that I can reconstruct the initial Container instance reliably? I would like to avoid pickling the object if possible.

One possible solution is to serialize manually, for example using a recursive helper:

def serialize(attr_name, attr_value, dictionary=None):
    if dictionary is None:
        dictionary = {}
    if not isinstance(attr_value, pydantic.BaseModel):
        dictionary[attr_name] = attr_value
    else:
        sub_dictionary = {}
        for (sub_name, sub_value) in attr_value:
            serialize(sub_name, sub_value, dictionary=sub_dictionary)
        dictionary[attr_name] = {type(attr_value).__name__: sub_dictionary}
    return dictionary


# note: this assumes `Container` also defines a `container_name: str` field
c1 = Container(
    container_name='my_container',
    thing=SubThing(
        thing_id=1,
        name='my_thing')
)

from pprint import pprint as print
print(serialize('Container', c1))

{'Container': {'Container': {'container_name': 'my_container',
                             'thing': {'SubThing': {'name': 'my_thing',
                                                    'thing_id': 1}}}}}

but this gets rid of most of the benefits of leveraging the package for serialization.
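For completeness, the inverse direction can be sketched as well. The `MODEL_REGISTRY` name-to-class mapping below is my own addition (not part of the code above): anything serialized as a single-key dict whose key is a registered class name is treated as a nested model and rebuilt recursively.

```python
import pydantic


class Base(pydantic.BaseModel):
    class Config:
        extra = 'forbid'


class Thing(Base):
    thing_id: int


class SubThing(Thing):
    name: str


# Hypothetical registry mapping class names back to model classes;
# every model that can appear in the serialized output must be listed here.
MODEL_REGISTRY = {cls.__name__: cls for cls in (Thing, SubThing)}


def deserialize(dictionary):
    """Invert the `serialize` format: {'ClassName': {field: value, ...}}."""
    (class_name, fields), = dictionary.items()
    kwargs = {}
    for name, value in fields.items():
        # a single-key dict whose key is a registered class is a nested model
        if (isinstance(value, dict) and len(value) == 1
                and next(iter(value)) in MODEL_REGISTRY):
            kwargs[name] = deserialize(value)
        else:
            kwargs[name] = value
    return MODEL_REGISTRY[class_name](**kwargs)


t = deserialize({'SubThing': {'thing_id': 1, 'name': 'my_thing'}})
assert isinstance(t, SubThing) and t.name == 'my_thing'
```

This keeps the round trip type-aware, but like the manual serializer it bypasses most of what pydantic would otherwise do for you.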

twhughes

  • why are you using `pydantic` in any case - like do you benefit from the validations it provides? just curious – rv.kvetch Sep 19 '21 at 22:04
  • yes, I use it mainly for the validations, but in principle I could use something else. This is an extremely simplified version of my actual application. – twhughes Sep 19 '21 at 22:14
  • doing only a cursory look on the web, it looks like this is a known problem that `pydantic` doesn't support loading nested json to a model class, yet there are plans for future support in this use case. I was actually surprised that pydantic doesn't parse a dict to a nested model - seems like a common enough use case to me. – rv.kvetch Sep 19 '21 at 22:15
  • Using a root validator as mentioned [here](https://github.com/samuelcolvin/pydantic/issues/1189#issuecomment-578084930) might also work – rv.kvetch Sep 19 '21 at 22:16
  • Hm, thanks for the help. Do you know if this issue exists for other packages, such as dataclasses? What would you recommend to handle serialization of nested dataclass-like objects like this? Note that `dict(c)` seems to retain some field information, so one brute force option would be to write my own serializer, but I'd prefer leveraging the package – twhughes Sep 19 '21 at 22:21
  • I've tested serialization with dataclasses and that works perfectly for the most part. I did notice an issue with some field types, namely `defaultdict` fields for example. It looks like dataclasses doesn't handle serialization of such field types as expected (I guess it treats it as a normal dict). You can use the `dataclasses.asdict()` helper function to serialize a dataclass instance, which also works for nested dataclasses. The only problem is de-serializing it back from a dict, which unfortunately seems to be a missing link in dataclasses. – rv.kvetch Sep 20 '21 at 00:12
  • If you're interested in using dataclasses, you can take a look at [this answer](https://stackoverflow.com/questions/69128123/nested-python-dataclasses-with-list-annotations/69133191#69133191) that I added a while back, as it might be useful. With such a library you can simply use dataclasses and it will provide (de)serialization with minimal changes. It also supports loading a nested dataclass structure from any plain dict. – rv.kvetch Sep 20 '21 at 00:15
  • Added also a separate answer below (and an approach that I was able to get working with `pydantic` for this use case) – rv.kvetch Sep 20 '21 at 02:08

2 Answers


Try this solution, which I was able to get working with pydantic. It's a bit ugly and somewhat hackish, but at least it works as expected.

import pydantic


class Base(pydantic.BaseModel):
    class Config:
        extra = 'forbid'   # forbid use of extra kwargs


class Thing(Base):
    thing_id: int


class SubThing(Thing):
    name: str


class Container(Base):
    thing: Thing

    def __init__(self, **kwargs):
        # This answer helped steer me towards this solution:
        #   https://stackoverflow.com/a/66582140/10237506
        # only convert when a plain dict is passed in; model instances
        # (SubThing or Thing) are left untouched
        if isinstance(kwargs.get('thing'), dict):
            kwargs['thing'] = SubThing(**kwargs['thing'])
        super().__init__(**kwargs)


def main():
    # make instance of container
    c1 = Container(
        thing=SubThing(
            thing_id=1,
            name='my_thing')
    )

    d = c1.dict()
    print(d)
    # {'thing': {'thing_id': 1, 'name': 'my_thing'}}

    # Now it works!
    c2 = Container(**d)

    print(c2)
    # thing=SubThing(thing_id=1, name='my_thing')
    
    # assert that the values for the de-serialized instance is the same
    assert c1 == c2


if __name__ == '__main__':
    main()

If you don't need some of the features that pydantic provides, such as data validation, you can use normal dataclasses easily enough. You can pair them with a (de)serialization library like dataclass-wizard, which provides automatic case transforms and type conversion (for example, string to annotated int) and works much the same as pydantic. Here is a straightforward usage:

from dataclasses import dataclass

from dataclass_wizard import asdict, fromdict


@dataclass
class Thing:
    thing_id: int


@dataclass
class SubThing(Thing):
    name: str


@dataclass
class Container:
    # Note: I had to update the annotation to `SubThing`. Otherwise,
    # when de-serializing, it creates a `Thing` instance, which is not
    # what we want.
    thing: SubThing


def main():
    # make instance of container
    c1 = Container(
        thing=SubThing(
            thing_id=1,
            name='my_thing')
    )

    d = asdict(c1)
    print(d)
    # {'thing': {'thingId': 1, 'name': 'my_thing'}}

    # De-serialize a dict object in a new `Container` instance
    c2 = fromdict(Container, d)

    print(c2)
    # Container(thing=SubThing(thing_id=1, name='my_thing'))

    # assert that the values for the de-serialized instance is the same
    assert c1 == c2


if __name__ == '__main__':
    main()
rv.kvetch

  • Thanks for the suggestion, I think the 2nd one is not quite right because Container.thing now accepts SubThing, not Thing. – twhughes Sep 20 '21 at 03:28
  • Yep, you are right. I had to change the annotation to `thing: SubThing` as otherwise it'll try to load a dict into a `Thing` type. I'll update the answer to clarify. – rv.kvetch Sep 20 '21 at 03:47
  • actually, I see the problem now. I guess I didn't read the question above too carefully. – rv.kvetch Sep 20 '21 at 03:56
  • No, but there are several other subclasses of `Thing`, so it should be able to handle those. – twhughes Sep 20 '21 at 03:56
  • I wrote a quick recursive function to serialize it by hand. I will edit my original question with the code. – twhughes Sep 20 '21 at 03:57
  • No worries, it's kind of an obscure question and I appreciate the help. I'm also surprised this doesn't seem to be supported natively in pydantic or dataclasses. – twhughes Sep 20 '21 at 04:00
  • Yep, no problem, and I get what you were asking now. I guess if you wanted to annotate a field as a more generic class such as `Thing` but populate it with a subclass such as `SubThing` later, that would make it a bit difficult to de-serialize back into a Container (since it would be looking to create a `Thing`, based on the annotation). A custom serializer with pydantic should hopefully work out for this use case. – rv.kvetch Sep 20 '21 at 04:06
  • I came up with a temporary solution to handle subclasses of `Thing` only: just declare `thing` as type `Union[SubThing1, SubThing2, ...]`. This covers my use case, with the exception that if the subclasses have the same kwarg signature, pydantic will simply choose to serialize `thing` as `SubThing1`, no matter how it was initialized. It would be nice if pydantic supported making the serialization type-aware, to handle subclasses like this. But for now I think this solution will do. – twhughes Sep 20 '21 at 15:08
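The `Union` workaround from the last comment can be sketched as follows. `OtherSubThing` is a made-up second subclass for illustration; the approach relies on the subclasses having different field signatures, since pydantic tries the union members until one validates:

```python
from typing import Union

import pydantic


class Base(pydantic.BaseModel):
    class Config:
        extra = 'forbid'   # forbid use of extra kwargs


class Thing(Base):
    thing_id: int


class SubThing(Thing):
    name: str


class OtherSubThing(Thing):
    label: str


class Container(Base):
    # a dict with a 'label' key cannot validate as SubThing (extra
    # field forbidden, 'name' missing), so it falls through to
    # OtherSubThing. Subclasses with identical field names would
    # still be ambiguous, as noted in the comment.
    thing: Union[SubThing, OtherSubThing]


c1 = Container(thing=OtherSubThing(thing_id=1, label='x'))
c2 = Container.parse_raw(c1.json())
assert isinstance(c2.thing, OtherSubThing)
```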

Since pydantic 2.0, pydantic no longer serializes by the runtime type of nested models; by default it dumps only the fields declared on the annotated type when outputting to dict, string, json, etc.

They do this to

[...] ensure that you know precisely which fields could be included when serializing, even if subclasses get passed when instantiating the object. In particular, this can help prevent surprises when adding sensitive information like secrets as fields of subclasses.

See the migration warning here.

The suggested solution is to serialize with duck typing, using `SerializeAsAny`:

from pydantic import BaseModel, SerializeAsAny

class Thing(BaseModel):
    thing_id: int

class SubThing(Thing):
    name: str

class Container(BaseModel):
    thing: SerializeAsAny[Thing]

This seemed to solve the problem for me: .dict() and .model_dump() now work as intended.
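For illustration, a quick check of the behavior under pydantic v2 (reusing the models from this answer). Note that `SerializeAsAny` only affects *dumping*; validating the dump back still builds a plain `Thing` from the annotation, so full round-tripping still needs one of the other approaches:

```python
from pydantic import BaseModel, SerializeAsAny


class Thing(BaseModel):
    thing_id: int


class SubThing(Thing):
    name: str


class Container(BaseModel):
    thing: SerializeAsAny[Thing]


c = Container(thing=SubThing(thing_id=1, name='my_thing'))
d = c.model_dump()
assert d == {'thing': {'thing_id': 1, 'name': 'my_thing'}}

# Caveat: validation still follows the `Thing` annotation, so the
# extra 'name' key is silently ignored (Thing doesn't forbid extras
# here) and c2.thing comes back as a plain Thing.
c2 = Container.model_validate(d)
assert type(c2.thing) is Thing
```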

Energeneer