
Context

I work with python 3.9.6 and pandas 1.3.0.
My colleague works with python 3.6.12 and pandas 1.1.5.
I want to create a dataframe and share it with my colleague, without asking them to update their environment (that request would incur some hassle).

Question

How can I write out a dataframe to a file using my newer python/pandas versions in a way that their older python/pandas versions can read it in as a dataframe?

What I've tried or looked into

Default .to_pickle() method
If in the newer python environment I write:

df.to_pickle(r"C:\somepath\file.bz2")

and in the older python environment I try:

pd.read_pickle(r"C:\somepath\file.bz2")

I get:

ValueError: unsupported pickle protocol: 5
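
For reference, pandas' to_pickle defaults to pickle.HIGHEST_PROTOCOL, which is 5 on Python 3.8+, whereas Python 3.6 can read at most protocol 4. A quick illustrative check on each machine:

import pickle

print(pickle.HIGHEST_PROTOCOL)  # 5 on my Python 3.9, 4 on my colleague's Python 3.6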

Specifying a protocol version in the .to_pickle() method
Fine, I thought, I'll specify a different protocol.

df.to_pickle(r"C:\somepath\file.bz2", protocol=3)

However, if in the older python environment I try to load it, I get

AttributeError: module 'pandas.core.internals.blocks' has no attribute 'new_block'

This error remains for all protocol versions from 0 to 5.

Previous question on protocol version
I found this question, which only has the answer that the pandas versions must match.
I find it hard to believe that's the only solution; if it were, what would be the point of having multiple pickle protocols that are meant to be backwards compatible?

Previous question on the new_block attribute
This question mentions the same error with the missing new_block attribute. Again, the answer is to update the pandas version (over which I have no control at the moment).

Downgrading the newer python/pandas versions
I could downgrade my newer python/pandas to match my colleague's versions.
Haven't tried it yet, but I assume that should work. However, that would really be a last resort, as then I would need a special "low version" environment to work with this one colleague.

Exporting to CSV
This works, but it loses DataFrame-specific features such as data types and NaN values, so I don't consider it a valid workaround.
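
As a quick illustration of the dtype loss (a toy frame, not my actual data):

import io

import pandas as pd

toy = pd.DataFrame({
    "ts": pd.to_datetime(["2021-09-07", "2021-09-08"]),
    "value": [1.5, None],
})
print(toy.dtypes)               # ts is datetime64[ns], value is float64

buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
print(pd.read_csv(buf).dtypes)  # ts comes back as object (plain strings)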

Pickling separately
I thought maybe the issue lies in the pandas .to_pickle() or .read_pickle() method, so I tried using the pickle library directly to write the file (using protocol 3):

import pickle

with open('file.pkl', 'wb') as f:
    pickle.dump(df, f, 3)

... and then read it in the older python environment:

import pickle

with open('file.pkl', 'rb') as f:
    df = pickle.load(f)

Unfortunately, I am still met with

AttributeError: module 'pandas.core.internals.blocks' has no attribute 'new_block'

Converting to a dict, then pickling that
Per the suggestion in the comments I tried:

ddf = df.to_dict()

with open('file.pkl', 'wb') as f:
    pickle.dump(ddf, f, 3)

But then, when I try to read it in the older environment, I get:

AttributeError: Can't get attribute '_unpickle_timestamp' on <module 'pandas._libs.tslibs.timestamps

My DataFrame has a timestamp column in it, which apparently cannot be unpickled by the older pandas version.
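
Presumably that is because to_dict() keeps the timestamps as pandas Timestamp objects, which the reading side still has to reconstruct with its own pandas. A quick check (the column name "ts" is just a placeholder for my actual column):

ddf = df.to_dict()
print(type(next(iter(ddf["ts"].values()))))  # still a pandas Timestamp, not a builtin datetime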

Saaru Lindestøkke
  • Maybe saving the dataframe as `csv`? Or convert the dataframe to dictionary with `.to_dict()` and pickle this dictionary (with correct protocol) - then recreate the dataframe from this dictionary. – Andrej Kesely Sep 07 '21 at 17:50
  • When saving to CSV one loses data types and NaN values, so that is not preferable. I'll look into the `to_dict()` method, but I'm hoping there's a single step method to do this. – Saaru Lindestøkke Sep 07 '21 at 20:35
  • @AndrejKesely Just tried your suggestion, unfortunately it runs into another incompatibility. – Saaru Lindestøkke Sep 08 '21 at 07:35
  • Why don't you use a virtual environment with the lower version requirements? – iacob Sep 08 '21 at 07:53
  • Pandas 1.1.5 is compatible with Python 3.9 – if it's impossible for your colleague to upgrade, then maybe you should downgrade to 1.1.5? – AKX Sep 08 '21 at 07:54
  • @iacob, I've considered that but: "then I would need a special 'low version' environment to work with this one colleague." – Saaru Lindestøkke Sep 08 '21 at 08:02
  • @AKX, that's a good suggestion, but see my comment above. Additionally I would then need to potentially miss out on the improvements/bug fixes of later pandas versions. All in all I would like to prevent up/downgrading libraries. – Saaru Lindestøkke Sep 08 '21 at 08:02
  • @SaaruLindestøkke what's wrong with that? This is the point of virtual environments, to enable multiple people to work with the same version/library requirements in a sandboxed way. Your colleague could instead use a virtual environment to match your setup. – iacob Sep 08 '21 at 08:19
  • Yeah, if your colleague isn't using virtualenvs, they should. Then you can use e.g. [`pip-tools`](https://github.com/jazzband/pip-tools) to lock all requirement versions for this project. – AKX Sep 08 '21 at 08:27
  • I'll (reluctantly) set up another conda env if that's the only way. I find it simply hard to believe that pickled dataframes cannot be made backwards compatible between version `1.3` and `1.1`. After all, it's only [a minor semantic versioning difference](https://semver.org/) and that should yield "functionality in a backwards compatible manner". – Saaru Lindestøkke Sep 08 '21 at 08:38

1 Answer


Why the above don't work

what's the point of having multiple pickle protocols which are meant to be backward compatible?

The pickle protocols govern how different versions of Python encode and read the pickled byte stream. They do not convert objects built with a recent version of a specific library into objects that an older version of that library can reconstruct.
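
You can see this by inspecting the pickle stream itself: whatever protocol you choose, the stream still names the reconstruction helpers of the pandas version that wrote it, and the reading environment must be able to import them. A rough sketch (the exact names in the output depend on the writing pandas version):

import pickle
import pickletools

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# The protocol only changes the byte-level encoding of the stream...
payload = pickle.dumps(df, protocol=3)

# ...but disassembling it shows the module attributes (e.g. from
# pandas.core.internals.blocks) that the reader must be able to import.
pickletools.dis(payload)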

it's only a minor semantic versioning difference and that should yield "functionality in a backwards compatible manner".

You're misunderstanding this: it means that code written against the earlier version will still function as expected under the newer version. It does not mean that new functionality introduced in the recent version (or objects created using it) will work in the older one.

Resolving version conflicts with a venv

If this is a shared project with multiple collaborators, your development should happen in a virtual environment where you can install the exact requirements of the project (both the Python version and the library versions), so you don't run into conflicts with your global Python install.

Both you and your colleague can then work from within your venvs with full confidence that you are using compatible libraries and functionality.

It is very straightforward to set up: effectively, you just create a new folder with its own Python install, and any libraries installed from within the venv are stored there. This local version of Python only sees those libraries. That is what a requirements.txt file for a project is for - it defines the libraries and version requirements of the project.

When you are done with it you can easily delete the folder.


Steps:

  1. Create a virtual environment named e.g. /my/env (note: the --upgrade-deps flag requires Python 3.9+; drop it on older versions):
    python -m venv /my/env --upgrade-deps
    
  2. Activate your venv (the specific command depends on your OS).
  3. Install your project dependencies to the venv:
    pip install -r requirements.txt
    

You can easily create a requirements.txt file like so:

pip install pipreqs

pipreqs /path/to/project
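
For this project the resulting requirements.txt could be as small as a single pin; the version below is simply the one your colleague currently has installed:

pandas==1.1.5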

Manual solution

You could manually convert the timestamp column to a datatype the earlier version of pandas can handle (e.g. plain text) before pickling it, and convert it back after loading, as sketched below.
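
A minimal sketch of that idea, assuming the only pandas-specific values in your frame are in the timestamp column (the column name "ts" is hypothetical): convert it to plain strings, pickle the resulting plain-Python dict, and restore the dtype on the other side.

In the newer environment:

import pickle

df_out = df.copy()
df_out["ts"] = df_out["ts"].astype(str)     # timestamps -> plain strings

with open("file.pkl", "wb") as f:
    pickle.dump(df_out.to_dict(), f, 3)     # protocol 4 or lower is readable on Python 3.6

In the older environment:

import pickle

import pandas as pd

with open("file.pkl", "rb") as f:
    df_in = pd.DataFrame(pickle.load(f))

df_in["ts"] = pd.to_datetime(df_in["ts"])   # restore the datetime dtype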

iacob
  • Thanks for the answer. I am surprised to learn about the actual meaning of backward compatibility. Regarding the venv: I'm well aware of that option. If only I were in a situation where others wouldn't see it as a hassle to create envs. In general people here have one single env for everything. I will probably create an environment myself, but as my colleague's environment can change anytime (perhaps they decide to update/downgrade something) this is not a sustainable solution. The manual approach sounds interesting. Do you know where I can find what datatypes are supported per pandas version? – Saaru Lindestøkke Sep 09 '21 at 07:20