16

I used Metaflow to load a DataFrame. It was successfully unpickled from the artifact store, but when I try to view its index using df.index, I get the error ModuleNotFoundError: No module named 'pandas.core.indexes.numeric'. Why?

I've looked at other answers with similar error messages here and here, which say that this is caused by trying to unpickle a dataframe with older versions of Pandas. However, my error is slightly different, and it is not fixed by upgrading Pandas (pip install pandas -U).

Karl Knechtel
crypdick
  • "However, my error is slightly different" - your error is **completely unrelated**, as you are not trying to use `pickle` at all (and you would see it in the stack trace if Metaflow were doing so). That problem is a general issue with pickling that, in turn, has really nothing to do with Pandas. – Karl Knechtel Apr 06 '23 at 20:01
  • @KarlKnechtel Incorrect: Metaflow artifact tracking works using `pickle`. The DataFrame was unpickled from the Metaflow artifact store, which worked successfully, and then I tried to run a simple `df.index` operation. – crypdick Apr 07 '23 at 00:30
  • SO has a problem with hit-and-run Close votes based on incorrect assumptions, when clarification would've been more appropriate. – crypdick Apr 07 '23 at 00:39
  • I did not VTC based on some incorrect assumption. In particular, I did not VTC because of thinking the problem was unrelated to Pickle (although I did think that). I did VTC because I personally feel that Stack Overflow should not host bug reports. However, [consensus is firmly against me](https://meta.stackoverflow.com/questions/404646) so I have undone that and attempted to improve the question as much as possible. – Karl Knechtel Apr 07 '23 at 04:06
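For context on why a pickle problem can surface as a ModuleNotFoundError: pickle stores classes by their import path and re-imports that path on load. The sketch below reproduces the effect with the standard library only. The module name `legacy_index_module` and the class `NumericIndex` are hypothetical stand-ins, not real pandas or Metaflow objects.

```python
import pickle
import sys
import types


def demo_missing_module() -> str:
    """Pickle an object, remove its defining module, and return the load error."""
    # Hypothetical module and class standing in for a private library module.
    module_name = "legacy_index_module"
    module = types.ModuleType(module_name)
    cls = type("NumericIndex", (), {"__module__": module_name})
    module.NumericIndex = cls
    sys.modules[module_name] = module

    # The payload records "legacy_index_module" and "NumericIndex" by name.
    payload = pickle.dumps(cls())

    # Simulate an upgrade that removed the module.
    del sys.modules[module_name]
    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return str(exc)
    return "loaded without error"


print(demo_missing_module())
```

The same mechanism would apply to any pickled object whose class lives in a module that a later library version no longer ships.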

4 Answers

16

This issue is caused by the Pandas 2.0.0 release breaking backwards compatibility with pickles created under Pandas 1.x, although I don't see this documented in the release notes. The solution is to downgrade pandas to the 1.x series: pip install "pandas<2.0.0"
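Before downgrading, it may help to confirm that the module named in the error is really absent from the current environment. This is a minimal sketch using only the standard library; the helper name `module_available` is an assumption introduced here for illustration.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if the dotted module path can be located for import."""
    try:
        # find_spec imports parent packages, so a missing parent raises
        # ModuleNotFoundError instead of returning None.
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        return False


# The path from the error message in the question.
print(module_available("pandas.core.indexes.numeric"))
```

If this prints False in the environment that loads the artifact, the installed pandas does not provide the module the pickle refers to, which is consistent with the version mismatch described above.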

crypdick
  • "although I don't see this documented in the release notes." That's because it wasn't supposed to be used in the first place. See the top level of [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) documentation: "The pandas.core, pandas.compat, and pandas.util top-level modules are PRIVATE. Stable functionality in such modules is not guaranteed." You should report this as a bug against metaflow. – Karl Knechtel Apr 06 '23 at 19:58
  • (A pickled Dataframe, of course, is quite likely to contain references to such private/internal Pandas modules, because they were likely used in creating the Dataframe.) – Karl Knechtel Apr 06 '23 at 20:03
  • @KarlKnechtel I never claimed that this is a bug against Pandas. I am simply reporting a solution to a cryptic error I encountered while doing a basic df operation. This isn't even a bug against Metaflow, either, unless Pandas happens to push out a major release (happened this week for the first time in 3 years). – crypdick Apr 07 '23 at 00:32
  • I initially inferred that it was a bug against Metaflow because they should not have written code using that package, at all, for any reason. However, since they are wrapping the pickling process, they presumably don't really have a choice. More to the point: pickled Dataframes should be treated as **incompatible by default** across **any** two different versions of Pandas. This is in the nature of Pickle: it necessarily reaches into undocumented and private APIs, which library maintainers hold themselves free to break at any time (as is perfectly expected under semver). – Karl Knechtel Apr 07 '23 at 04:13
  • Hi all, to this day, pickle remains the best way (and, I think, the only way in the most general cases?) to ensure that saving and restoring a DataFrame to disk is the identity map. Yes, everyone knows that pickles will break (by design and definition) across versions (and an update to ANYTHING a data frame touches can cause this). Many libraries opt for this fragile pickle solution because alternatives are somehow worse. – MRule May 10 '23 at 13:42
  • What do I mean by worse? Well, `.json` and `.hdf5` were the most flexible options, last I checked, but even these would run into trouble if you were doing anything slightly fancy. For example, hierarchical rows or columns, or storing anything other than a simple primitive type, or mixing types in a weird way within a column, etc. – MRule May 10 '23 at 13:45
  • And then you need to decide: do I enforce these restrictions at all times, or only when I'm about to serialize to disk? If it's at all times, have you wrapped your DataFrame (in classic OOP style) so that all actions a user *can* take enforce the required invariants? Or, if it's just when serializing: have you written clear error messages and provided adequate helper routines to guide a user who is surprised when they suddenly can't save their data? Or, if you silently try to coerce the dataframe into e.g. an HDF5-safe format, will users encounter data loss or unwelcome surprises? – MRule May 10 '23 at 13:46
  • Probably the "correct" way to do this is to extend DataFrame into various classes like HDF5SafeDataFrame, JSONSafeDataFrame, etc., but this may require quite a bit of extra legwork if you want these derived classes to interoperate correctly with third-party code that expects a vanilla DataFrame? So yeah. I think there is a pretty strong argument that, in some cases, Pickle *is* the correct design choice. It has drawbacks, but sometimes the alternatives are just worse. – MRule May 10 '23 at 13:48
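One way to act on the point that pickles should be treated as version-specific is to record the library version next to the payload and check it before unpickling. The following is only a sketch under that assumption; `save_with_version`, `load_with_version`, and `VersionMismatchError` are hypothetical helpers, not part of pandas or Metaflow. The caller would pass something like `pandas.__version__` as `library_version`.

```python
import pickle
from pathlib import Path


class VersionMismatchError(RuntimeError):
    """Raised when a payload was written under a different library version."""


def save_with_version(obj, path, library_version):
    """Pickle obj inside an envelope that records the writer's library version."""
    envelope = {
        "library_version": library_version,
        "payload": pickle.dumps(obj),  # inner pickle kept as opaque bytes
    }
    Path(path).write_bytes(pickle.dumps(envelope))


def load_with_version(path, library_version):
    """Unpickle the payload only if it was written under the same version."""
    # The envelope holds only a str and bytes, so reading it does not
    # import any library-specific modules.
    envelope = pickle.loads(Path(path).read_bytes())
    written = envelope["library_version"]
    if written != library_version:
        raise VersionMismatchError(
            f"payload written with version {written}, "
            f"but the current version is {library_version}"
        )
    return pickle.loads(envelope["payload"])
```

The design choice here is to fail early with an explicit message instead of letting a missing private module surface later as a ModuleNotFoundError.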
15

Try using the pandas.read_pickle() method to load the file instead of the pickle module:

import pandas as pd

df = pd.read_pickle("file.pkl")

The pandas method should provide compatibility when reading older files; it "is only guaranteed to be backwards compatible to pandas 0.20.3 provided the object was serialized with to_pickle." In my tests with pandas 1.x, it could also read some files written with the pickle module.

Mike T
0

Try using pd.compat.pickle_compat.load(), as that was the only solution that worked in my case:

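import pandas as pd

# pickle_compat is a private pandas module; load() expects an open binary file handle, not a path
with open("file.pkl", "rb") as fh:
    df = pd.compat.pickle_compat.load(fh)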
toyota Supra
MR42
0

I don't know why this works, but joblib.load was failing to read the pickle with the same error, "No module named 'pandas.core.indexes.numeric'". Then I installed prefect and simple_salesforce, and somehow it now works. I'm not sure why, but I think it's worth mentioning.

Clinton Woods