
This is a problem which I suspect is common, but I haven't found a solution for it. What I want is quite simple, and seemingly technically feasible: I have a simple python class, and I want to store it on disc, instance and definition, in a single file. Pickle will store the data, but it doesn't store the class definition. One might argue that the class definition is already stored in my .py file, but I don't want a separate .py file; my goal is to have a self-contained single file that I could pop back into my namespace with a single line of code.

So yes, I know this is possible using two files and two lines of code, but I want it in one file and one line of code. The reason is that I often find myself in this situation: I'm working on some big dataset, manipulating it in python, and then having to write my sliced, diced and transformed data back into some preexisting directory structure. What I don't want is to litter these data directories with ill-named python class stubs to keep my code and data associated, and what I want even less is the hassle of keeping track of and organizing all these little ad hoc classes, defined on the fly in a script, independently.

So the convenience isn't so much in code readability, but in effortless and unfudgable association between code and data. That seems like a worthy goal to me, even though I understand it isn't appropriate in most situations.

So the question is: is there a package or code snippet that does such a thing? I can't seem to find one.
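To make the starting point concrete, here is a minimal sketch of what plain pickle writes for an instance. It uses the standard library's `Fraction` as a stand-in for a user-defined class (that choice is an assumption made only to keep the example self-contained); the payload names the module and class, and contains no class code.

```python
import pickle
from fractions import Fraction

value = Fraction(3, 4)
data = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)

# The payload names the module and the class needed to rebuild the object...
has_module_name = b"fractions" in data
has_class_name = b"Fraction" in data

# ...but it is only a few dozen bytes long, so it cannot hold the
# class's source or bytecode. Loading re-imports the class by name.
payload_size = len(data)
restored = pickle.loads(data)
```

Loading such a file therefore only works where the named module can be imported, which is exactly the second file the question wants to avoid.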

martineau
Eelco Hoogendoorn
  • I'd recommend avoiding pickle for long-term data storage: it is so very fragile. Try using a `dict` with json, or HDF5 with h5py. This doesn't answer your question, so it's a comment, but I honestly think it's a more viable long-term solution. – Seth Johnson Jul 17 '11 at 19:25
  • Hmm, it is precisely the self-documenting nature over the long term which I was looking for. Note that the code not changing is an integral part of the whole scheme; I scripted something and want to check on that a month later; the original script might be gone for all I care, but getting back myobject.myattribute should be plenty self-documenting for my needs. – Eelco Hoogendoorn Jul 18 '11 at 06:08
  • Further, JSON is text (no?), which would be wildly inefficient, and HDF5 requires me to store a file format if I am to have any chance of interpreting that data later, which I'm seeking to avoid. – Eelco Hoogendoorn Jul 18 '11 at 06:14
  • Pickle data isn't self-documenting: it isn't even guaranteed to be valid from one version of Python to the next, across computers, or anything. JSON is text, but speed in importing might not be your biggest concern. With either JSON or HDF5, I'm sure you can figure out a reasonable way of saving the data so it's extendable and readable later. – Seth Johnson Jul 18 '11 at 14:56
  • Attributes have names; that's all the documentation I'm asking for. If only pickle would save these attribute names, I wouldn't have to retype them. I don't care about validity between python versions or computers, since I'm the only person who will ever bother with these files. I don't worry about having to try to open my file with both 2.6 and 2.7; what I worry about is being left with a stream of bytes and not having a clue what it means. Speed in importing and storage space are concerns, considering these files routinely run into the gigabytes. Can we get back to answering my question now? – Eelco Hoogendoorn Jul 18 '11 at 20:13
  • "a stream of bytes and not having a clue what it means" is exactly the description of Pickle. Pickled objects are bytecode (look up how Pickle works): if you change your class in any number of ways, that stream of code will be useless. – Seth Johnson Jul 18 '11 at 20:40
  • Does this answer your question? [Pickling a class definition](https://stackoverflow.com/questions/2626636/pickling-a-class-definition) – mkrieger1 Apr 17 '22 at 21:05

2 Answers


If you use dill, it enables you to treat __main__ as if it were a python module (for the most part). Hence, you can serialize interactively defined classes, and the like. dill also (by default) can transport the class definition as part of the pickle.

>>> class MyTest(object):
...   def foo(self, x):
...     return self.x * x
...   x = 4
... 
>>> f = MyTest() 
>>> import dill
>>>
>>> with open('test.pkl', 'wb') as s:
...   dill.dump(f, s)
... 
>>> 

Then shut down the interpreter, and send the file test.pkl over TCP. On your remote machine, now you can get the class instance.

Python 2.7.9 (default, Dec 11 2014, 01:21:43) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> with open('test.pkl', 'rb') as s:
...   f = dill.load(s)
... 
>>> f
<__main__.MyTest object at 0x1069348d0>
>>> f.x
4
>>> f.foo(2)
8
>>>

But how to get the class definition? So this is not exactly what you wanted. The following is, however.

>>> class MyTest2(object):
...   def bar(self, x):
...     return x*x + self.x
...   x = 1
... 
>>> import dill
>>> with open('test2.pkl', 'wb') as s:
...   dill.dump(MyTest2, s)
... 
>>>

Then after sending the file… you can get the class definition.

Python 2.7.9 (default, Dec 11 2014, 01:21:43) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> with open('test2.pkl', 'rb') as s:
...   MyTest2 = dill.load(s)
... 
>>> print dill.source.getsource(MyTest2)
class MyTest2(object):
  def bar(self, x):
    return x*x + self.x
  x = 1

>>> f = MyTest2()
>>> f.x
1
>>> f.bar(4)
17

Since you were looking for a one-liner, I can do better. I haven't yet shown that you can send over the class and the instance at the same time, and maybe that's what you wanted.

>>> import dill
>>> class Foo(object): 
...   def bar(self, x):
...     return x+self.x
...   x = 1
... 
>>> b = Foo()
>>> b.x = 5
>>> 
>>> with open('blah.pkl', 'wb') as s:
...   dill.dump((Foo, b), s)
... 
>>> 

It's still not a single line, but it works.

Python 2.7.9 (default, Dec 11 2014, 01:21:43) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> with open('blah.pkl', 'rb') as s:
...   Foo, b = dill.load(s)
... 
>>> b.x  
5
>>> Foo.bar(b, 2)
7

So, within dill, there's dill.source, and that has methods that can detect dependencies of functions and classes, and take them along with the pickle (for the most part).

>>> def foo(x):
...   return x*x
... 
>>> class Bar(object):
...   def zap(self, x):
...     return foo(x) * self.x
...   x = 3
... 
>>> print dill.source.importable(Bar.zap, source=True)
def foo(x):
  return x*x
def zap(self, x):
  return foo(x) * self.x

So that's not "perfect" (or maybe not what's expected)… but it does serialize the code for a dynamically built method and its dependencies. You just don't get the rest of the class -- but the rest of the class is not needed in this case. Still, it doesn't seem like what you wanted.

If you wanted to get everything, you could just pickle the entire session. And in one line (two counting the import).

>>> import dill
>>> def foo(x):
...   return x*x
... 
>>> class Blah(object):
...   def bar(self, x):
...     self.x = (lambda x:foo(x)+self.x)(x)
...   x = 2
... 
>>> b = Blah()
>>> b.x
2
>>> b.bar(3)
>>> b.x
11
>>> # the one line
>>> dill.dump_session('foo.pkl')
>>> 

Then on the remote machine...

Python 2.7.9 (default, Dec 11 2014, 01:21:43) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> # the one line
>>> dill.load_session('foo.pkl')
>>> b.x
11
>>> b.bar(2)
>>> b.x
15
>>> foo(3)
9

Lastly, if you want the transport to be "done" for you transparently (instead of using a file), you could use pathos.pp or ppft, which provide the ability to ship objects to a second python server (on a remote machine) or python process. They use dill under the hood, and just pass the code across the wire.

>>> class More(object):
...   def squared(self, x):
...     return x*x
... 
>>> import pathos
>>> 
>>> p = pathos.pp.ParallelPythonPool(servers=('localhost:1234',))
>>> 
>>> m = More()
>>> p.map(m.squared, range(5))
[0, 1, 4, 9, 16]

The servers argument is optional, and here is just connecting to the local machine on port 1234… but if you use the remote machine name and port instead (or as well), you'll fire off to the remote machine -- "effortlessly".

Get dill, pathos, and ppft here: https://github.com/uqfoundation

Mike McKerns
  • Thanks! I hadn't thought about this problem for a while (haven't done so much data exploration lately), but this nicely addressed my original problem. – Eelco Hoogendoorn Jan 23 '15 at 16:12

Pickle can't pickle python code, so I don't think this is possible at all with pickle.

>>> from pickle import *
>>> def A(object):
...     def __init__(self):
...             self.potato = "Hello"
...             print "Starting"
... 
>>> A.__code__
<code object A at 0xb76bc0b0, file "<stdin>", line 1>
>>> dumps(A.__code__)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/pickle.py", line 1366, in dumps
    Pickler(file, protocol).dump(obj)
  File "/usr/lib/python2.6/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.6/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/usr/lib/python2.6/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle code objects
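For comparison, the standard library's `marshal` module does accept code objects. The following is only a sketch of that workaround, not something the answer above proposes: the function `square` and the name `square_again` are made up for illustration, and the marshal format is tied to the Python version that wrote it.

```python
import marshal
import types


def square(x):
    return x * x


# marshal can serialise the code object that pickle rejects.
# The resulting bytes are specific to the Python version in use.
blob = marshal.dumps(square.__code__)

# Rebuild a function from the stored code object. This simple form
# ignores closures and default arguments.
code = marshal.loads(blob)
square_again = types.FunctionType(code, globals(), "square_again")
```

Calling `square_again(7)` then behaves like the original function, but only on an interpreter compatible with the one that produced `blob`.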
Nick Craig-Wood
  • The first comment here http://stackoverflow.com/questions/2626636/pickling-a-class-definition has a link to a pickling of an entire interpreter state. It's not exactly what I want, but it does seem to do what I'm interested in under the hood. Wouldn't it be possible to automagically grab the string defining any given class, and pickle and eval that later? – Eelco Hoogendoorn Jul 18 '11 at 06:10
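The idea in that last comment can be sketched with the standard library alone. Everything named below (`RECORD_SOURCE`, `Record`, `build_class`, `dump_bundle`, `load_bundle`) is hypothetical, and the sketch assumes the class source is already available as a string and that the instance attributes are plainly picklable. For a class defined in a .py file, `inspect.getsource` could supply that string.

```python
import os
import pickle
import tempfile

# Hypothetical class, kept as a string so it can travel with the data.
RECORD_SOURCE = '''
class Record(object):
    scale = 2

    def scaled(self, x):
        return self.scale * x + self.offset
'''


def build_class(source, name):
    """Execute the stored source and return the class it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def dump_bundle(obj, source, path):
    """Write the class source and the instance attributes to one file."""
    payload = {
        "source": source,
        "class_name": type(obj).__name__,
        "state": dict(obj.__dict__),
    }
    with open(path, "wb") as handle:
        pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_bundle(path):
    """Recreate the class from its source, then restore the instance."""
    with open(path, "rb") as handle:
        payload = pickle.load(handle)
    cls = build_class(payload["source"], payload["class_name"])
    obj = cls.__new__(cls)  # skip __init__; attributes come from the file
    obj.__dict__.update(payload["state"])
    return obj


# Round trip through a temporary file.
Record = build_class(RECORD_SOURCE, "Record")
original = Record()
original.offset = 5

path = os.path.join(tempfile.mkdtemp(), "record.bundle")
dump_bundle(original, RECORD_SOURCE, path)
restored = load_bundle(path)
result = restored.scaled(10)
```

Since loading runs `exec` on whatever source the file contains, this only makes sense for files you wrote yourself, which matches the single-user scenario in the question.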