
I am using pickle to save an object graph by dumping the root. When I load the root it has all the instance variables and connected object nodes. However, I am also saving all the nodes in a class variable (a dictionary). The class variable is full before being saved, but after I unpickle the data it is empty.

Here is the class I am using:

from urllib import urlopen  # Python 2
import re
import pickle

class Page():

    __crawled = {}  # class attribute, shared by all Page instances

    def __init__(self, title = '', link = '', relatedURLs = []):
        self.__title = title
        self.__link = link
        self.__relatedURLs = relatedURLs
        self.__related = [] 

    @property
    def relatedURLs(self):
        return self.__relatedURLs

    @property
    def title(self):
        return self.__title

    @property
    def related(self):
        return self.__related

    @property
    def crawled(self):
        return self.__crawled

    def crawl(self,url):
        if url not in self.__crawled:
            webpage = urlopen(url).read()
            patFinderTitle = re.compile('<title>(.*)</title>')
            patFinderLink = re.compile('<link rel="canonical" href="([^"]*)" />')
            patFinderRelated = re.compile('<li><a href="([^"]*)"')

            findPatTitle = re.findall(patFinderTitle, webpage)
            findPatLink = re.findall(patFinderLink, webpage)
            findPatRelated = re.findall(patFinderRelated, webpage)
            newPage = Page(findPatTitle,findPatLink,findPatRelated)
            self.__related.append(newPage)
            self.__crawled[url] = newPage
        else:
            self.__related.append(self.__crawled[url])

    def crawlRelated(self):
        for link in self.__relatedURLs:
            self.crawl(link)

I save it like this:

with open('medTwiceGraph.dat','w') as outf:
    pickle.dump(root,outf)

and I load it like this:

def loadGraph(filename): #returns root
    with open(filename,'r') as inf:
        return pickle.load(inf)

root = loadGraph('medTwiceGraph.dat')

All the data loads except for the class variable __crawled.

What am I doing wrong?

Joel Green
  • You don't seem to be using new-style classes: `class Page(object):`. – Blender May 19 '13 at 17:41
  • This did not help, but where can I read more about new-style classes? I just started Python, coming from Java, so I'm not totally sure what I'm doing yet. – Joel Green May 19 '13 at 17:46

4 Answers


Python doesn't really pickle class objects. It simply saves their names and where to find them. From the documentation of pickle:

Similarly, classes are pickled by named reference, so the same restrictions in the unpickling environment apply. Note that none of the class’s code or data is pickled, so in the following example the class attribute attr is not restored in the unpickling environment:

class Foo:
    attr = 'a class attr'

picklestring = pickle.dumps(Foo)

These restrictions are why picklable functions and classes must be defined in the top level of a module.

Similarly, when class instances are pickled, their class’s code and data are not pickled along with them. Only the instance data are pickled. This is done on purpose, so you can fix bugs in a class or add methods to the class and still load objects that were created with an earlier version of the class. If you plan to have long-lived objects that will see many versions of a class, it may be worthwhile to put a version number in the objects so that suitable conversions can be made by the class’s __setstate__() method.
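
A quick way to see this behaviour (a standalone sketch, not part of the original answer): pickle an instance, change the class attribute, then unpickle. The loaded object reflects whatever the class says at load time, because the attribute was never in the pickled data.

import pickle

class Foo(object):
    attr = 'a class attr'

data = pickle.dumps(Foo())        # only the instance's own state is serialized
Foo.attr = 'changed after dump'   # simulate a different unpickling environment

g = pickle.loads(data)
print(g.attr)                 # 'changed after dump' -- resolved from the class, not the pickle
print('attr' in g.__dict__)   # False: the class attribute was never part of the instance state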

In your example you could fix the problem by changing __crawled to be an instance attribute or a global variable.
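
For example (a minimal, hypothetical sketch with the class stripped down): if the registry of crawled pages lives on the object you actually pickle, it ends up in that object's __dict__ and survives the round trip.

import pickle

class Page(object):
    def __init__(self, title=''):
        self.title = title
        self.crawled = {}   # instance attribute: stored in __dict__, so it is pickled

root = Page('root')
root.crawled['http://example.com'] = Page('example')

restored = pickle.loads(pickle.dumps(root))
print(restored.crawled)     # the dictionary comes back with the root object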

Bakuriu

By default, pickle will only use the contents of self.__dict__ and not self.__class__.__dict__, which is what you think you want.

I say "what you think you want" because unpickling an instance should not mutate class-level state.

If you want to change this behavior, look at __getstate__ and __setstate__ in the docs.
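
A rough sketch of that route (my own illustration, not from the answer): override the two hooks so a snapshot of the class-level dict travels with each pickled instance. Note that restoring it mutates class-level state at load time, which is exactly the caveat above.

import pickle

class Page(object):
    __crawled = {}   # class-level registry; pickle skips it by default

    def __getstate__(self):
        state = self.__dict__.copy()
        # stash a copy of the class attribute alongside the instance state
        state['_crawled_snapshot'] = dict(Page.__crawled)
        return state

    def __setstate__(self, state):
        # re-apply the snapshot to the class, then restore the instance attributes
        Page.__crawled.update(state.pop('_crawled_snapshot', {}))
        self.__dict__.update(state)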

Phil Cooper

For anyone interested: what I did was make a superclass Graph that contains an instance variable __crawled, and I moved my crawling functions into Graph. Page now only contains attributes describing the page and its related pages. I pickle my instance of Graph, which contains all my instances of Page. Here is my code.

from urllib import urlopen
#from bs4 import BeautifulSoup
import re
import pickle

###################CLASS GRAPH####################
class Graph(object):
    def __init__(self,roots = [],crawled = {}):
        self.__roots = roots
        self.__crawled = crawled
    @property
    def roots(self):
        return self.__roots
    @property
    def crawled(self):
        return self.__crawled
    def crawl(self,page,url):
        if url not in self.__crawled:
            webpage = urlopen(url).read()
            patFinderTitle = re.compile('<title>(.*)</title>')
            patFinderLink = re.compile('<link rel="canonical" href="([^"]*)" />')
            patFinderRelated = re.compile('<li><a href="([^"]*)"')

            findPatTitle = re.findall(patFinderTitle, webpage)
            findPatLink = re.findall(patFinderLink, webpage)
            findPatRelated = re.findall(patFinderRelated, webpage)
            newPage = Page(findPatTitle,findPatLink,findPatRelated)
            page.related.append(newPage)
            self.__crawled[url] = newPage
        else:
            page.related.append(self.__crawled[url])

    def crawlRelated(self,page):
        for link in page.relatedURLs:
            self.crawl(page,link)
    def crawlAll(self,obj,limit = 2,i = 0):
        print 'number of crawled pages:', len(self.crawled)
        i += 1
        if i > limit:
            return
        else:
            for rel in obj.related:
                print 'crawling', rel.title
                self.crawlRelated(rel)
            for rel2 in obj.related:
                self.crawlAll(rel2,limit,i)          
    def loadGraph(self,filename):
        with open(filename,'r') as inf:
            return pickle.load(inf)
    def saveGraph(self,obj,filename):
        with open(filename,'w') as outf:
            pickle.dump(obj,outf)
###################CLASS PAGE#####################
class Page(Graph):
    def __init__(self, title = '', link = '', relatedURLs = []):
        self.__title = title
        self.__link = link
        self.__relatedURLs = relatedURLs
        self.__related = []      
    @property
    def relatedURLs(self):
        return self.__relatedURLs 
    @property
    def title(self):
        return self.__title
    @property
    def related(self):
        return self.__related
####################### MAIN ######################
def main(seed):
    print 'doing some work...'
    webpage = urlopen(seed).read()

    patFinderTitle = re.compile('<title>(.*)</title>')
    patFinderLink = re.compile('<link rel="canonical" href="([^"]*)" />')
    patFinderRelated = re.compile('<li><a href="([^"]*)"')

    findPatTitle = re.findall(patFinderTitle, webpage)
    findPatLink = re.findall(patFinderLink, webpage)
    findPatRelated = re.findall(patFinderRelated, webpage)

    print 'found the webpage', findPatTitle

    #root = Page(findPatTitle,findPatLink,findPatRelated)
    G = Graph([Page(findPatTitle,findPatLink,findPatRelated)])
    print 'crawling related...'
    G.crawlRelated(G.roots[0])
    G.crawlAll(G.roots[0])  
    print 'now saving...'
    G.saveGraph(G, 'medTwiceGraph.dat')
    print 'done'
    return G
#####################END MAIN######################

#'http://medtwice.com/am-i-pregnant/'
#'medTwiceGraph.dat'

#G = main('http://medtwice.com/menopause-overview/')
#print G.crawled


def loadGraph(filename):
    with open(filename,'r') as inf:
        return pickle.load(inf)

G = loadGraph('medTwiceGraph.dat')
print G.roots[0].title
print G.roots[0].related
print G.crawled

for key in G.crawled:
    print G.crawled[key].title

Joel Green

Using dill can solve this problem.

dill package: https://pypi.python.org/pypi/dill
Reference: https://stackoverflow.com/a/28543378/6301132

Applied to the asker's code:

# note: the files must be opened in binary mode
import dill

# save
with open('medTwiceGraph.dat','wb') as outf:
    dill.dump(root,outf)

# load
def loadGraph(filename): #returns root
    with open(filename,'rb') as inf:
        return dill.load(inf)

root = loadGraph('medTwiceGraph.dat')

I wrote another example:

#Another example (with Python 3.x)

import dill
import os

class Employee: 

    def __init__(self, name='', contact={}):
        self.name = name
        self.contact = contact

    def print_self(self):
        print(self.name, self.contact)

#save
def save_employees():
    global emp
    with open('employees.dat','wb') as fh:
        dill.dump(emp,fh)

#load
def load_employees():
    global emp
    if os.path.exists('employees.dat'):
        with open('employees.dat','rb') as fh:
            emp=dill.load(fh)

#---
emp=[]
load_employees()
print('loaded:')
for tmpe in emp:
    tmpe.print_self()

e=Employee() #new employee
if len(emp)==0:
    e.name='Jack'
    e.contact={'phone':'+086-12345678'}
elif len(emp)==1:
    e.name='Jane'
    e.contact={'phone':'+01-15555555','email':'a@b.com'}
else:
    e.name='sb.'
    e.contact={'telegram':'x'}
emp.append(e)

save_employees()

nodewee
    I'm the `dill` author. First off, I agree with you -- `dill` should be able to save the object's state correctly. However, I can't give you a +1 because your code snippet above is wrong. It would be better to **show** that `dill` can correctly serialize the OP's object… and what you have will do nothing but throw an error. – Mike McKerns May 06 '16 at 16:50
  • That code snippet is just a hint, but I confess I wasn't rigorous enough. Let me complement my answer. Thanks for your reply and for the dill package! – nodewee May 14 '16 at 12:32