
I am trying to use pickle to save a few large datasets that I generate from other datasets. Dumping them does not raise any error, but when I try to load them back, pickle exits with an EOFError. Below is the code I run to save the datasets:

from scipy.stats.mstats import mode

trainingSetCustomers = pd.DataFrame({
    'visitFrequency': trainingSet.size(),
    'totalAmountSpent': trainingSet['amountSpent'].sum(),
    'totalProducts': trainingSet['productCount'].sum(),
    'firstVisit': trainingSet['visitDate'].min(),
    'lastVisit': trainingSet['visitDate'].max(),
    'visitType': trainingSet['visitType'].apply(f),
    'country': trainingSet['country'].apply(f),
    'isReferred': trainingSet['isReferred'].sum()
}).reset_index()
p2 = pickle.Pickler(open("trainingSetCustomers.p", "wb")) #finaldatasetYear1AndYear2 #trainingset groupedCustomersWithDates dfOrdersNew groupedCustomersNew
p2.clear_memo()
p2.dump(trainingSetCustomers)
print "Training Set saved" #Done

trainingResultSetCustomers = pd.DataFrame({
    'futureVisitFrequency': trainingResultSet.size(),
    'futureTotalAmountSpent': trainingResultSet['amountSpent'].sum(),
    'futureTotalProducts': trainingResultSet['productCount'].sum(),
    'firstVisit': trainingResultSet['visitDate'].min(),
    'lastVisit': trainingResultSet['visitDate'].max(),
    'visitType': trainingResultSet['visitType'].apply(f),
    'country': trainingResultSet['country'].apply(f),
    'isReferred': trainingResultSet['isReferred'].sum()
}).reset_index()
p3 = pickle.Pickler(open("trainingResultSetCustomers.p", "wb")) #finaldatasetYear1AndYear2 #trainingset groupedCustomersWithDates dfOrdersNew groupedCustomersNew
p3.clear_memo()
p3.dump(trainingResultSetCustomers)
print "trainingresult set saved" #Done

This runs without any errors and prints the messages. But when I run the following code:

trainingResultSetCustomers = pickle.load(open("trainingResultSetCustomers.p", "rb"))

It gives me an EOFError. I need to store four of these test sets, and I am really confused as to why this is happening. I am running the code in an IPython notebook over ssh, if that makes any difference. Also, if I try this with only 5 rows, it works perfectly.

Data structure: as can be seen from the code, this dataframe is generated from the properties of a grouped object.
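
Since a couple of the commenters below asked for a minimal example: here is an invented toy version of how such a grouped aggregate is built. The column names match the real code above, but the data and the function f are placeholders, since the real dataset is too large to post:

import pandas as pd

# Invented toy data; the real dataset is far larger.
df = pd.DataFrame({
    'customerId':   [1, 1, 2, 2, 2],
    'amountSpent':  [10.0, 20.0, 5.0, 7.5, 2.5],
    'productCount': [1, 2, 1, 1, 3],
    'visitDate':    pd.to_datetime(['2013-01-01', '2013-02-01',
                                    '2013-01-15', '2013-03-01', '2013-04-01']),
    'visitType':    ['web', 'store', 'web', 'web', 'store'],
    'country':      ['DE', 'DE', 'US', 'US', 'US'],
    'isReferred':   [0, 1, 0, 0, 1],
})

def f(series):
    # Placeholder for the per-group aggregation used in the real code;
    # here it simply picks the most frequent value in the group.
    return series.value_counts().index[0]

trainingSet = df.groupby('customerId')  # the "grouped object" referred to above

toySetCustomers = pd.DataFrame({
    'visitFrequency':   trainingSet.size(),
    'totalAmountSpent': trainingSet['amountSpent'].sum(),
    'visitType':        trainingSet['visitType'].apply(f),
}).reset_index()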

This is the error I get :

EOFError                                  Traceback (most recent call last)
<ipython-input-10-86d38895c564> in <module>()
      5 p = pickle.Pickler(o) #finaldatasetYear1AndYear2 #trainingset groupedCustomersWithDates dfOrdersNew groupedCustomersNew
      6 p.clear_memo()
----> 7 trainingset = pickle.load(o)
      8 o.close()
      9 print "done"

/usr/lib/python2.7/pickle.pyc in load(file)
   1376 
   1377 def load(file):
-> 1378     return Unpickler(file).load()
   1379 
   1380 def loads(str):

/usr/lib/python2.7/pickle.pyc in load(self)
    856             while 1:
    857                 key = read(1)
--> 858                 dispatch[key](self)
    859         except _Stop, stopinst:
    860             return stopinst.value

/usr/lib/python2.7/pickle.pyc in load_eof(self)
    878 
    879     def load_eof(self):
--> 880         raise EOFError
    881     dispatch[''] = load_eof
    882 
Sudh
  • Can you give a minimal example that we can run? Are you using the regular `pickle` or `jsonpickle`? – user4815162342 Nov 28 '14 at 13:18
  • regular pickle, I can't give a minimal example as the data is too huge. But I can tell you the structure of the data. – Sudh Nov 28 '14 at 14:07
  • Does the problem appear only on this huge dataset, or also on a small one? – user4815162342 Nov 28 '14 at 15:04
  • I did not try it with a small dataset yet. I'll try now and update the question – Sudh Nov 28 '14 at 15:28
  • @user4815162342 Yes it does work fine with 5 rows – Sudh Nov 28 '14 at 16:31
  • Try increasing the number of rows in your "toy" data set until you find the count that starts giving the error. Investigating this might reveal something else you are doing differently between the example that works and the one that doesn't. – user4815162342 Nov 28 '14 at 18:46
  • @user4815162342: Somehow I don't think length is the issue, because I pickled larger dataframes using the same code – Sudh Nov 28 '14 at 19:30

1 Answer


In the absence of test code and version numbers, the only thing I can see is that you are using pandas.DataFrame objects. These often need special handling that is built into pandas' own pickling methods. I believe pandas provides both a to_pickle and a save method, which implement pickling for a DataFrame. See: How to store data frame using PANDAS, Python and the links within.
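
For example, a minimal sketch of the pandas route, reusing the file name from the question (pandas.read_pickle is the matching loader):

import pandas as pd

# Let pandas handle its own serialization instead of driving a raw Pickler.
trainingSetCustomers.to_pickle("trainingSetCustomers.p")

# ...and later, to read it back:
trainingSetCustomers = pd.read_pickle("trainingSetCustomers.p")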

And, depending on how large a DataFrame you are trying to pickle and the versions of your dependencies, you could be running up against a 64-bit pickling bug. See: Pickling a DataFrame.

Also, if you are sending serialized data through ssh, you might want to check that you aren't running into some sort of ssh packet limitation. If you are just executing the code over ssh, then this should not be an issue.

Mike McKerns
  • Interestingly, the error went away when I added a separate file.flush() after pickling to the file. I think the problem could also have been solved by to_pickle. I was running the code on the server, so ssh shouldn't be the issue. Accepting your answer as correct, since it moved me in the right direction. – Sudh Nov 29 '14 at 15:44
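
Following up on the flush() observation in the comment above: the likely failure mode is that the question's code never flushes or closes the file handle passed to the Pickler, so the pickle on disk ends up truncated and unpickling hits end-of-file early. A minimal sketch of the usual safeguard, reusing the variable and file names from the question (a with block closes, and therefore flushes, the file automatically):

import pickle

# Writing: the with block guarantees the file is flushed and closed,
# so the bytes on disk form a complete pickle.
with open("trainingSetCustomers.p", "wb") as outfile:
    pickle.dump(trainingSetCustomers, outfile)

# Reading it back:
with open("trainingSetCustomers.p", "rb") as infile:
    trainingSetCustomers = pickle.load(infile)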