
I need to unpack a pkl file, but since I'm not familiar with pickle and pandas, I'm having a very hard time trying to do that.

The content of the pkl file is something like:

{
'woodi': array([-0.07377538,  0.01810472,  0.03796827, -0.01185564, -0.12605625,
   -0.03709966,  0.07863396,  0.04245366, -0.09158159, -0.01418831,
   -0.03165198, -0.01235643,  0.00833164, -0.08156401, -0.10466748,
    0.11343367, -0.1291647 ,  0.02277501, -0.12230705,  0.08400519,
    0.01631752, -0.03204752, -0.10115118,  0.01796065, -0.08914784,
    0.00336748,  0.02858992,  0.13387977, -0.01711662, -0.05058149,
    0.09866285,  0.00623399, -0.11368696,  0.03389056,  0.03049786,
   -0.11235228,  0.03964651,  0.18348881,  0.00356622, -0.09299972,
    0.11804404,  0.10598116,  0.04603285,  0.10211086, -0.07094006,
    0.19667923, -0.22645354, -0.02930884, -0.21891772, -0.07495865]),
'bad-boy': array([-0.01525861, -0.0145514 ,  0.02207321,  0.01273549,  0.0034881 ,
       -0.00045474,  0.01104943,  0.00057228, -0.01515725,  0.00329882,
        0.01570324, -0.03927545,  0.00393151,  0.00355666, -0.00503297,
       -0.01088151, -0.0354947 , -0.010477  , -0.01945165,  0.0312498 ,
        0.00195288, -0.03095445, -0.00803227,  0.02864361, -0.01416729,
        0.00375061,  0.00546439,  0.03621898,  0.01337988, -0.03205173,
        0.00451094,  0.02180656, -0.02587242, -0.01276209,  0.02721113,
       -0.00075289, -0.00218841,  0.00531534, -0.0074188 ,  0.00312647,
        0.00424174,  0.02444418,  0.0222739 , -0.00477895,  0.02220114,
        0.03402764, -0.02423164,  0.00724037, -0.03526915,  0.01470344]),
...
}

I need to get the words and the real-valued vectors for each word and create a csv file... The content of the csv file must look like:

woodi -0.07377538 0.01810472 ... -0.07495865
bad-boy -0.01525861 -0.0145514 ... 0.01470344

I have tried this Python code:

import pickle
import pandas as pd

fin = 'SGlove.pkl'
fout = 'SGlove.csv'

words, embeddings = pickle.load(open(fin, 'rb'), encoding='latin1')

m, n = embeddings.shape
print("Emebddings contains {} words embedded as vectors of length {}".format(m, n))

df = pd.DataFrame(embeddings)
df.insert(0, "word", words)
df.to_csv(fout, header=False, index=False, sep=" ")

But I get the following error message:

Traceback (most recent call last):
  File "pkl_to_csv.py", line 10, in <module>
    words, embeddings = pickle.load(open(fin, 'rb'), encoding='latin1')
ValueError: too many values to unpack (expected 2)
  • What have you tried so far? Please see how to create a [mcve] – G. Anderson Feb 22 '19 at 22:52
  • Please `print()` what `pickle.load()` is returning then [edit] your question and add it. It's probably returning a dictionary, which is what is causing the error, but it's hard to tell for sure and suggest what to do without the pkl file to test with... – martineau Feb 22 '19 at 23:17
  • Have a glance at [this question](https://stackoverflow.com/questions/4530611/saving-and-loading-objects-and-using-pickle) and its answers. It looks like you're trying to unpickle the file handle, rather than opening the file then using `pickle.load()` to get the contents – G. Anderson Feb 22 '19 at 23:20
  • The pickle file is available here: https://github.com/SenticNet/word-representations-for-sentiment-analysis/tree/master/Results – Jonnathan Carvalho Feb 22 '19 at 23:21

3 Answers


I think the problem is that pickle.load() is returning a Python dictionary and that's causing the ValueError.

I tested this with the SGlove.pkl file you provided a link to, and that premise appears to be true. However, there doesn't seem to be a key in the dictionary that pickle.load() returns that corresponds to 'embeddings', so that has prevented me from taking things any further.

Anyway, the code below shows generally how to extract the two values which (I initially thought) you wanted out of what load() is returning. Could you describe what in the dictionary corresponds to the 'embeddings' key?

Note: I have uploaded a list of the keys that are in the dictionary being returned — here's a link to the text file.

import pickle

fin = 'SGlove.pkl'

data_dict = pickle.load(open(fin, 'rb'), encoding='latin1')

words = data_dict['woodi']            # this just looks up the vector for the single word 'woodi'
embeddings = data_dict['embeddings']  # -> KeyError: 'embeddings' (there is no such key in this file)
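
If, as the sample in the question suggests, the keys of the dictionary are the words themselves and each value is that word's vector, then there is no separate 'embeddings' entry to look up at all. A minimal sketch of writing the CSV straight from the dictionary, assuming every value is a 1-D array of the same length, would be:

import pickle

fin = 'SGlove.pkl'
fout = 'SGlove.csv'

with open(fin, 'rb') as f:
    data_dict = pickle.load(f, encoding='latin1')

# each key is a word, each value is its embedding vector
with open(fout, 'w') as out:
    for word, vector in data_dict.items():
        out.write(word + " " + " ".join(str(x) for x in vector) + "\n")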
martineau
  • I'm trying to open a pickle file and got _pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified. – user5520049 Mar 28 '21 at 14:21
  • @user5520049: See [Persistence of External Objects](https://docs.python.org/3/library/pickle.html#persistence-of-external-objects) in the documentation. – martineau Mar 28 '21 at 15:29
  • I appreciate your reply, so you mean that this error is because the pickled objects have an external persistent ID... If I trained my dataset using a pre-trained model with 10 epochs, does that mean I couldn't read each epoch separately? Or is the data between epochs connected? Each epoch contains a data.pkl file – user5520049 Mar 28 '21 at 19:20
  • @user5520049: All I know is it means there's a persistent ID in the pickle data which means that a user-defined `persistent_id()` method was involved when it was created — and you'll need to supply a `persistent_load()` function when you unpickle it to get the data back. Seems like whoever pickled the data should have provided you with that information and code for the latter function. Sorry, dunno nothin' about training datasets and their epochs. – martineau Mar 28 '21 at 19:47

martineau is most of the way there. pickle.load() returns a dictionary that you need to do additional work on to get the words and embeddings.

You can start with

import pickle

fin = 'SGlove.pkl'

data_dict = pickle.load(open(fin, 'rb'), encoding='latin1')

The list of words is then given by

word_list = list(data_dict.keys())

And you can then get a corresponding list of embeddings using

embedding_list = [data_dict[word] for word in word_list]

If you need a 2D array of embeddings for all words, you need to use np.concatenate or something similar on embedding_list to get one. For example, if you want embeddings to have shape [n_words, len_vector] (as you seem to want), you might use

import numpy as np

embeddings = np.concatenate([item[None, :] for item in embedding_list], axis=0)
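
Putting those pieces together with the pandas code from the question, a sketch of the whole conversion might look like the following (np.stack is used here as a shorthand that is equivalent to the np.concatenate line above; the file names and the "word" column come from the question):

import pickle

import numpy as np
import pandas as pd

fin = 'SGlove.pkl'
fout = 'SGlove.csv'

with open(fin, 'rb') as f:
    data_dict = pickle.load(f, encoding='latin1')

word_list = list(data_dict.keys())
# stack the per-word vectors into a [n_words, len_vector] array
embeddings = np.stack([data_dict[word] for word in word_list])

df = pd.DataFrame(embeddings)
df.insert(0, "word", word_list)
df.to_csv(fout, header=False, index=False, sep=" ")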
Jeremy McGibbon

You can also read the pickle file directly with pandas, like so:

import pandas as pd

data_fname = 'yourFile.pkl'
df = pd.read_pickle(data_fname)  # unpickles whatever object is stored in the file
df.shape
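
Note that read_pickle simply unpickles whatever object is stored in the file, so for a file like SGlove.pkl, which holds a plain dictionary of word -> vector arrays, you get that dictionary back rather than a DataFrame. A minimal sketch of converting such a dictionary and writing the CSV from the question (assuming read_pickle can decode the Python 2 pickle) might be:

import pandas as pd

data_dict = pd.read_pickle('SGlove.pkl')                # the pickled dict itself
df = pd.DataFrame.from_dict(data_dict, orient='index')  # one row per word, vector entries as columns
df.to_csv('SGlove.csv', header=False, sep=" ")          # the index (the words) becomes the first column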
hortajg