
I have the following

(Pdb) training
array(<418326x223957 sparse matrix of type '<type 'numpy.float64'>'
    with 165657096 stored elements in Compressed Sparse Row format>, dtype=object)
(Pdb) training.shape
()

Why is there no shape information?

EDIT: this is what I did:

training, target, test, projectids = generate_features(outcomes, projects, resources)
target = np.array([1. if i == 't' else 0. for i in target])
projectids = np.array([i for i in projectids])

print 'vectorizing training features'
d = DictVectorizer(sparse=True)
training = d.fit_transform(training[:10].T.to_dict().values())
#test_data = d.fit_transform(training.T.to_dict().values())
test_data = d.transform(test[:10].T.to_dict().values())

print 'training shape: %s, %s' %(training.shape[0], training.shape[1])
print 'test shape: %s, %s' %(test_data.shape[0], test_data.shape[1])

print 'saving vectorized instances'
with open(filename, "wb") as f:
    np.save(f, training)
    np.save(f, test_data)
    np.save(f, target)
    np.save(f, projectids)

At this point, the shape of training was still (10, 121).

Later on, I reload the 4 variables with

with open("../data/f1/training.dat", "rb") as f:
    training = np.load(f)
    test_data = np.load(f)
    target = np.load(f)
    projectids = np.load(f)

but the shape was gone.

emesday
goh
  • You have to provide more context. There is not enough information to infer what is going on. At the very least, show the code you wrote to initialize the classifier and the training data. – lightalchemist Jul 04 '14 at 02:55
  • Sparse matrices aren't NumPy arrays. They're not even considered arraylikes; most NumPy routines have no idea what to do with one. It might be useful to look at http://stackoverflow.com/questions/8955448/save-load-scipy-sparse-csr-matrix-in-portable-data-format – user2357112 Jul 04 '14 at 04:47
  • It appears in this case that `numpy` does know what to do with a `sparse` matrix - wrap it in an object array. I'm using numpy 1.9dev if that makes a difference. – hpaulj Jul 04 '14 at 15:40
  • user2357112 is correct. I have modified and used pickle instead – goh Jul 07 '14 at 08:55

1 Answer


There is shape information in

array(<418326x223957 sparse matrix of type '<type 'numpy.float64'>'
    with 165657096 stored elements in Compressed Sparse Row format>, dtype=object)

This is a 0-dimensional array holding a single item, hence the shape (). The array has dtype=object, and its one item is the sparse matrix, whose dimensions appear in the display: 418326x223957.
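A small sketch of that wrapping, using a tiny 3x3 matrix as a stand-in for the real data (the names m and wrapped are only for illustration). np.asanyarray is used here on the assumption that it mirrors the conversion np.save applies to its argument:

```python
import numpy as np
from scipy import sparse

# Tiny stand-in for the real training matrix (illustrative only).
m = sparse.csr_matrix(np.eye(3))

# NumPy does not treat a sparse matrix as array-like, so the conversion
# produces a 0-d object array whose single element is the matrix itself.
wrapped = np.asanyarray(m)

print(wrapped.shape)         # shape of the wrapper
print(wrapped.dtype)         # dtype of the wrapper
print(wrapped.item().shape)  # shape of the sparse matrix inside
```

The wrapper reports an empty shape, while the matrix inside still carries its own.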

I was going to ask about DictVectorizer and fit_transform, but that doesn't matter. It's the save and load operation that changes the values.

My guess is that you are not loading the file that you just wrote.


Your np.save(f,training) is wrapping the sparse matrix in an np.array with dtype object. That's what you see on load.

training = training.item()

takes the sparse matrix out of that array wrapper.
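Here is a minimal round trip that should reproduce the behaviour. It is only a sketch: it writes to an in-memory buffer instead of a file, uses a small made-up matrix, and passes allow_pickle=True, which recent NumPy versions require when loading object arrays (the NumPy of the original post did not need it).

```python
import io

import numpy as np
from scipy import sparse

# Small made-up matrix standing in for `training`.
original = sparse.csr_matrix(np.arange(6, dtype=float).reshape(2, 3))

buf = io.BytesIO()        # in-memory stand-in for the file on disk
np.save(buf, original)    # stored as a pickled 0-d object array
buf.seek(0)

# Object arrays are pickled, so newer NumPy needs allow_pickle=True here.
loaded = np.load(buf, allow_pickle=True)
print(loaded.shape)       # the 0-d wrapper has no dimensions

recovered = loaded.item() # unwrap the sparse matrix
print(recovered.shape)    # the matrix shape is back
```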

Is 418326x223957 the shape of training with the full data set, and (10, 121) the shape for a reduced debugging set?
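If your SciPy is recent enough to provide scipy.sparse.save_npz and load_npz, those avoid the object-array wrapper altogether. The following is a sketch under that assumption, again with a small made-up matrix and an in-memory buffer in place of a real file:

```python
import io

import numpy as np
from scipy import sparse

# Small made-up matrix standing in for `training`.
training = sparse.csr_matrix(np.arange(6, dtype=float).reshape(2, 3))

buf = io.BytesIO()               # a filename would work here as well
sparse.save_npz(buf, training)   # stores the data, indices and indptr arrays
buf.seek(0)

restored = sparse.load_npz(buf)  # comes back as a sparse matrix, not a wrapper
print(restored.shape)
```

Each sparse matrix goes into its own file this way; the dense arrays such as target and projectids can still be written with np.save.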

hpaulj