I'm trying out some code I found online for training on a dataset, but I can't resolve the error below.

When I first ran the code, I got the following error:

ValueError  Traceback (most recent call last)
----> 2 knn_cv.fit(X_train, y_train)
<ipython-input-21-fb975450c609> in fit(self, X, y)
214         X = normalize(X, norm='l1', copy=False)
215 
--> 216         cv = check_cv(self.cv, X, y)
/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in check_cv(cv, y, classifier)
1980 
1981     if isinstance(cv, numbers.Integral):
-> 1982         if (classifier and (y is not None) and
1983                 (type_of_target(y) in ('binary', 'multiclass'))):
1984             return StratifiedKFold(cv)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The error is raised inside check_cv, and it looks like y_train is being evaluated as a boolean, but I'm not sure how to fix it. I know this error usually comes from an 'and' applied to an array, which can normally be rewritten, but here the statement sits inside the check_cv function and I don't know how to modify it. I tried the suggested a.any() and a.all(), but each one gives me a new error.

If I use y_train.any(), I get:

    269     if y.ndim > 2 or (y.dtype == object and len(y) and
    270                       not isinstance(y.flat[0], str)):
--> 271         return 'unknown'  # [[[1, 2]]] or [obj_1] and not ["label_1"]
    272 
    273     if y.ndim == 2 and y.shape[1] == 0:

TypeError: len() of unsized object

If I use y_train.all(), I get: TypeError: 'KFold' object is not iterable

Another question suggested converting the array to a list with np.array(y_train).tolist(), but that gives: TypeError: len() of unsized object

I updated sklearn as well, but that didn't fix the error. I'm hoping someone can explain what's wrong and how I can modify the code (with an explanation if possible, since I'm still unfamiliar with this part of the code).

The training sample was created using GoogleNews-vectors-negative300.bin.gz.

y_train = array([ 3, 17, 14, 14, 5, 13,... 0, 1, 17, 16, 2])

y_train.shape = (100,)

X_train = <100x5100 sparse matrix of type '' with 10049 stored elements in Compressed Sparse Row format>

X = check_array(X_train, accept_sparse='csr', copy=True)
print(X)
(0, 679)    1.0
(0, 701)    1.0
(0, 1851)   2.0
(0, 1889)   1.0
(0, 2498)   1.0
(0, 2539)   1.0
(0, 2589)   1.0
(0, 2679)   1.0...

X.shape = (100, 5100)

I've attached the main part of the code below. The full code is available at http://vene.ro/blog/word-movers-distance-in-python.html

def fit(self, X, y):
    if self.n_neighbors_try is None:
        n_neighbors_try = range(1, 6)
    else:
        n_neighbors_try = self.n_neighbors_try

    X = check_array(X, accept_sparse='csr', copy=True)
    X = normalize(X, norm='l1', copy=False)

    cv = check_cv(self.cv, X, y)
    knn = KNeighborsClassifier(metric='precomputed', algorithm='brute')
    scorer = check_scoring(knn, scoring=self.scoring)

    scores = []
    for train_ix, test_ix in cv:
        dist = self._pairwise_wmd(X[test_ix], X[train_ix])
        knn.fit(X[train_ix], y[train_ix])
        scores.append([
            scorer(knn.set_params(n_neighbors=k), dist, y[test_ix])
            for k in n_neighbors_try
        ])
    scores = np.array(scores)
    self.cv_scores_ = scores

    best_k_ix = np.argmax(np.mean(scores, axis=0))
    best_k = n_neighbors_try[best_k_ix]
    self.n_neighbors = self.n_neighbors_ = best_k

    return super(WordMoversKNNCV, self).fit(X, y)

knn_cv = WordMoversKNNCV(cv=3, n_neighbors_try=range(1, 20),
                         W_embed=W_common, verbose=5, n_jobs=3)
knn_cv.fit(X_train, y_train.all())

According to the author, I'm supposed to get this:

[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:   30.8s

[Parallel(n_jobs=3)]: Done  34 out of  34 | elapsed:  2.0min finished

[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:   25.7s

[Parallel(n_jobs=3)]: Done  33 out of  33 | elapsed:  2.9min finished

[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:   53.3s

[Parallel(n_jobs=3)]: Done  33 out of  33 | elapsed:  2.0min finished

WordMoversKNNCV(W_embed=memmap([[ 0.04283, -0.01124, ..., -0.05679, -0.00763],
       [ 0.02884, -0.05923, ..., -0.04744,  0.06698],
   ...,
       [ 0.08428, -0.15534, ..., -0.01413,  0.04561],
       [-0.02052,  0.08666, ...,  0.03659,  0.10445]]),
    cv=3, n_jobs=3, n_neighbors_try=range(1, 20), scoring=None,
    verbose=5)
bonedino
  • Find out which term in the if statement is producing a boolean array. Don't guess. Test. – hpaulj May 22 '19 at 11:38
  • @hpaulj It looks to me like y_train is causing the boolean error, which confuses me since y_train is only a 1-D array. The usual trigger is an 'and' statement, which is typically resolved by replacing 'and' with '&', but in this case it seems to sit inside the check_cv function, which I have no clue how to fix. – bonedino May 23 '19 at 01:52
  • Can you provide a simpler example that still produces the error? That way the question and answer can be useful to other people who run across it. – Kyle May 23 '19 at 02:26
  • This question may help: https://stackoverflow.com/questions/10062954/valueerror-the-truth-value-of-an-array-with-more-than-one-element-is-ambiguous – Kyle May 23 '19 at 02:27
  • Show the sample of `y_train`. What is its shape, what does it contain? – Vivek Kumar May 23 '19 at 08:35
  • @KyleRoth To be honest, I'm not sure how to simplify this. I've also come across similar questions, e.g. https://stackoverflow.com/questions/12647471/the-truth-value-of-an-array-with-more-than-one-element-is-ambigous-when-trying-t The error there is similar: the '&' statement seems to trigger a boolean error, hence the prompt to use a.any() or a.all(). However, in those cases the statement can be modified, as the solutions in both links show, whereas in this code it occurs inside sklearn's check_cv function. – bonedino May 23 '19 at 09:30
  • Another problem is that I've searched the web, and this author is the only one with a full script tutorial for this kind of model training. I understand it's a very specific scenario, and I'm inexperienced in this area of Python, so I'm hoping someone can explain why the error is raised. Is it because the code is outdated, because I'm using Jupyter, or is there something I'm overlooking? – bonedino May 23 '19 at 09:36
  • @VivekKumar I've edited the question and included the y_train sample, hope it sheds light on the issue – bonedino May 23 '19 at 09:38

1 Answer

You are calling check_cv with the wrong arguments. According to the documentation:

check_cv(cv='warn', y=None, classifier=False):

cv : int, 
     cross-validation generator or an iterable, optional

y : array-like, optional
    The target variable for supervised learning problems.

classifier : boolean, optional, default False
             Whether the task is a classification task, 
             in which case stratified KFold will be used

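So it expects y as its second argument and a classifier flag as its third (the estimator is passed for that in the fix; being truthy, it selects the classification branch). You are passing X and y instead, so y_train lands in the classifier slot and is evaluated as a boolean, which raises the error. Change these lines: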

cv = check_cv(self.cv, X, y)
knn = KNeighborsClassifier(metric='precomputed', algorithm='brute')

to:

knn = KNeighborsClassifier(metric='precomputed', algorithm='brute')
cv = check_cv(self.cv, y, knn)

Note the order of the lines: knn has to be created before it is passed to check_cv.
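To see why the misplaced arguments produce this particular message, here is a minimal sketch. The function check_cv_like is a hypothetical stand-in that copies only the `classifier and (y is not None)` condition visible in the traceback; it is not sklearn's implementation, and the arrays are made-up data with the same length as y_train.

```python
import numpy as np


def check_cv_like(cv, y=None, classifier=False):
    # Hypothetical stand-in: mirrors only the condition from the traceback.
    if classifier and (y is not None):
        return "stratified"
    return "plain"


X = np.ones((100, 5))        # made-up features
y = np.arange(100) % 18      # made-up labels, shape (100,)

raised = False
message = ""
try:
    # Same mistake as check_cv(self.cv, X, y): y ends up in the
    # `classifier` slot, so `classifier and ...` asks for bool(y).
    check_cv_like(3, X, y)
except ValueError as exc:
    raised = True
    message = str(exc)

# With y in the second slot and a plain flag in the third, no array is
# ever evaluated as a boolean.
result = check_cv_like(3, y, True)

print(raised, message)
print(result)
```

This also suggests why y_train.any() and y_train.all() only moved the error elsewhere: they collapse the labels into a single boolean, so the later code no longer receives the label array it needs.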

Vivek Kumar
  • Thanks, that worked! But I'm a little confused: since check_cv now only receives y, does that mean I need to concatenate the X_train and y_train datasets? The subsequent code needs X[train_ix] and y[train_ix]. – bonedino May 24 '19 at 06:19
  • @bonedino `check_cv` just checks the type of the cross-validation iterator and whether it is compatible with `y`. It does not affect any other part of your code. – Vivek Kumar May 24 '19 at 07:02
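As a follow-up to the comment above, here is a small sketch of how the object returned by check_cv can supply the train_ix and test_ix indices without concatenating X and y. It assumes sklearn.model_selection, where check_cv returns a splitter object rather than an iterable; the earlier "'KFold' object is not iterable" error suggests the `for train_ix, test_ix in cv:` loop in the question may need `cv.split(X, y)` instead. The data is made up, and `classifier` is passed by keyword as a cautious choice.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

# Made-up data: 90 samples, 3 classes with 30 samples each.
rng = np.random.RandomState(0)
X = rng.rand(90, 4)
y = np.repeat(np.arange(3), 30)

# y goes in the second slot; classifier=True requests stratified folds.
cv = check_cv(3, y, classifier=True)

# The splitter is not iterable itself; split() yields the index arrays.
folds = list(cv.split(X, y))

for train_ix, test_ix in folds:
    # X and y stay separate; both are indexed with the same fold indices.
    X_train_fold, y_train_fold = X[train_ix], y[train_ix]
    X_test_fold, y_test_fold = X[test_ix], y[test_ix]
    print(len(train_ix), len(test_ix))
```

Only the indices come from the splitter, so X[train_ix] and y[train_ix] in the original fit method can stay as they are.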