Passing pandas/NumPy arrays as feature vectors in scikit-learn?

Question

I have a vector of 5 different values that I use as each sample, and the label is a single integer: 0, 1, or 3. The machine learning algorithms work when I pass an array as a sample, but I get the warning shown below. How do I pass feature vectors without triggering this warning?

import numpy as np
from numpy import random

from sklearn import neighbors
from sklearn.model_selection import train_test_split
import pandas as pd

filepath = 'test.csv'

# example label values
index = [0,1,3,1,1,1,0,0]

# example sample arrays
data = []
for i in range(len(index)):
    d = []
    for i in range(6):
        d.append(random.randint(50,200))
    data.append(d)

feat1 = 'brightness'
feat2, feat3, feat4 = ['h', 's', 'v']
feat5 = 'median hue'
feat6 = 'median value'

features = [feat1, feat2, feat3, feat4, feat5, feat6]

df = pd.DataFrame(data, columns=features, index=index)
df.index.name = 'state'

with open(filepath, 'a') as f:
    df.to_csv(f, header=f.tell() == 0)

states = pd.read_csv(filepath, usecols=['state'])

df_partial = pd.read_csv(filepath, usecols=features)

states = states.astype(np.float32)
states = states.values
labels = states

samples = np.array([])
for i, row in df_partial.iterrows():
    r = row.values
    samples = np.vstack((samples, r)) if samples.size else r

n_neighbors = 5

test_size = .3
labels, test_labels, samples, test_samples = train_test_split(labels, samples, test_size=test_size)
clf1 = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
clf1 = clf1.fit(samples, labels)

score1 = clf1.score(test_samples, test_labels)

print("Here's how the models performed \nknn: %d %%" %(score1 * 100))

Warning:

"DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). clf1 = clf1.fit(samples, labels)"

sklearn documentation for fit(self, X, Y)

Honey Gourami · Accepted Answer · 2019-07-23T01:59:36.147

2

Try replacing

`states = states.values` with `states = states.values.flatten()`

OR

`clf1 = clf1.fit(samples, labels)` with `clf1 = clf1.fit(samples, labels.flatten())`.

`states = states.values` holds the correct labels from your pandas DataFrame, but as a column vector of shape (n_samples, 1): each label sits on its own row. Calling .flatten() collapses that into a one-dimensional array of shape (n_samples,), which is the shape the warning asks for. (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.ndarray.flatten.html)

In scikit-learn's KNeighborsClassifier documentation (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), the example passes the labels as a flat, one-dimensional sequence: y = [0, 0, 1, 1].
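To make the shape difference concrete, here is a minimal sketch. The small `states` DataFrame is a hypothetical stand-in for the 'state' column read from test.csv, not the asker's actual file.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv(filepath, usecols=['state']).
states = pd.DataFrame({"state": [0, 1, 3, 1, 1, 1, 0, 0]})

# .values on a one-column DataFrame is 2-D: one row per label, one column.
column_vector = states.values
print(column_vector.shape)  # (8, 1)

# .flatten() collapses it to the 1-D shape that fit() expects for y.
flat_labels = states.values.flatten()
print(flat_labels.shape)  # (8,)

# Selecting the column as a Series also yields a 1-D array.
series_labels = states["state"].values
print(series_labels.shape)  # (8,)
```

Either 1-D form can be passed as the labels argument of `fit`.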

edited Jul 23 '19 at 01:59

answered Jul 23 '19 at 01:32

Honey Gourami


Also I'm really confused by the documentation. What does it mean about the x input? That it is a 2d array of n_samples x n_samples or something else? or is it literally just a list [n_samples, n_samples] like it says? I put a screenshot of the sklearn documentation in my question above ^^^ – Ev C Jul 23 '19 at 19:16

You are very welcome! Regarding the **X input**, it is an array/matrix that holds points with **n_features** values each. In your case, each point has 6 features (_Brightness_, _h_, _s_, _v_, _median hue_, and _median value_), so **n_features = 6**. Your *X* therefore holds 28 points with 6 features each, so its shape [n_samples, n_features] is [28, 6]. Try adding `print(samples)` right before `clf1 = clf1.fit(samples, labels)` in your code. It will help you visualize it better. – Honey Gourami Jul 24 '19 at 00:07
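A small sketch of the shapes described in this comment, using randomly generated stand-in data (the 28 x 6 size comes from the comment; the values themselves are made up for illustration). The warnings filter turns a DataConversionWarning into an error, so the fit would fail loudly if y had the wrong shape.

```python
import warnings

import numpy as np
from sklearn.exceptions import DataConversionWarning
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# X: 28 samples (rows), 6 features (columns) -> shape [n_samples, n_features].
X = rng.integers(50, 200, size=(28, 6))
# y: one label per sample, as a 1-D array of shape (n_samples,).
y = rng.choice([0, 1, 3], size=28)

print(X.shape)  # (28, 6)
print(y.shape)  # (28,)

clf = KNeighborsClassifier(n_neighbors=5, weights="distance")
with warnings.catch_warnings():
    # Escalate the shape warning to an exception for this fit.
    warnings.simplefilter("error", DataConversionWarning)
    clf.fit(X, y)

# Prediction input must also be 2-D: [n_query_samples, n_features].
predictions = clf.predict(X[:3])
print(predictions.shape)  # (3,)
```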

score 0 · Answer 2 · edited Jul 23 '19 at 13:13


When you retrieve the data from the `states` DataFrame, it comes back as a column vector (one label per row, shape (n_samples, 1)), whereas scikit-learn expects a one-dimensional array of shape (n_samples,).

You can also use the ravel() function, which returns a contiguous flattened array.

numpy.ravel(a, order='C'): returns a contiguous flattened array (a 1-D array containing all the elements of the input array, with the same dtype).

Try:

`states = states.values.ravel()` in place of `states = states.values`
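As a brief illustration of the `order` parameter mentioned above, here is a sketch on small made-up arrays. For a single-column label array the order makes no difference, since there is only one column to walk.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# order='C' (the default) reads row by row.
c_order = np.ravel(a, order="C")
print(c_order)  # [1 2 3 4 5 6]

# order='F' reads column by column.
f_order = np.ravel(a, order="F")
print(f_order)  # [1 4 2 5 3 6]

# A column vector of labels flattens to the same 1-D array either way.
col = np.array([[0], [1], [3]])
print(col.ravel())           # [0 1 3]
print(col.ravel(order="F"))  # [0 1 3]
```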

edited Jul 23 '19 at 13:13

Paul Dawson


answered Jul 23 '19 at 12:08

SUN


So ravel() and flatten() are the same thing essentially? – Ev C Jul 23 '19 at 18:37
I did a little research on this. Although ravel() and flatten() both convert an ndarray to a 1-D array, they differ in one respect. ravel() returns a view of the original array whenever possible, so changes made through the result are reflected in the original; it copies only when the memory layout requires it. flatten() always returns a copy, so changes to the result do not affect the original array. Because ravel() usually avoids copying the data, it is typically faster than flatten(). – SUN Jul 24 '19 at 09:40
References : https://www.geeksforgeeks.org/differences-flatten-ravel-numpy/ https://stackoverflow.com/questions/28930465/what-is-the-difference-between-flatten-and-ravel-functions-in-numpy – SUN Jul 24 '19 at 09:40
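The view-versus-copy difference discussed in these comments can be checked directly. This is a minimal sketch on made-up arrays; `np.shares_memory` reports whether two arrays overlap in memory.

```python
import numpy as np

column = np.arange(4, dtype=np.float32).reshape(4, 1)

raveled = column.ravel()      # view when the layout allows it
flattened = column.flatten()  # always an independent copy

print(np.shares_memory(column, raveled))    # True
print(np.shares_memory(column, flattened))  # False

# Writing through the raveled view changes the original array;
# the flattened copy keeps the old value.
raveled[0] = 99
print(column[0, 0])   # 99.0
print(flattened[0])   # 0.0

# ravel() falls back to a copy when a view is not possible,
# e.g. for a transposed (non C-contiguous) array.
transposed = np.arange(6).reshape(2, 3).T
print(np.shares_memory(transposed, transposed.ravel()))  # False
```

For passing labels to `fit`, either call works; the distinction matters only if the flattened array is modified afterwards or the data is large.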
