Data set appears to have dim 3, and estimator expected is <= 2

Question

I'm testing the following code sample.

# Load the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Plot styling
import seaborn as sns; sns.set()  # for plot styling
# %matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
#Read the csv file
dataset=pd.read_csv('C:\\my_path\\CLV.csv')
#Explore the dataset
dataset.head()  # top 5 rows
len(dataset) # of rows
#descriptive statistics of the dataset
dataset.describe().transpose()


#Visualizing the data - displot
plot_income = sns.distplot(dataset["INCOME"])
plot_spend = sns.distplot(dataset["SPEND"])
plt.xlabel('Income / spend')


#Violin plot of Income and Spend
f, axes = plt.subplots(1,2, figsize=(12,6), sharex=True, sharey=True)
v1 = sns.violinplot(data=dataset, x='INCOME', color="skyblue",ax=axes[0])
v2 = sns.violinplot(data=dataset, x='SPEND',color="lightgreen", ax=axes[1])
v1.set(xlim=(0,420))



#Using the elbow method to find the optimum number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1,11),wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()


##Fitting kmeans to the dataset with k=4
km4=KMeans(n_clusters=4,init='k-means++', max_iter=300, n_init=10, random_state=0)
y_means = km4.fit_predict(X)
#Visualizing the clusters for k=4
plt.scatter(X[y_means==0,0],X[y_means==0,1],s=50, c='purple',label='Cluster1')
plt.scatter(X[y_means==1,0],X[y_means==1,1],s=50, c='blue',label='Cluster2')
plt.scatter(X[y_means==2,0],X[y_means==2,1],s=50, c='green',label='Cluster3')
plt.scatter(X[y_means==3,0],X[y_means==3,1],s=50, c='cyan',label='Cluster4')
plt.scatter(km4.cluster_centers_[:,0], km4.cluster_centers_[:,1],s=200,marker='s', c='red', alpha=0.7, label='Centroids')
plt.title('Customer segments')
plt.xlabel('Annual income of customer')
plt.ylabel('Annual spend from customer on site')
plt.legend()
plt.show()


########
The plot shows the distribution of the 4 clusters. We could interpret them as the following customer segments:
1.  Cluster 1: Customers with medium annual income and low annual spend
2.  Cluster 2: Customers with high annual income and medium to high annual spend
3.  Cluster 3: Customers with low annual income
4.  Cluster 4: Customers with medium annual income but high annual spend
Cluster 4 is immediately a potential customer segment. However, Clusters 2 and 3 can be segmented further to arrive at a more specific target customer group. Let us now look at how the clusters are created when k=6:
########


##Fitting kmeans to the dataset - k=6
km4=KMeans(n_clusters=6,init='k-means++', max_iter=300, n_init=10, random_state=0)
y_means = km4.fit_predict(X)
#Visualizing the clusters
plt.scatter(X[y_means==0,0],X[y_means==0,1],s=50, c='purple',label='Cluster1')
plt.scatter(X[y_means==1,0],X[y_means==1,1],s=50, c='blue',label='Cluster2')
plt.scatter(X[y_means==2,0],X[y_means==2,1],s=50, c='green',label='Cluster3')
plt.scatter(X[y_means==3,0],X[y_means==3,1],s=50, c='cyan',label='Cluster4')
plt.scatter(X[y_means==4,0],X[y_means==4,1],s=50, c='magenta',label='Cluster5')
plt.scatter(X[y_means==5,0],X[y_means==5,1],s=50, c='orange',label='Cluster6')
plt.scatter(km4.cluster_centers_[:,0], km4.cluster_centers_[:,1],s=200,marker='s', c='red', alpha=0.7, label='Centroids')
plt.title('Customer segments')
plt.xlabel('Annual income of customer')
plt.ylabel('Annual spend from customer on site')
plt.legend()
plt.show()

When I get to these lines:

from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

I get the following error:

ValueError: Found array with dim 3. Estimator expected <= 2.

The data is available for download here:

https://github.com/sowmyacr/kmeans_cluster/blob/master/CLV.csv

The data set should have 2 dimensions but Python seems to think it has 3, for some reason. Can someone explain what's going on here? Also, how do I fix this? Thanks.

[scikit-learn expects 2D NumPy arrays as the training dataset for a fit function. The dataset you are passing in is a 3D array; you need to reshape it into a 2D array.](https://stackoverflow.com/questions/34972142/sklearn-logistic-regression-valueerror-found-array-with-dim-3-estimator-expec?answertab=votes#tab-top) — Basile, Jun 11 '19 at 21:40
That's all I can point you to because the definition of `X` used in `km.fit(X)` is missing. Also your code says `km.fit(X,y)` which must be a typo — Basile, Jun 11 '19 at 22:10
Yeah, that was a mistake; I just fixed it. How can it be a 3D array? I have rows and columns; this is 2D. I still don't understand what's wrong here. Even if I run this: dataset.shape. I get this: (303, 2) — ASH, Jun 11 '19 at 22:15
Why are you including all those plot commands when the error does not occur there? Focus the question and the code. Show the traceback so we can see clearly where the error occurs. If we think the problem occurs in the `fit(X)` call, then we need to look at `X`. I don't see where that's created. — hpaulj, Jun 11 '19 at 23:03
The code above runs fine in Python 3, without the error mentioned. — Abhishek Kumar, Jun 12 '19 at 05:41
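
Since the definition of `X` is not shown in the question, one way to narrow this down is to inspect `X.ndim` and `X.shape` right before `km.fit(X)`. The sketch below is only an illustration: the sample values and the extra-brackets mistake are hypothetical, not taken from the question. It shows one way a two-column table can end up as a 3-D array, and how `reshape` brings it back to the 2-D `(n_samples, n_features)` layout that scikit-learn estimators expect.

```python
import numpy as np

# Hypothetical sample values standing in for the INCOME and SPEND columns.
x1 = np.array([233, 250, 204])
x2 = np.array([150, 187, 172])

# Hypothetical mistake: the extra pair of brackets adds a leading axis,
# so the result is 3-D even though the data is just rows and columns.
X_bad = np.array([list(zip(x1, x2))])
print(X_bad.shape, X_bad.ndim)  # (1, 3, 2) 3

# Collapse the extra axis: -1 lets NumPy infer the number of rows.
X_ok = X_bad.reshape(-1, 2)
print(X_ok.shape, X_ok.ndim)  # (3, 2) 2
```

Note that `dataset.shape` describes the DataFrame, not `X`, so a result of `(303, 2)` there does not rule out `X` having three dimensions.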

score 0 · Answer 1 · answered Jun 12 '19 at 17:39

I got it working with this:

x1 = np.array(dataset["INCOME"])
x2 = np.array(dataset["SPEND"])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
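
As a possible alternative to the `zip` and `reshape` construction, the same 2-D array can probably be taken straight from the DataFrame by selecting both columns at once. This is a minimal sketch: the small DataFrame below is a hypothetical stand-in for the CSV and only reuses the `INCOME` and `SPEND` column names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV, with the same two column names.
dataset = pd.DataFrame({
    "INCOME": [233, 250, 204, 236],
    "SPEND": [150, 187, 172, 178],
})

# Selecting with a list of column names keeps the result two-dimensional:
# one row per customer, one column per feature.
X = dataset[["INCOME", "SPEND"]].to_numpy()

# The zip-based construction, for comparison.
x1 = np.array(dataset["INCOME"])
x2 = np.array(dataset["SPEND"])
X_zip = np.array(list(zip(x1, x2))).reshape(len(x1), 2)

print(X.shape, X.ndim)
print(np.array_equal(X, X_zip))
```

Both constructions should give the same `(n_rows, 2)` array, so either can be passed to `km.fit(X)`.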

So the full script now looks like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
#Plot styling
import seaborn as sns; sns.set()  # for plot styling
# %matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
#Read the csv file
dataset=pd.read_csv('C:\\path_here\\customer_segmentation.csv')
#Explore the dataset
dataset.head()  # top 5 rows
len(dataset) # of rows
#descriptive statistics of the dataset
dataset.describe().transpose()


#Visualizing the data - displot
plot_income = sns.distplot(dataset["INCOME"])
plot_spend = sns.distplot(dataset["SPEND"])
plt.xlabel('Income / spend')


#Violin plot of Income and Spend
f, axes = plt.subplots(1,2, figsize=(12,6), sharex=True, sharey=True)
x1 = sns.violinplot(data=dataset, x='INCOME', color="skyblue",ax=axes[0])
x2 = sns.violinplot(data=dataset, x='SPEND',color="lightgreen", ax=axes[1])
x1.set(xlim=(0,420))


# https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f
# https://pythonprogramminglanguage.com/kmeans-elbow-method/
x1 = np.array(dataset["INCOME"])
x2 = np.array(dataset["SPEND"])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)



#Using the elbow method to find the optimum number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1,11),wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()
