I have a dataset with two features, pos_x and pos_y, and I need to draw a scatter plot of the clusters found by DBSCAN. Here is what I have tried:

dataset = pd.read_csv(r'/Users/file_name.csv')
Data = dataset[["pos_x","pos_y"]].to_numpy()
dbscan = DBSCAN()
clusters = dbscan.fit(Data)
p = sns.scatterplot(data=Data, x="pos_x", y="pos_y", hue=clusters.labels_, legend="full", palette="deep")
sns.move_legend(p, "upper right", bbox_to_anchor=(1.17, 1.2), title='Clusters')
plt.show()

However, I get the following error. I would appreciate any help, because as far as I know, the x and y parameters of scatterplot should be given the names of the features.

ValueError: Could not interpret value `pos_x` for parameter `x`
ttina

1 Answer

I think the error is caused by this part of the code:

Data = dataset[["pos_x","pos_y"]].to_numpy()

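Once the DataFrame is converted to a NumPy array, the column names are lost, so seaborn cannot look up "pos_x" and "pos_y" in data.

Keep the DataFrame for plotting and convert to NumPy only when fitting: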

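import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

dataset = pd.read_csv(r'/Users/file_name.csv')
Data = dataset[["pos_x","pos_y"]]  # stays a DataFrame, so seaborn can find the columns by name
dbscan = DBSCAN()
clusters = dbscan.fit(Data.to_numpy())  # convert to NumPy only for fitting
p = sns.scatterplot(data=Data, x="pos_x", y="pos_y", hue=clusters.labels_, legend="full", palette="deep")
sns.move_legend(p, "upper right", bbox_to_anchor=(1.17, 1.2), title='Clusters')
plt.show()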
Iran Ribeiro
  • Thank you for your answer. I did what you suggested, but my dataset is huge (around 2245672 data points), so it gets stuck and takes too long to execute. Is there a more efficient solution? – ttina Jun 24 '22 at 07:12
  • If the problem is with DBSCAN, you could try [this](https://stackoverflow.com/questions/52560683/dbscan-sklearn-is-very-slow). However, there is not much you can do if you really need to plot all of these points. As stated in [this answer](https://stackoverflow.com/questions/4082298/scatter-plot-with-a-huge-amount-of-data), most of the points will likely overlap in the figure. In that case, I recommend trying one of these options: [sampling](https://stackoverflow.com/questions/45092124/scatter-plot-on-large-amount-of-data) – Iran Ribeiro Jun 24 '22 at 13:16
  • or using [joint plots](https://stackoverflow.com/questions/4082298/scatter-plot-with-a-huge-amount-of-data), or [PCA](https://stackoverflow.com/questions/71944846/plot-big-dataset-clusters-in-python). – Iran Ribeiro Jun 24 '22 at 13:17
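
For the sampling option mentioned in the comments, here is a minimal sketch of how a random subset of rows could be picked before plotting. The helper name sample_row_indices, the 50,000-point cap, and the seed are assumptions made for illustration; they are not from the thread. Note that this only reduces the plotting cost: DBSCAN would still run on the full data.

```python
import random


def sample_row_indices(n_rows, max_points=50_000, seed=0):
    """Return sorted row positions for a random subset of at most max_points rows.

    Hypothetical helper; the default cap is an arbitrary illustrative choice.
    """
    if n_rows <= max_points:
        # Small enough already: keep every row.
        return list(range(n_rows))
    rng = random.Random(seed)  # fixed seed so the same subset is drawn each run
    return sorted(rng.sample(range(n_rows), max_points))


# Possible usage with the names from the answer (Data is the DataFrame,
# clusters is the fitted DBSCAN object); shown as comments because it
# needs the real dataset:
# idx = sample_row_indices(len(Data))
# plot_df = Data.iloc[idx].assign(cluster=clusters.labels_[idx])
# sns.scatterplot(data=plot_df, x="pos_x", y="pos_y", hue="cluster",
#                 legend="full", palette="deep")
```

Storing the labels as a column of the sampled DataFrame keeps each point paired with its cluster label, which is easier to get right than passing a separately sliced array as hue.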