So I have this in Matplotlib.

plt.scatter(X[: , 0:1][Y == 0], X[: , 2:3][Y==0])
plt.scatter(X[: , 0:1][Y == 1], X[: , 2:3][Y==1])
plt.scatter(X[: , 0:1][Y == 2], X[: , 2:3][Y==2])

I'd like to know if there's a better way to loop instead of this:

for i in range(3):
  plt.scatter(X[: , 0:1][Y == i], X[: , 2:3][Y==i])

MCVE:

import numpy as np
import matplotlib.pyplot as plt

# CSV: https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv
data = np.loadtxt('/content/drive/My Drive/Colab Notebooks/Machine Learning/iris.csv', skiprows=1, delimiter=',')

X = data[:, 0:4]
Y = data[:, 4:5]

# Scatter
for i in range(len(np.intersect1d(Y, Y))):
  plt.scatter(X[: , 0:1][Y == i], X[: , 3:4][Y==i])

# map(lambda i: plt.scatter(X[: , 0:1][Y == i], X[: , 2:3][Y==i]), range(3))

plt.title("Scatter Sepal Length / Petal Width ")
plt.legend(('Setosa', 'Versicolor', 'Virginica'))
plt.show()

Sharki
  • This is already very compact. You could map a lambda function over the range(3) iterable to save a line, but this does not have any benefit. What are you trying to make better? I do not see an obvious error. – cmosig May 18 '20 at 20:02
  • Our teacher told us we shouldn't use loops when we're using NumPy, so I assumed that Matplotlib might work like NumPy, and that the method would have some argument that iterates over the increasing "y" for me. How could I do that with map()? – Sharki May 18 '20 at 20:39
  • Something like `map(lambda i: plt.scatter(X[: , 0:1][Y == i], X[: , 2:3][Y==i]), range(3))`. This should work. I have not tested this though. (I like the for-loop more. It looks cleaner.) – cmosig May 18 '20 at 21:07
  • Maybe you should ask specifically for a numpy solution and set the `numpy` tag, if that's what you want :) – cmosig May 18 '20 at 21:10
  • @cmosig It seems it doesn't work :( – Sharki May 18 '20 at 21:23
  • Show the shapes of your data. You can often get away with using the columns, no (external) looping needed. – Mad Physicist May 18 '20 at 21:53
  • X= (150, 4), Y= (150, 1) – Sharki May 18 '20 at 21:55
  • You need to fully specify your problem. On the one hand, you say you want to loop; on the other, you say your teacher told you you shouldn't use loops when using Numpy. Should the graphs remain separate or can they be combined? etc. – jpf May 19 '20 at 11:01
  • Please show an MCVE. Something I can paste into my editor and run as-is. I am pretty sure that this can be done with a one-liner, but I need to see your data. – Mad Physicist May 19 '20 at 14:16
  • @MadPhysicist Sorry for the delay, updated! – Sharki May 19 '20 at 16:54
  • I managed to get a hold of `iris.csv` online, but in future, please post a sample dataset. All you need is 10-15 lines. Don't expect people to have to go offsite to answer your questions. The key is to make the entire thing copy-and-pastable from your question. – Mad Physicist May 19 '20 at 17:15
  • The CSV is in my drive, so there's not much I can do as far as I know, and if I can, I'm sorry. I even put the dataset in a comment. Also, sorry for my bad English. – Sharki May 19 '20 at 17:16
  • @Sharki. An MCVE means extracting a small piece of data that is *representative* of the actual problem. It does not mean copy and pasting the whole problem, data and all. It's an art form that most beginners have trouble with because it requires intuiting the minimum necessary to represent the actual problem, and most beginners have trouble identifying the problem. – Mad Physicist May 20 '20 at 01:15
  • @Sharki. I've updated my secondary answer, and the question it inspired, with what I think is a *much* simpler solution. Turns out your entire code can be written in about 3 lines with no explicit looping. You are allowed to change your selected answer at any time, and I think you should if you like the changes. – Mad Physicist May 20 '20 at 14:12
  • Sorry for the delay, I saw it in the morning but couldn't reply because I had no time to. Thanks a lot for your answer. I'm so sorry for any inconvenience you had. Thanks! – Sharki May 20 '20 at 14:22
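
A likely reason the map() suggestion in the comments appeared not to work: in Python 3, map returns a lazy iterator, so the function is not called until something consumes the result. A minimal sketch of that behavior, using a hypothetical draw function as a stand-in for the plt.scatter call (nothing here touches Matplotlib):

```python
calls = []

def draw(i):
    # Hypothetical stand-in for plt.scatter(...): just records that it ran
    calls.append(i)

lazy = map(draw, range(3))   # map only builds an iterator here
calls_before = list(calls)   # snapshot: draw has not been called yet

for _ in lazy:               # consuming the iterator triggers the calls
    pass
calls_after = list(calls)
```

Forcing the iterator with list(...) or a loop makes it run, at which point a plain for-loop is the clearer choice anyway.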

2 Answers

Probably the simplest way to display your data is with a single plot containing multiple colors.

The key is to label the data more efficiently. You have the right idea with np.intersect1d(Y, Y), but though clever, this is not the best way to get the unique values. Instead, I recommend using np.unique. Not only will that remove the need to hard-code the argument to plt.legend, but the return_inverse argument will give you an integer index for every row, which you can pass directly to the plot as its color values.

A minor point is that you can index single columns with a single index, rather than a slice.

For example,

import numpy as np
import matplotlib.pyplot as plt

X = np.loadtxt('iris.csv', skiprows=1, delimiter=',', usecols=[0, 1, 2, 3])
Y = np.loadtxt('iris.csv', skiprows=1, delimiter=',', usecols=[4], dtype=str)

labels, indices = np.unique(Y, return_inverse=True)
# c= (not color=) maps the integer indices through the colormap
scatter = plt.scatter(X[:, 0], X[:, 2], c=indices)

The array indices indexes into the three unique values in labels to get the original array back. You can therefore supply the index as a label for each element.
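
To make the relationship between labels and indices concrete, here is a small sketch on made-up labels (the five sample values and the name reconstructed are mine, not taken from iris.csv):

```python
import numpy as np

# Hypothetical sample labels standing in for the species column
Y = np.array(["Setosa", "Virginica", "Setosa", "Versicolor", "Virginica"])

# labels: sorted unique values; indices: position in labels for every row
labels, indices = np.unique(Y, return_inverse=True)

# Indexing labels with indices rebuilds the original column
reconstructed = labels[indices]
```

Here indices is an integer array with one entry per row, which is the kind of input a colormapped scatter plot expects.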

Constructing a legend for such a labeled dataset is something that matplotlib fully supports out of the box, as I learned from the question "matplotlib add legend with multiple entries for a single scatter plot", which was inspired by this solution. The gist of it is that the object that plt.scatter returns has a method legend_elements which does all the work for you:

plt.legend(scatter.legend_elements()[0], labels)

legend_elements returns a tuple with two items. The first is a list of handles, one per distinct numerical label, that can be used as the first argument to legend. The second is a list of default text labels based on the numerical labels you supplied. We discard these in favor of our actual text labels.
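
Putting the steps above together, a minimal sketch on a small made-up array rather than iris.csv might look like this. The sample values and the names fig, ax, handles and default_text are my own assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for two iris feature columns and the species column
X = np.array([[5.1, 1.4], [4.9, 1.5], [6.4, 4.5],
              [6.9, 4.9], [6.3, 6.0], [5.8, 5.1]])
Y = np.array(["Setosa", "Setosa", "Versicolor",
              "Versicolor", "Virginica", "Virginica"])

labels, indices = np.unique(Y, return_inverse=True)

fig, ax = plt.subplots()
# c= maps the integer indices through the default colormap
scatter = ax.scatter(X[:, 0], X[:, 1], c=indices)

# One handle per distinct index; replace the default text with the real names
handles, default_text = scatter.legend_elements()
ax.legend(handles, labels.tolist())
```

Call plt.show() afterwards to display the figure.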

Mad Physicist

You can do a much better job with the indexing by splitting the data properly.

The indexing expression X[:, 0:1][Y == n] extracts a view of the first column of X and then applies the boolean mask Y == n to that view. With a one-dimensional Y, both steps can be done more concisely as a single step: X[Y == n, 0]. Either way, this is a bit inefficient, since the mask has to be recomputed for every unique value in Y.
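
As a quick check that the two forms select the same values, here is a sketch on a small made-up array (the data and the names two_step and one_step are mine). Note that the two-step form needs a mask with the same (n, 1) shape as the column slice, while the one-step form needs a 1-D mask:

```python
import numpy as np

X = np.arange(12.0).reshape(6, 2)   # hypothetical 6x2 feature array
Y = np.array([0, 0, 1, 1, 2, 2])    # 1-D labels

# Two steps: slice a (6, 1) column, then mask it with a (6, 1) boolean array
two_step = X[:, 0:1][Y[:, None] == 1]

# One step: 1-D boolean mask on rows, integer index on columns
one_step = X[Y == 1, 0]
```

Both results are 1-D arrays holding the first-column values of the rows labeled 1.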

My other solution called for np.unique to group the labels. But np.unique works by sorting the array. We can do that ourselves:

X = np.loadtxt('iris.csv', skiprows=1, delimiter=',', usecols=[0, 1, 2, 3])
Y = np.loadtxt('iris.csv', skiprows=1, delimiter=',', usecols=[4], dtype=str)

ind = np.argsort(Y)
X = X[ind, :]
Y = Y[ind]

To find where Y changes, you can apply an operation like np.diff, but tailored to strings:

diffs = Y[:-1] != Y[1:]

The mask can be converted to split indices with np.flatnonzero:

inds = np.flatnonzero(diffs) + 1

And finally, you can split the data:

data = np.split(X, inds, axis=0)

For good measure, you can even convert the split data into a dictionary instead of a list:

labels = np.concatenate(([Y[0]], Y[inds]))
data = dict(zip(labels, data))
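
The whole sort-and-split sequence can be traced on a tiny made-up dataset. The values and the names order, boundaries and groups below are mine, chosen for illustration:

```python
import numpy as np

# Hypothetical data: 5 rows, 2 features, unsorted string labels
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])
Y = np.array(["b", "a", "c", "a", "b"])

# Sort rows so that equal labels become contiguous
order = np.argsort(Y, kind="stable")
X_sorted, Y_sorted = X[order], Y[order]

# Positions where the label changes mark the start of each new group
boundaries = np.flatnonzero(Y_sorted[:-1] != Y_sorted[1:]) + 1
groups = np.split(X_sorted, boundaries, axis=0)

# First label plus the label at every boundary gives one name per group
labels = np.concatenate(([Y_sorted[0]], Y_sorted[boundaries]))
data = dict(zip(labels.tolist(), groups))
```

I used kind="stable" so rows with the same label keep their original relative order; the answer's plain np.argsort(Y) groups the rows just as well.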

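You can still plot with a loop, but it is much more efficient now, since each group is already a contiguous block of rows and no mask has to be computed per label.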

for label, group in data.items():
    plt.scatter(group[:, 0], group[:, 2], label=label)
plt.legend()  # the label= passed to each scatter call supplies the legend text
Mad Physicist