I am trying to understand a Udacity linear regression example which includes this:

import numpy as np

data = np.loadtxt('data.csv', delimiter=',')  # known to be a 2-column, many-row array
X = data[:, :-1]  # every column except the last
y = data[:, -1]   # the last column only

So, if I understand, X is a 1-column array capturing all the columns of data except the last one (so in effect capturing the first column only) and y is a 1-column array capturing only the last column of data.

My question is why not write the code this way:

X = data[:,0]
y = data[:,1]

Would it not be clearer / cleaner?

JDelage
  • Make a small test array of your own, and check the results. Are you sure `data` will always be (n,2) shaped? – hpaulj Oct 06 '20 at 22:38
  • `data[:,:-1]` is a 2D array, `data[:,0]` is a 1D array. You need a 2D array for the regression. – DYZ Oct 06 '20 at 22:44
  • @hpaulj - The example only admits a (n,2) shaped array. Of course I suppose they could produce a differently shaped array but then there would be no reason for the relevant columns to be at index 1 and -1... – JDelage Oct 06 '20 at 23:44
  • The use of `X` and `y` suggests that `X` is supposed to be an (n,m) array, and `y` an (n,) array. `X` would be data with `m` features, and `y` the labels. This is a common split in machine learning. We'd have to see the code that uses these variables to expand on that. – hpaulj Oct 06 '20 at 23:52

1 Answer

X is an (n, 1) 2D array because indexing with a slice (`:-1`) keeps that axis, whereas indexing with an integer (`0`) drops it. Equivalent ways to write it would be

X = data[:, :1]
X = data[:, 0, None]
X = data[:, 0].reshape(-1, 1)

y is an (n,) 1D array.

These shapes are likely important for the linear algebra used to implement the regression.
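
To make the shape difference concrete, here is a minimal sketch. The small `data` array is a made-up stand-in for `data.csv`, and the use of `np.linalg.lstsq` with an added intercept column is only an assumption about how a regression might consume `X` and `y`, not the Udacity code itself.

```python
import numpy as np

# Hypothetical stand-in for data.csv: 5 rows, 2 columns.
data = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.9]])

X = data[:, :-1]     # slice on axis 1 keeps the axis  -> shape (5, 1)
y = data[:, -1]      # integer index drops the axis    -> shape (5,)
x_flat = data[:, 0]  # the spelling from the question  -> shape (5,)
print(X.shape, y.shape, x_flat.shape)  # (5, 1) (5,) (5,)

# The three alternative spellings all produce the same (n, 1) array.
alternatives = [data[:, :1], data[:, 0, None], data[:, 0].reshape(-1, 1)]
all_equal = all(np.array_equal(X, a) for a in alternatives)
print(all_equal)  # True

# Assumed usage: least squares wants a 2D design matrix.
A = np.hstack([X, np.ones_like(X)])  # add an intercept column -> shape (5, 2)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Passing the 1D version as the design matrix is rejected.
try:
    np.linalg.lstsq(x_flat, y, rcond=None)
    flat_rejected = False
except (np.linalg.LinAlgError, ValueError):
    flat_rejected = True
print(flat_rejected)  # True
```

The values in `X` and `x_flat` are identical; only the number of axes differs, and that is what the consuming code checks.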

Mad Physicist
  • I don't understand. Both X and y are 1 column x many rows, right? Are you telling me they're treated as different in nature despite this? – JDelage Oct 06 '20 at 23:49
  • @JDelage. `y` is not one column. Broadcasting it to 2D would make it one row with n columns. As another example of where shape matters, if you pass a pair of 2D arrays to `numpy.dot`, it will raise an error if the shapes aren't exactly right. You need to trace through where the quantities are used to understand the purpose of the shapes. – Mad Physicist Oct 07 '20 at 00:12
  • OK, I get it now: my proposition changes the shape of the matrix. In the given example, X is a 1 column x many rows array, whereas what I propose results in a simple 1D array (a vector). The elements are the same but the shape is different and obviously that's critical. – JDelage Oct 10 '20 at 17:47
  • @JDelage. Exactly. Also, broadcasting lines up dimensions on the right, so under certain circumstances `y` would be treated as a row vector. But `np.dot`, for example, would treat the 1D array differently depending on whether it was the first or second argument. There are other weird corner cases, but as you said, dimensions are critical. – Mad Physicist Oct 11 '20 at 05:25