1

I am learning how to split a DataFrame into training and test samples. I found a solution in another post, but I do not understand some details of the syntax.

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79

msk is an array of booleans. How can msk be used as an index into df, and why does df[msk] return the actual numerical data? My understanding was that the index into df should be a string or an array of strings.

desertnaut
pipi
  • This is a purely `pandas` question, and it has nothing to do with `machine-learning` or `linear-regression` - kindly do not spam irrelevant tags (removed). – desertnaut Feb 23 '19 at 10:16

3 Answers

0

In NumPy and Pandas, an array of booleans which is the same length as the array you're indexing is treated as a "mask," and selects the values where the mask is True.
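As a minimal sketch of that masking behaviour (the names `values`, `mask`, and the column `a` are made up for illustration, not taken from the question):

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30, 40])
mask = np.array([True, False, True, False])  # same length as values

# NumPy keeps only the elements where the mask is True.
selected = values[mask]
print(selected)  # [10 30]

# pandas applies the same idea row by row to a DataFrame.
df = pd.DataFrame({"a": values})
rows = df[mask]
print(rows)
```

The selected rows keep their original index labels (0 and 2 here), which shows that the mask picks rows by position rather than looking up labels.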

John Zwinck
0

From the Pandas documentation on boolean indexing:

You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index

In your example, df[msk] returns the rows of df at the positions where msk is True, and df[~msk] returns the rows where msk is False.
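A short sketch of why the two selections form a clean train/test split. It assumes a seeded `np.random.default_rng` generator instead of the question's `np.random.randn`/`np.random.rand`, only so that the run is repeatable; the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: seeded for reproducibility
df = pd.DataFrame(rng.standard_normal((100, 2)))

# One boolean per row; each is True with probability of roughly 0.8.
msk = rng.random(len(df)) < 0.8

train = df[msk]    # rows where msk is True
test = df[~msk]    # ~ inverts the mask, so these are the remaining rows

# Every row lands in exactly one of the two subsets.
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```

Because each row is assigned independently, the split is only approximately 80/20, which is why the question shows 79 and 21 rather than exactly 80 and 20.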

user2314737
0
import numpy as np

temp = np.array([1, 1, 1, 2, 2, 2])
# Element-wise comparison: yields one boolean per element of temp
print(temp == 1)

Output:
[ True  True  True False False False]

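Each element of temp is compared with 1, and the result is a boolean array of the same length. Your msk is built in the same way by the comparison np.random.rand(len(df)) < 0.8. Indexing with df[msk] is the reverse step: it takes such a boolean array and uses it to select the matching rows.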

This element-wise comparison is a NumPy feature (pandas objects behave the same way). A plain Python list does not support boolean indexing, and comparing a native list with a number simply returns False, because the whole list is compared with the number.
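The difference can be sketched side by side. The variable names below are illustrative only, and the list comprehension is one possible plain-Python substitute for boolean indexing, not something the original answer proposed.

```python
import numpy as np

values = [1, 1, 1, 2, 2, 2]

# A list is compared as a whole object, so the result is a single bool.
list_result = (values == 1)
print(list_result)  # False

# A NumPy array is compared element by element.
array_result = (np.array(values) == 1)
print(array_result)  # [ True  True  True False False False]

# Indexing a list with a list of booleans raises TypeError.
try:
    values[[True, True, True, False, False, False]]
    list_indexing_failed = False
except TypeError:
    list_indexing_failed = True

# A list comprehension gives the filtering effect without NumPy.
kept = [v for v, keep in zip(values, array_result) if keep]
print(kept)  # [1, 1, 1]
```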

pranav dua