15

I'm doing machine learning computations with two dataframes: one for the factors and another for the target values. I have to split both into training and testing parts. I think I've found a way to do it, but I'm looking for a more elegant solution. Here is my code:

import pandas as pd
import numpy as np
import random

df_source = pd.DataFrame(np.random.randn(5,2),index = range(0,10,2), columns=list('AB'))
df_target = pd.DataFrame(np.random.randn(5,2),index = range(0,10,2), columns=list('CD'))

rows = np.asarray(random.sample(range(0, len(df_source)), 2))

df_source_train = df_source.iloc[rows]
df_source_test = df_source[~df_source.index.isin(df_source_train.index)]
df_target_train = df_target.iloc[rows]
df_target_test = df_target[~df_target.index.isin(df_target_train.index)]

print('rows')
print(rows)
print('source')
print(df_source)
print('source train')
print(df_source_train)
print('source_test')
print(df_source_test)

---- edited - solution by unutbu (modified) ---

np.random.seed(2013)
percentile = .6
rows = np.random.binomial(1, percentile, size=len(df_source)).astype(bool)

df_source_train = df_source[rows]
df_source_test = df_source[~rows]
df_target_train = df_target[rows]
df_target_test = df_target[~rows]
Viacheslav Nefedov

5 Answers

15

Below is my solution, which doesn't involve any extra variables.

  1. Use the .sample method to draw a sample of your data
  2. Use the .index attribute of the sample to get its index
  3. Select the matching rows of the second dataframe by that index

For example, say you have X and y and want a sample of 10 rows from each. It should be the same rows in both, of course.

X_sample = X.sample(10)
y_sample = y[X_sample.index]
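
A slightly more explicit variant of the same idea, as a minimal sketch. It assumes both objects share a unique index, uses `.loc` so the selection is by row label whether `y` is a Series or a DataFrame, and takes the test part as everything that was not sampled. The toy data, the sample size and `random_state=0` are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): two frames sharing the same unique index.
idx = range(0, 20, 2)
X = pd.DataFrame(np.random.randn(10, 2), index=idx, columns=list("AB"))
y = pd.DataFrame(np.random.randn(10, 2), index=idx, columns=list("CD"))

# Sample rows without replacement; random_state makes the draw repeatable.
X_train = X.sample(n=6, replace=False, random_state=0)
# Select the same rows of y by label.
y_train = y.loc[X_train.index]

# Everything that was not sampled becomes the test part.
X_test = X.drop(index=X_train.index)
y_test = y.drop(index=X_train.index)
```

Dropping by the sampled index keeps the train and test parts disjoint as long as the index labels are unique.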
letsintegreat
  • Very good solution, but over-reliant on defaults ("Explicit is better than implicit"). So I'd add an explicit `replace=False` in `sample` to be sure we avoid a data leak. Also, it will not work without an explicit `loc` (`y.loc[X_sample.index, :]`); apparently the pandas default axis changed here and axis 1 is now the default. – mirekphd Jan 05 '21 at 10:22
  • The method doesn't work if the index contains duplicate values, in which case `y_sample` might contain multiple rows for a single row in `X_sample` – jjurm May 05 '23 at 01:47
10

If you make rows a boolean array of length len(df), then you can get the True rows with df[rows] and get the False rows with df[~rows]:

import pandas as pd
import numpy as np
import random
np.random.seed(2013)

df_source = pd.DataFrame(
    np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))

rows = np.random.randint(2, size=len(df_source)).astype('bool')

df_source_train = df_source[rows]
df_source_test = df_source[~rows]

print(rows)
# [ True  True False  True False]

# if for some reason you need the index values of where `rows` is True
print(np.where(rows))  
# (array([0, 1, 3]),)

print(df_source)
#           A         B
# 0  0.279545  0.107474
# 2  0.651458 -1.516999
# 4 -1.320541  0.679631
# 6  0.833612  0.492572
# 8  1.555721  1.741279

print(df_source_train)
#           A         B
# 0  0.279545  0.107474
# 2  0.651458 -1.516999
# 6  0.833612  0.492572

print(df_source_test)
#           A         B
# 4 -1.320541  0.679631
# 8  1.555721  1.741279
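
The same mask can be applied to the target dataframe so that both splits stay aligned. In the sketch below, `make_mask` is a hypothetical helper (not part of pandas or NumPy) that marks an exact number of rows as training rows, since `randint` yields a random number of `True` values. The 60% fraction and the seed are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def make_mask(n_rows, train_fraction, rng):
    """Hypothetical helper: boolean mask with round(n_rows * train_fraction) True values."""
    n_train = int(round(n_rows * train_fraction))
    mask = np.zeros(n_rows, dtype=bool)
    # Pick n_train distinct positions at random and mark them as training rows.
    mask[rng.permutation(n_rows)[:n_train]] = True
    return mask

rng = np.random.default_rng(2013)
df_source = pd.DataFrame(rng.standard_normal((5, 2)),
                         index=range(0, 10, 2), columns=list("AB"))
df_target = pd.DataFrame(rng.standard_normal((5, 2)),
                         index=range(0, 10, 2), columns=list("CD"))

rows = make_mask(len(df_source), 0.6, rng)

# One mask, applied to both frames, keeps source and target rows paired.
df_source_train, df_source_test = df_source[rows], df_source[~rows]
df_target_train, df_target_test = df_target[rows], df_target[~rows]
```

Because the mask is positional, this does not depend on the index values at all, only on both frames having the same row order.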
unutbu
3

I like Alexander's answer, but I would add an index reset before sampling. The full code:

# index reset
X.reset_index(inplace=True, drop=True)
y.reset_index(inplace=True, drop=True)
# sampling
X_sample = X.sample(10)
y_sample = y[X_sample.index]

Resetting the index avoids problems when matching rows between the two objects.
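
To illustrate why the reset helps, here is a small sketch with a deliberately duplicated index (toy data, an assumption for the example): label-based selection returns more rows than were taken until the index is reset.

```python
import pandas as pd

# Toy data (assumption): the label 0 appears twice in the index.
X = pd.DataFrame({"A": [1, 2, 3, 4]}, index=[0, 0, 1, 2])
y = pd.Series([10, 20, 30, 40], index=[0, 0, 1, 2])

# Take one row (label 0) to stand in for a sample.
X_dup = X.iloc[[0]]
# Selecting by label pulls both label-0 rows from y: two rows, not one.
y_dup = y.loc[X_dup.index]

# After resetting, labels are the unique positions 0..n-1 and rows match one to one.
X_r = X.reset_index(drop=True)
y_r = y.reset_index(drop=True)
X_sample = X_r.sample(2, random_state=0)
y_sample = y_r.loc[X_sample.index]
```

This version uses the non-inplace form of `reset_index`, which leaves the original `X` and `y` untouched.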

pplonski
1

I like the answers from Alexander and pplonski. I just want to add that selecting rows by these indices might need iloc, as follows:

y_sample = y.iloc[X_sample.index]
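
Note that `iloc` selects by position, so passing index labels to it only works when labels and positions coincide, as they do after `reset_index(drop=True)`. A minimal sketch for a non-default index (toy data, assumed for illustration) that first converts labels to positions with `Index.get_indexer`:

```python
import numpy as np
import pandas as pd

# Toy data (assumption): labels 0, 2, 4, 6, 8 differ from positions 0..4.
idx = range(0, 10, 2)
X = pd.DataFrame(np.random.randn(5, 2), index=idx, columns=list("AB"))
y = pd.Series(np.arange(5), index=idx)

X_sample = X.sample(3, random_state=0)

# Translate the sampled labels into integer positions (requires a unique index).
positions = y.index.get_indexer(X_sample.index)
y_sample = y.iloc[positions]

# Selecting by label gives the same rows.
same = y_sample.equals(y.loc[X_sample.index])
```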
VGonline
0

I think an even simpler solution is:

from sklearn.model_selection import train_test_split

df_source_train, df_source_test, df_target_train, df_target_test = train_test_split(df_source, df_target, train_size=.6)
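
A sketch of the same call with a fixed `random_state` for repeatability (the seed value and the toy data are illustrative assumptions). `train_test_split` applies one row selection to every array passed to it, so the source and target parts stay aligned:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption), shaped like the frames in the question.
df_source = pd.DataFrame(np.random.randn(5, 2),
                         index=range(0, 10, 2), columns=list("AB"))
df_target = pd.DataFrame(np.random.randn(5, 2),
                         index=range(0, 10, 2), columns=list("CD"))

# The same shuffled split is applied to both frames.
df_source_train, df_source_test, df_target_train, df_target_test = train_test_split(
    df_source, df_target, train_size=0.6, random_state=42
)
```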
B. Bogart