Sample two pandas dataframes the same way

Question

I'm doing machine learning computations with two dataframes: one for the factors and another for the target values. I have to split both into training and testing parts, using the same rows for each. I think I've found a way to do it, but I'm looking for a more elegant solution. Here is my code:

import pandas as pd
import numpy as np
import random

df_source = pd.DataFrame(np.random.randn(5,2),index = range(0,10,2), columns=list('AB'))
df_target = pd.DataFrame(np.random.randn(5,2),index = range(0,10,2), columns=list('CD'))

rows = np.asarray(random.sample(range(0, len(df_source)), 2))

df_source_train = df_source.iloc[rows]
df_source_test = df_source[~df_source.index.isin(df_source_train.index)]
df_target_train = df_target.iloc[rows]
df_target_test = df_target[~df_target.index.isin(df_target_train.index)]

print('rows')
print(rows)
print('source')
print(df_source)
print('source train')
print(df_source_train)
print('source_test')
print(df_source_test)

---- edited - solution by unutbu (modified) ---

np.random.seed(2013)
percentile = .6
rows = np.random.binomial(1, percentile, size=len(df_source)).astype(bool)

df_source_train = df_source[rows]
df_source_test = df_source[~rows]
df_target_train = df_target[rows]
df_target_test = df_target[~rows]

score 15 · Answer 1 · edited Feb 17 '19 at 10:59

Below you can find my solution, which doesn't involve any extra variables.

Use the .sample method to draw a sample from the first dataframe
Use the sample's .index attribute to get the labels of the sampled rows
Select the rows with those labels from the second dataframe

For example, say you have X and y and want a sample of 10 rows from each. The same rows must be picked from both, of course:

X_sample = X.sample(10)
y_sample = y[X_sample.index]
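
The same idea can be sketched a little more explicitly. This is only an illustration under stated assumptions: the toy data, the sample size and the `random_state` value are made up, and `X` and `y` are assumed to share an identical index with unique labels. Using `.loc` selects rows by label whether `y` is a Series or a DataFrame, and dropping the sampled labels gives the complementary part.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): features X and targets y with the same non-contiguous index
idx = range(0, 40, 2)
X = pd.DataFrame(np.random.randn(20, 2), index=idx, columns=list("AB"))
y = pd.DataFrame(np.random.randn(20, 2), index=idx, columns=list("CD"))

# Sample rows of X without replacement; random_state makes the draw repeatable
X_sample = X.sample(n=10, replace=False, random_state=0)

# Select the same rows of y by index label
y_sample = y.loc[X_sample.index]

# The rows that were not sampled form the complementary part
X_rest = X.drop(X_sample.index)
y_rest = y.drop(X_sample.index)
```

Here `X_sample`/`y_sample` could serve as the training part and `X_rest`/`y_rest` as the testing part.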

answered Feb 17 '19 at 08:15 by Alexander Tverdohleb · edited Feb 17 '19 at 10:59 by letsintegreat

Very good solution, but over-reliant on defaults ("Explicit is better than implicit"). I'd add an explicit `replace=False` in `sample` to be sure we avoid data leakage. Also, it will not work without an explicit `loc` (`y.loc[X_sample.index, :]`); apparently the pandas default axis changed here and axis 1 is now the default. – mirekphd Jan 05 '21 at 10:22
The method doesn't work if the index contains duplicate values, in which case `y_sample` might contain multiple rows for a single row in `X_sample` – jjurm May 05 '23 at 01:47
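
The duplicate-index caveat from the comment above can be reproduced with a small made-up example. One possible workaround, shown here as a sketch rather than a recommendation from the thread, is to draw integer positions once and apply them to both objects with `.iloc`; the data and the seed are hypothetical, and both objects are assumed to have the same row order.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with a duplicated index label (0 appears twice)
X = pd.DataFrame({"A": [1, 2, 3, 4]}, index=[0, 0, 1, 2])
y = pd.Series([10, 20, 30, 40], index=[0, 0, 1, 2])

# Label-based lookup returns every row carrying the label, so one sampled
# row of X can pull in several rows of y
one_row = X.iloc[[0]]
print(len(y.loc[one_row.index]))  # 2 rows of y for 1 row of X

# Drawing integer positions and using .iloc on both objects avoids this
rng = np.random.default_rng(0)
pos = rng.choice(len(X), size=2, replace=False)
X_sample = X.iloc[pos]
y_sample = y.iloc[pos]
```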

unutbu · Accepted Answer · 2013-06-23T13:19:03.500

If you make rows a boolean array of length len(df), then you can get the True rows with df[rows] and get the False rows with df[~rows]:

import pandas as pd
import numpy as np
np.random.seed(2013)

df_source = pd.DataFrame(
    np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))

rows = np.random.randint(2, size=len(df_source)).astype('bool')

df_source_train = df_source[rows]
df_source_test = df_source[~rows]

print(rows)
# [ True  True False  True False]

# if for some reason you need the integer positions where `rows` is True
print(np.where(rows))
# (array([0, 1, 3]),)

print(df_source)
#           A         B
# 0  0.279545  0.107474
# 2  0.651458 -1.516999
# 4 -1.320541  0.679631
# 6  0.833612  0.492572
# 8  1.555721  1.741279

print(df_source_train)
#           A         B
# 0  0.279545  0.107474
# 2  0.651458 -1.516999
# 6  0.833612  0.492572

print(df_source_test)
#           A         B
# 4 -1.320541  0.679631
# 8  1.555721  1.741279

Thanks! Because I have to use a percentile, I've modified the line starting with rows = ... – Viacheslav Nefedov, Jun 23 '13 at 12:44
In that case, you could use `rows = np.random.binomial(1, percentile, size=len(df_source))`, with `percentile` given as a probability between 0 and 1. – unutbu, Jun 23 '13 at 13:20
Or rather, `rows = np.random.binomial(1, percentile, size=len(df_source)).astype('bool')` – unutbu, Jun 23 '13 at 14:37
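
Note that a binomial (or `randint`) mask marks each row independently, so the number of training rows varies from run to run. If an exact share is needed, one alternative is to shuffle the row positions and take the first part. The sketch below is not from the answer; the name `fraction` and the rounding rule are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(2013)
df_source = pd.DataFrame(
    np.random.randn(5, 2), index=range(0, 10, 2), columns=list("AB"))
df_target = pd.DataFrame(
    np.random.randn(5, 2), index=range(0, 10, 2), columns=list("CD"))

fraction = 0.6  # assumed training fraction, between 0 and 1
n_train = int(round(fraction * len(df_source)))

# Shuffle the row positions, then mark the first n_train of them as training rows
perm = np.random.permutation(len(df_source))
rows = np.zeros(len(df_source), dtype=bool)
rows[perm[:n_train]] = True

# The same boolean mask splits both dataframes consistently
df_source_train, df_source_test = df_source[rows], df_source[~rows]
df_target_train, df_target_test = df_target[rows], df_target[~rows]
```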

score 3 · Answer 3 · answered Sep 16 '20 at 08:03

I like Alexander's answer, but I would add an index reset before sampling. The full code:

# index reset
X.reset_index(inplace=True, drop=True)
y.reset_index(inplace=True, drop=True)
# sampling
X_sample = X.sample(10)
y_sample = y[X_sample.index]

The index is reset so that the labels of both objects line up and the matching does not run into problems.

score 1 · Answer 4 · answered Jan 12 '22 at 20:16

I like the answers from Alexander and pplonski. I just want to add that selecting rows with these indices might need iloc, as follows:

y_sample = y.iloc[X_sample.index]
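
A small sketch of why the index reset matters for this variant, using made-up data: `.iloc` treats the values it receives as integer positions, so passing index labels to it only picks the matching rows when the labels are 0..n-1, which is what `reset_index(drop=True)` produces.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with a non-default index 0, 2, 4, 6, 8
X = pd.DataFrame(
    np.arange(10).reshape(5, 2), index=range(0, 10, 2), columns=list("AB"))
y = pd.Series(np.arange(5) * 10, index=range(0, 10, 2))

# After the reset, labels and positions coincide (0, 1, 2, 3, 4)
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

X_sample = X.sample(3, random_state=0)

# Position-based and label-based lookups now return the same rows
by_position = y.iloc[X_sample.index]
by_label = y.loc[X_sample.index]
```

Without the reset, the labels 0, 2, 4, 6, 8 would be read as positions by `.iloc`, which can select the wrong rows or go out of bounds.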

answered Jan 12 '22 at 20:16 by VGonline

score 0 · Answer 5 · answered Feb 13 '21 at 02:47

I think an even simpler solution is:

from sklearn.model_selection import train_test_split

df_source_train, df_source_test, df_target_train, df_target_test = train_test_split(df_source, df_target, train_size=.6)
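
For completeness, a self-contained sketch of that call, assuming scikit-learn is installed; the toy data and the `random_state` value are made up. Arrays passed to `train_test_split` together are split with the same row selection, and `random_state` makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df_source = pd.DataFrame(
    np.random.randn(5, 2), index=range(0, 10, 2), columns=list("AB"))
df_target = pd.DataFrame(
    np.random.randn(5, 2), index=range(0, 10, 2), columns=list("CD"))

# Both dataframes are split with the same shuffled row selection
(df_source_train, df_source_test,
 df_target_train, df_target_test) = train_test_split(
    df_source, df_target, train_size=0.6, random_state=2013)
```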

answered Feb 13 '21 at 02:47 by B. Bogart
