
I would like to split my DataFrame into train and test sets, but the test set should contain blocks of, for example, 3 adjacent rows taken from the whole data multiple times. I don't know how to phrase the question properly, so please just look at the tables. I want to split my DataFrame by blocks.

All Data:

Y      row_num  x1          x2
value  1        some value  some other value
value  2        some value  some other value
value  3        some value  some other value
value  4        some value  some other value
value  5        some value  some other value
value  6        some value  some other value
value  7        some value  some other value
value  8        some value  some other value
value  9        some value  some other value
value  10       some value  some other value
value  11       some value  some other value

What I want:

train:

Y      row_num  x1          x2
value  1        some value  some other value
value  5        some value  some other value
value  6        some value  some other value
value  10       some value  some other value
value  11       some value  some other value

test:

Y      row_num  x1          x2
value  2        some value  some other value
value  3        some value  some other value
value  4        some value  some other value
value  7        some value  some other value
value  8        some value  some other value
value  9        some value  some other value
Rémi Coulaud
Minosoft

2 Answers


Would something like this be suitable?

Generate some random starting indexes from the length of your dataframe (n_samples is the number of blocks to draw):

import numpy as np

random_sample = np.random.choice(np.arange(0, len(df)), n_samples)

example with 3 samples:

array([15, 16, 10])

Then expand each item in that list into 3 consecutive numbers:

indexes = np.array(list(zip(*[random_sample + x for x in range(3)]))).flatten()

example:

array([15, 16, 17, 16, 17, 18, 10, 11, 12])

And index your dataframe:

df.iloc[indexes]
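Putting the steps above together, a minimal runnable sketch (the toy frame, block size, and sample count are assumptions for the demo). It uses numpy's newer Generator API and clamps the start range so a block never runs past the end of the frame, which the snippets above don't guard against; starts are still drawn with replacement, so blocks can overlap:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (column names are placeholders).
df = pd.DataFrame({"Y": range(20), "x1": range(20)})

n_adjacent = 3   # rows per block
n_samples = 2    # number of blocks

rng = np.random.default_rng(0)
# Restrict the starts so every block fits fully inside the frame.
starts = rng.choice(np.arange(0, len(df) - n_adjacent + 1), n_samples)
# Expand each start into n_adjacent consecutive positions.
indexes = (starts[:, None] + np.arange(n_adjacent)).ravel()

test = df.iloc[indexes]
# Everything not drawn into a block stays in the train set.
train = df.iloc[np.setdiff1d(np.arange(len(df)), indexes)]
```

Because the starts are sampled with replacement, test can contain the same row twice; sample the starts without replacement (or use the retry approach in the other answer) if that matters for your use case.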
Tom McLean

There might be a more elegant and/or efficient way to accomplish your goal. I don't yet have a solution in mind for randomly picking a fixed number of sets of n consecutive elements from a list (without replacement).

I would probably start by doing something like this, though:

import random
import numpy as np

def custom_split(df, train_size, n_adjacent=3):
    # Number of desired sets of n_adjacent consecutive rows.
    test_size = int(len(df)*(1-train_size)//n_adjacent)
    n_attempt = 10
    while n_attempt > 0:
        retry = False
        available_idx = list(range(len(df)))
        test_idx = []
        for _ in range(test_size):
            # If no more consecutive indices, it will try again from the beginning.
            if len(available_idx) < n_adjacent:
                retry = True
                n_attempt -= 1
                break
            # Choosing an idx from the available ones.
            add_idx = random.choice(available_idx[:-(n_adjacent-1)])
            # Extending with this index and the n_adjacent - 1 following ones.
            new_idx = list(range(add_idx, add_idx + n_adjacent))
            # Removing those indices from the available list,
            # also removing indices that are no longer
            # part of n_adjacent consecutive available ones.
            available_idx = [idx for idx in available_idx if idx not in new_idx \
                             and idx + n_adjacent - 1 not in new_idx]
            test_idx.extend(new_idx)
        if not retry:
            # It succeeded.
            # Masking the test_idx as False.
            train_idx = np.ones(len(df), dtype=bool)
            train_idx[test_idx] = False
            return df.iloc[train_idx, :], df.iloc[test_idx, :]
    # Raises an exception if failed 10 times.
    raise Exception("Could not find consecutive indices to randomly choose from.")

# 80% train, 20% test, rounding the train portion up.
# Thanks to the mask, the whole dataframe is represented.
train_set, test_set = custom_split(a_dataframe, train_size = 0.8, n_adjacent = 5)

The major issue with this solution is that you can run out of consecutive indices when calling random.choice. That's the reason for the while loop: it retries on failure, up to 10 times, after which it raises an exception.

The "idx" are not from the index column in the DataFrame, they are instead the locations of the rows in there axe. That's why I use them with iloc and not with loc.

Result with a 20 rows DataFrame, 70% train_size and 3 n_adjacent:

# IDX
# train:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16]
# test:
[17, 18, 19, 9, 10, 11]

Don't forget to shuffle the train set, or both sets, afterwards, according to your needs. Here is an elegant way to shuffle DataFrame rows: https://stackoverflow.com/a/34879805/10409093
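For reference, the linked answer boils down to pandas' own sample method; a small sketch (the tiny frame here is just a stand-in):

```python
import pandas as pd

train_set = pd.DataFrame({"Y": [1, 5, 6, 10, 11]})
# frac=1 draws 100% of the rows, i.e. returns the whole frame in random order;
# reset_index(drop=True) discards the shuffled positional labels.
shuffled = train_set.sample(frac=1, random_state=0).reset_index(drop=True)
```

Pass a random_state only if you need the shuffle to be reproducible.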

Whole Brain