Pandas random n samples of consecutive rows / pairs

Question

I have a pandas DataFrame indexed by ID and sorted by value. I want to draw a sample of n=20000 pairs (40000 rows in total), where the 2 rows of each pair are consecutive. I then want to perform additional calculations on each of these pairs of consecutive rows.

e.g. with a sample size of n=2, I want to randomly pick two pairs of consecutive rows and find the difference in distance within each pair.

Additional condition: value difference can't exceed 4000.

index       value   distance
cg13869341  15865   1.635450
cg14008030  18827   4.161332

Then the difference in distance for the next pair, etc.

cg20826792  29425   0.657369
cg33045430  29407   1.708055

Sample original dataframe

index       value   distance
cg13869341  15865   1.635450
cg14008030  18827   4.161332
cg12045430  29407   0.708055
cg20826792  29425   0.657369
cg33045430  69407   1.708055
cg40826792  59425   0.857369
cg47454306  88407   0.708055
cg60826792  96425   2.857369

I tried using df_sample = df.sample(n=20000), but then I got a bit lost trying to figure out how to get the next row for each row in df_sample.

original shape is (480136, 14)

Do you need exactly `2*n` rows? What is the original shape of the DataFrame? Is "index" a column or the real index? — mozway, Oct 25 '22 at 19:59
the original shape is `(480136, 14)`. `2*n` or `n/2`. `index` is the actual index. I was just highlighting that. Yes, the main thing for me is I need to sample in pairs of rows and perform additional calculations. no duplicates... — MonteCristo, Oct 25 '22 at 20:08
And would it be fine to always have a pair (odd, even) (never (even, odd))? I'm trying to think of the best strategy here ;) — mozway, Oct 25 '22 at 20:14
It doesn't matter if it's `(odd, even)` or `(even, odd)`. Just need it to be random pairs. I understand :). — MonteCristo, Oct 25 '22 at 20:17

mozway · Accepted Answer · 2022-10-25T20:39:21.210

2

If it doesn't matter that the pairs are always (odd, even) rows (which makes the sample slightly less random), you can select N odd rows and take the even row that follows each of them:

N = 20000
# get the indices of N random ODD rows
idx = df.loc[::2].sample(n=N).index

# create a boolean mask to identify the rows
m = df.index.to_series().isin(idx)

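# select those OR the next ones
# (fill_value=False keeps the shifted mask boolean, with no NaN in the first row)
df_sample = df.loc[m | m.shift(fill_value=False)]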

Example output on the toy DataFrame (N=3):

        index  value  distance
2  cg12045430  29407  0.708055
3  cg20826792  29425  0.657369
4  cg33045430  69407  1.708055
5  cg40826792  59425  0.857369
6  cg47454306  88407  0.708055
7  cg60826792  96425  2.857369
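To see why combining the mask with its shifted copy yields pairs, here is a minimal sketch on a hand-made boolean Series. The labels a to f are made up for illustration, and `fill_value=False` is used so that the shifted mask stays boolean.

```python
import pandas as pd

# Hypothetical mask: rows "b" and "e" were picked as the first row of a pair.
m = pd.Series([False, True, False, False, True, False], index=list("abcdef"))

# Shifting by one row moves each True down onto the row that follows it.
shifted = m.shift(fill_value=False)

# OR-ing both masks keeps every picked row plus the row right after it.
selected = m | shifted

print(pd.DataFrame({"m": m, "shifted": shifted, "selected": selected}))
```

Rows b, c, e and f end up selected, i.e. the pairs (b, c) and (e, f).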

increasing randomness

The drawback of the above approach is that it is biased: the pairs are always (odd, even). To overcome this, we can first remove a random fraction of the DataFrame, small enough to leave plenty of rows to pick from, but large enough to randomly shift the (odd, even) pairs to (even, odd) pairs at many locations. The fraction of rows to remove should be tuned to the initial size and the sample size. I used 20-30% here:

N = 20000
frac = 0.2

idx = (df
   .drop(df.sample(frac=frac).index)
   .loc[::2].sample(n=N)
   .index
 )

m = df.index.to_series().isin(idx)
df_sample = df.loc[m | m.shift(fill_value=False)]

# check:
# len(df_sample)
# 40000
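The answer stops at selecting the rows. As a hedged sketch of the follow-up step the question asks about, the pairs can be laid side by side by taking every other row of `df_sample`. This assumes `df_sample` alternates (first, second) rows, which holds for the selection above as long as the last row of `df` is never picked as a first row. The column names `first_id`, `second_id`, `value_diff` and `distance_diff` are made up for illustration, and the toy data is copied from the question.

```python
import numpy as np
import pandas as pd

# Toy data from the question.
df = pd.DataFrame(
    {
        "value": [15865, 18827, 29407, 29425, 69407, 59425, 88407, 96425],
        "distance": [1.635450, 4.161332, 0.708055, 0.657369,
                     1.708055, 0.857369, 0.708055, 2.857369],
    },
    index=["cg13869341", "cg14008030", "cg12045430", "cg20826792",
           "cg33045430", "cg40826792", "cg47454306", "cg60826792"],
)

N = 3
# Same selection as in the answer; random_state only makes the sketch repeatable.
idx = df.iloc[::2].sample(n=N, random_state=0).index
m = df.index.to_series().isin(idx)
df_sample = df.loc[m | m.shift(fill_value=False)]

# Rows 0, 2, 4, ... are the picked rows; rows 1, 3, 5, ... are their successors.
first = df_sample.iloc[0::2]
second = df_sample.iloc[1::2]

pairs = pd.DataFrame({
    "first_id": first.index.to_numpy(),
    "second_id": second.index.to_numpy(),
    "value_diff": np.abs(second["value"].to_numpy() - first["value"].to_numpy()),
    "distance_diff": second["distance"].to_numpy() - first["distance"].to_numpy(),
})

# Apply the "value difference can't exceed 4000" condition after pairing.
pairs = pairs[pairs["value_diff"] <= 4000]
print(pairs)
```

Filtering after sampling can leave fewer than N pairs, so this only fits when an exact sample size is not required.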

edited Oct 25 '22 at 20:39

answered Oct 25 '22 at 20:21

mozway


there was one more thing. I just remembered. value difference can't exceed 4000. theoretically, it shouldn't happen because of the nature of the data. However, practically it could happen. – MonteCristo Oct 25 '22 at 20:23
I think you're missing a dot in the `m=` line, but more interestingly, what on earth is going on in the last line, with the pipe operator? – butterflyknife Oct 25 '22 at 20:23
Yes I fixed a few typos (typing from my phone) – mozway Oct 25 '22 at 20:24
The pipe is a boolean OR to combine the mask `m` with its shifted version to select the rows below those randomly chosen in the first step. – mozway Oct 25 '22 at 20:28
Regarding the diff < 4000 constraint you can pre-filter to remove those cases before the random selection (please provide code to set up a large reproducible example if you need help with that) – mozway Oct 25 '22 at 21:01
pre-filtering would not work in this instance. value is positional / neighbours. So if I remove that they won't be neighbors/pairs anymore. – MonteCristo Oct 26 '22 at 10:42
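One possible way to respect the constraint without removing any rows, sketched here and not taken from the thread, is to flag which rows are eligible as the first row of a pair and sample only from those. Every row stays in place, so neighbours remain neighbours. The data below is a made-up stand-in (1000 rows, sorted by `value`), and `eligible` and `candidates` are illustrative names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: 1000 rows sorted by value.
n_rows = 1000
df = pd.DataFrame(
    {
        "value": np.sort(rng.integers(0, 2_000_000, size=n_rows)),
        "distance": rng.random(n_rows),
    },
    index=[f"cg{i:08d}" for i in range(n_rows)],
)

N = 100

# Gap between each row and the row below it; the last row gets NaN.
gap_to_next = df["value"].shift(-1) - df["value"]
# NaN compares as False, so the last row can never start a pair.
eligible = gap_to_next.abs() <= 4000

# Candidate first rows: every other row (so pairs cannot overlap) that is eligible.
every_other = df.iloc[::2]
candidates = every_other[eligible.iloc[::2]]

idx = candidates.sample(n=N, random_state=0).index
m = df.index.to_series().isin(idx)
df_sample = df.loc[m | m.shift(fill_value=False)]
```

This keeps exactly N pairs, provided at least N candidates are eligible; otherwise `sample` raises an error.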

butterflyknife · Answer 2 · 2022-10-25T21:21:23.953

1

Here's my first attempt (I only just noticed your additional constraint, and I'm not sure if you need the precise number of samples, in which case, you'll have to do some fudging after the line c=c[mask] below).

import random

import numpy as np

# Temporarily reset index so we can have something that we can add one to.
df = df.reset_index(level=0)

# Choose the first index of each pair (4 pairs here, as a toy example).
# Use random.sample if you don't want repeats,
# or random.choices if you don't mind them.
# The code below does allow overlapping pairs such as (1,2) and (2,3).
c = np.array(random.sample(sorted(df.index[:-1]), 4))

# Keep only those indices where the diff with the next row down is at most 4000.
mask = np.array([abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000 for i in c])
c = c[mask]

# Interleave this array with the same numbers, plus 1.
pair_indices = np.empty((c.size * 2,), dtype=c.dtype)
pair_indices[0::2] = c
pair_indices[1::2] = c + 1

# Filter
df_sample = df[df.index.isin(pair_indices)]

# Restore original index if required.
df = df.set_index("index")

Hope that helps. Regarding the bit where I use a mask to filter c, this answer might be of help if you need faster alternatives: Filtering (reducing) a NumPy Array
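As a sketch of one faster alternative to building the mask row by row, the same filter can be computed with NumPy positional indexing. This assumes the DataFrame has a default integer index (as after `reset_index`), so labels and positions coincide; the values are the toy data from the question and the chosen first indices are fixed by hand for illustration.

```python
import numpy as np
import pandas as pd

# Toy "value" column from the question, with a default integer index.
df = pd.DataFrame(
    {"value": [15865, 18827, 29407, 29425, 69407, 59425, 88407, 96425]}
)

# Hypothetical first indices of four candidate pairs.
c = np.array([0, 2, 4, 6])

values = df["value"].to_numpy()

# Compare each chosen row with the row right below it, all at once.
mask = np.abs(values[c + 1] - values[c]) <= 4000
c = c[mask]

print(c)  # [0 2]
```

Only the pairs starting at rows 0 and 2 survive, since the other two gaps (9982 and 8018) exceed 4000.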

edited Oct 25 '22 at 21:21

answered Oct 25 '22 at 20:53

butterflyknife


This does allow overlapping pairs (1,2)/(2,3), which is not wanted. – mozway Oct 25 '22 at 20:59
@mozway where does the question say that? – butterflyknife Oct 25 '22 at 21:21
In the [comments](https://stackoverflow.com/questions/74199513/pandas-random-n-samples-of-consecutive-rows-pairs/74200069?noredirect=1#comment131004292_74199513) – mozway Oct 25 '22 at 21:22
@mozway ugh, gets me every time. Ho hum, ain't nobody got time to hunt out specifications hidden away like that :-D – butterflyknife Oct 25 '22 at 21:24
can you tell me what `c` is? there is an error in there – MonteCristo Nov 15 '22 at 23:21
