3

I'd like to create a new column by randomly sampling data from the remaining columns.

Consider a dataframe with "N" columns as follows:

|---------------------|------------------|---------------------|
|      Column 1       |     Column 2     |      Column N       |
|---------------------|------------------|---------------------|
|          0.37       |         0.8      |          0.0        |
|---------------------|------------------|---------------------|
|          0.0        |         0.0      |          0.8        |
|---------------------|------------------|---------------------|

The resulting dataframe should look like

|---------------------|------------------|---------------------|---------------|
|      Column 1       |     Column 2     |      Column N       |     Sampled   |
|---------------------|------------------|---------------------|---------------|
|          0.37       |         0.8      |          0.0        |       0.8     |
|---------------------|------------------|---------------------|---------------|
|          0.0        |         0.0      |          B          |        B      |
|---------------------|------------------|---------------------|---------------|
|          A          |         5        |          0.8        |        A      |
|---------------------|------------------|---------------------|---------------|

The "Sampled" column's entries are created by randomly choosing one of the corresponding entries of the "N" columns. For example, "0.8" was chosen from Column 2, "B" from Column N, and so on.

df.sample(axis=1) simply chooses one column and returns it. This is NOT what I want.

What would be the fastest way to achieve this? The method needs to be efficient as the original dataframe is big with lots of rows and columns.

4 Answers4

6

You can use the underlying numpy array and select a random index per row.

u = df.values
r = np.random.randint(0, u.shape[1], u.shape[0])

df.assign(Sampled=u[np.arange(u.shape[0]), r])

  Column 1  Column 2 Column N Sampled
0     0.37       0.8      0.0    0.37
1      0.0       0.0        B       B
2        A       5.0      0.8       A
user3483203
  • 50,081
  • 9
  • 65
  • 94
5

Pandas base lookup + sample

s=df.columns.to_series().sample(len(df),replace = True)
df['New']=df.lookup(df.index,s)
df
Out[177]: 
  Column1  Column2 ColumnN  New
0    0.37      0.8     0.0  0.8
1     0.0      0.0       B    B
2       A      5.0     0.8    A
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Note that [lookup has been deprecated](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.lookup.html) in pandas v1.2.0 and one should use [`melt`+`loc`](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-lookup) instead. – My Work May 20 '21 at 13:21
2

One option is to apply np.random.choice to to the dataframe, along the rows. This may or may not give you the performance you require, but I leave that up to you to decide

Setup: DF with 4 columns, 11000 rows

df=pd.DataFrame({'a':[np.random.rand() for i in range(11000)],'b':[np.random.rand() for i in range(11000)],
                 'c':[np.random.rand() for i in range(11000)],'d':[np.random.rand() for i in range(11000)]})

%timeit df['e']=df.apply(lambda x: np.random.choice(x), axis=1)

193 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Additional benchmarks:

Adding x.values into the lambda appears to improve the speed by approximately 20%. However, @wen-ben's solution is a 100-fold improvement on this method on the same dataframe

1.91 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

By request, here is the timing for user3483203's answer, may be even better (I had to do some things to it to make it work with the timing magic, so ymmv)

%%timeit
df1=df.copy()
u = df.values
r = np.random.randint(0, u.shape[1], u.shape[0])

df1=df1.assign(Sampled=u[np.arange(u.shape[0]), r])

590 µs ± 37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
G. Anderson
  • 5,815
  • 2
  • 14
  • 21
2
from random import choice
df['sample'] =  df.apply(lambda x:choice(x.values),axis =1)
Akhilesh_IN
  • 1,217
  • 1
  • 13
  • 19