Create new column by random sampling of other columns data

Question

I'd like to create a new column by randomly sampling data from the remaining columns.

Consider a dataframe with "N" columns as follows:

|---------------------|------------------|---------------------|
|      Column 1       |     Column 2     |      Column N       |
|---------------------|------------------|---------------------|
|          0.37       |         0.8      |          0.0        |
|---------------------|------------------|---------------------|
|          0.0        |         0.0      |          B          |
|---------------------|------------------|---------------------|
|          A          |         5        |          0.8        |
|---------------------|------------------|---------------------|

The resulting dataframe should look like

|---------------------|------------------|---------------------|---------------|
|      Column 1       |     Column 2     |      Column N       |     Sampled   |
|---------------------|------------------|---------------------|---------------|
|          0.37       |         0.8      |          0.0        |       0.8     |
|---------------------|------------------|---------------------|---------------|
|          0.0        |         0.0      |          B          |        B      |
|---------------------|------------------|---------------------|---------------|
|          A          |         5        |          0.8        |        A      |
|---------------------|------------------|---------------------|---------------|

The "Sampled" column's entries are created by randomly choosing one of the corresponding entries of the "N" columns. For example, "0.8" was chosen from Column 2, "B" from Column N, and so on.

df.sample(axis=1) simply chooses one column and returns it. This is NOT what I want.

What would be the fastest way to achieve this? The method needs to be efficient as the original dataframe is big with lots of rows and columns.

score 6 · Answer 1 · answered Apr 09 '19 at 17:42

You can use the underlying numpy array and select a random index per row.

import numpy as np

u = df.values
# One random column position per row
r = np.random.randint(0, u.shape[1], u.shape[0])

# Pick u[i, r[i]] for every row i
df.assign(Sampled=u[np.arange(u.shape[0]), r])

  Column 1  Column 2 Column N Sampled
0     0.37       0.8      0.0    0.37
1      0.0       0.0        B       B
2        A       5.0      0.8       A
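A self-contained sketch of the same idea, assuming the three-row example frame from the question. The helper name `sample_per_row` and the `seed` parameter are illustrative and not part of the original answer; a seeded `np.random.default_rng` generator is swapped in for `np.random.randint` so that runs can be repeated.

```python
import numpy as np
import pandas as pd


def sample_per_row(df, seed=None):
    """Return one randomly chosen value from each row of df (hypothetical helper)."""
    values = df.to_numpy()
    rng = np.random.default_rng(seed)
    # One random column position per row.
    col_pos = rng.integers(0, values.shape[1], size=values.shape[0])
    # Pair each row position with its column position.
    return values[np.arange(values.shape[0]), col_pos]


# Example frame assumed from the question's table.
df = pd.DataFrame({
    "Column 1": [0.37, 0.0, "A"],
    "Column 2": [0.8, 0.0, 5],
    "Column N": [0.0, "B", 0.8],
})
out = df.assign(Sampled=sample_per_row(df, seed=0))
```

Because `assign` returns a new frame, `df` itself is left unchanged; passing the same `seed` again should reproduce the same picks.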

score 5 · Accepted Answer · answered Apr 09 '19 at 17:46

Pandas base lookup + sample

s = df.columns.to_series().sample(len(df), replace=True)
df['New'] = df.lookup(df.index, s)
df
Out[177]: 
  Column1  Column2 ColumnN  New
0    0.37      0.8     0.0  0.8
1     0.0      0.0       B    B
2       A      5.0     0.8    A

BENY

Note that [lookup has been deprecated](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.lookup.html) in pandas v1.2.0 and one should use [`melt`+`loc`](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-lookup) instead. – My Work May 20 '21 at 13:21
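On pandas versions where `DataFrame.lookup` is unavailable, one possible replacement is to turn the sampled column labels into positions with `Index.get_indexer` and index the underlying NumPy array directly. This is a sketch under assumptions, not the accepted answer's code: the example frame, the seed, and the variable names `picked` and `col_pos` are illustrative.

```python
import numpy as np
import pandas as pd

# Example frame assumed from the question's table.
df = pd.DataFrame({
    "Column1": [0.37, 0.0, "A"],
    "Column2": [0.8, 0.0, 5],
    "ColumnN": [0.0, "B", 0.8],
})

rng = np.random.default_rng(0)
# Sample one column label per row, with replacement.
picked = rng.choice(df.columns.to_numpy(), size=len(df))
# Translate labels into integer column positions.
col_pos = df.columns.get_indexer(picked)
# Row i takes the value at (i, col_pos[i]).
df["New"] = df.to_numpy()[np.arange(len(df)), col_pos]
```

Sampling labels rather than positions keeps the spirit of the `lookup` approach; if the labels are not needed, sampling integer positions directly skips the `get_indexer` step.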

score 2 · Answer 3 · G. Anderson · 2019-04-09T18:41:29.193

One option is to apply np.random.choice to the dataframe, along the rows. This may or may not give you the performance you require, but I leave that up to you to decide.

Setup: DF with 4 columns, 11000 rows

import numpy as np
import pandas as pd

# Four columns of 11000 uniform random floats each
df = pd.DataFrame({c: np.random.rand(11000) for c in 'abcd'})

%timeit df['e']=df.apply(lambda x: np.random.choice(x), axis=1)

193 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Additional benchmarks:

Adding `x.values` into the lambda appears to improve the speed by approximately 20%. However, @wen-ben's solution is a 100-fold improvement on this method on the same dataframe:

1.91 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

By request, here is the timing for user3483203's answer, which may be even better. (I had to modify it slightly to make it work with the timing magic, so YMMV.)

%%timeit
df1=df.copy()
u = df.values
r = np.random.randint(0, u.shape[1], u.shape[0])

df1=df1.assign(Sampled=u[np.arange(u.shape[0]), r])

590 µs ± 37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
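For anyone who wants to repeat the comparison outside IPython, here is a minimal sketch using the standard-library `timeit` module. The function names `with_apply` and `with_numpy`, the seed, and the repeat count are assumptions made for illustration; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as the setup above: 4 columns, 11000 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((11000, 4)), columns=list("abcd"))


def with_apply(frame):
    # Row-wise Python call: one np.random.choice per row.
    return frame.apply(lambda row: np.random.choice(row.values), axis=1)


def with_numpy(frame):
    # Vectorised: one random column position per row, then integer indexing.
    u = frame.to_numpy()
    r = np.random.randint(0, u.shape[1], u.shape[0])
    return frame.assign(Sampled=u[np.arange(u.shape[0]), r])


for fn in (with_apply, with_numpy):
    seconds = timeit.timeit(lambda: fn(df), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Both functions return their result instead of writing a new column into `df`, so every timed call sees the same four input columns.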

edited Apr 09 '19 at 18:41

answered Apr 09 '19 at 17:44

G. Anderson

Would you like to test the timing of the lookup method? – BENY Apr 09 '19 at 17:47
Yeah, you definitely win! – G. Anderson Apr 09 '19 at 17:49
@G.Anderson Thank you for timing all the methods! – Harshavardhan Ramanna Apr 09 '19 at 18:06
@HarshavardhanRamanna you are very welcome! `%timeit/%%timeit` are very useful ipython magic functions to know of and become familiar with, especially if you use jupyter notebooks for testing. – G. Anderson Apr 09 '19 at 18:09
Mind adding the timings for mine? – user3483203 Apr 09 '19 at 18:29
Wow, `df.assign` _really_ doesn't like to play well with `%%timeit` haha! That said, I think we have a winner. Editing now. – G. Anderson Apr 09 '19 at 18:39
Assign should work with timeit, it doesn't modify the dataframe, but returns a copy, it works for me in IPython. Also, in your test code, you make a copy of the DataFrame, but still use `df.values` for `u`, which seems to defeat the purpose of making a copy. – user3483203 Apr 09 '19 at 18:52
Without the `df.copy`, for some reason it threw a `reference before assignment` error. No idea why! – G. Anderson Apr 09 '19 at 19:17

score 2 · Answer 4 · answered Apr 09 '19 at 17:45

from random import choice

df['sample'] = df.apply(lambda x: choice(x.values), axis=1)

Akhilesh_IN

Your solution with `.values` improves on my efficiency by approximately 30ms on my dataset, nice! – G. Anderson Apr 09 '19 at 17:53
