
Consider the dataframe containing N columns as shown below. Each entry is an 8-bit integer.

|---------------------|------------------|---------------------|
|      Column 1       |     Column 2     |      Column N       |
|---------------------|------------------|---------------------|
|          4          |         8        |          13         |
|---------------------|------------------|---------------------|
|          0          |         32       |          16         |
|---------------------|------------------|---------------------|

I'd like to create a new column with an 8-bit entry in each row, where each bit is randomly sampled from the bits at the same position in the existing columns of that row. So, the resulting dataframe would look like:

|---------------------|------------------|---------------------|---------------|
|      Column 1       |     Column 2     |      Column N       |     Sampled   |
|---------------------|------------------|---------------------|---------------|
|      4 = (100)      |     8 = (1000)   |    13 = (1101)      |   5 = (0101)  |
|---------------------|------------------|---------------------|---------------|
|      0 = (0)        |    32 = (100000) |   16 = (10000)      | 48 = (110000) |
|---------------------|------------------|---------------------|---------------|

The first entry in the "sampled" column was created by selecting one bit among all possible bits for the same position. For example, the LSB=1 in the first entry was chosen from {0 (LSB from col 1), 0 (LSB from col 2), 1 (LSB from col N)}, and so on.
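To make the per-bit selection concrete, here is a minimal sketch for a single row. The function name `sample_bits` and its parameters are made up for illustration, and it assumes each bit position is drawn uniformly from the columns:

```python
import numpy as np

def sample_bits(row, n_bits=8, rng=None):
    """Build one integer by choosing, for every bit position,
    that bit from a randomly selected entry of `row`."""
    rng = np.random.default_rng() if rng is None else rng
    row = np.asarray(row, dtype=np.uint8)
    result = 0
    for pos in range(n_bits):
        candidates = (row >> pos) & 1      # bit `pos` of every column
        chosen = rng.choice(candidates)    # pick one candidate uniformly
        result |= int(chosen) << pos       # place it at position `pos`
    return result

# First row of the example: 4, 8, 13
print(sample_bits([4, 8, 13]))
```

A bit can only be 1 in the result if at least one column has a 1 at that position, so for the first row the result never sets a bit outside 1101.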

This is similar to this question but instead of each entry being sampled, each bit needs to be sampled.

What is an efficient way of achieving this, considering the dataframe has a large number of rows and columns? From the similar question, I assume we need a lookup + sample to choose the entry and another sample to choose the bits?

1 Answer


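The sampling logic is the same as before, but here I convert between decimal and binary twice: each value is turned into a list of its bits, the lists are unnested into one row per bit, one bit is sampled per row, and the result is joined back and converted to decimal.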
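As a quick illustration of the two conversions (a sketch on a single value, not part of the dataframe code):

```python
value = 13

# decimal -> list of 8 bit characters, most significant bit first
bits = list('{0:08b}'.format(value))
print(bits)      # ['0', '0', '0', '0', '1', '1', '0', '1']

# list of bit characters -> decimal
restored = int(''.join(bits), 2)
print(restored)  # 13
```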

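import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Expand list-valued columns so that every list element gets its own row;
    # the original row label is repeated in the index.
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(columns=explode), how='left')

# Example data matching the output below
df=pd.DataFrame({'c1': [4, 0], 'c2': [8, 32], 'cn': [13, 16]})

# Each value becomes a list of its 8 bits (applymap is named DataFrame.map in pandas >= 2.1)
df1=df.applymap(lambda x : list('{0:08b}'.format(x)))

# One row per bit, indexed by the original row label
df1=unnesting(df1,df1.columns.tolist())
# For every bit row, pick a random source column
s=np.random.randint(0, df1.shape[1], df1.shape[0])

# Gather the chosen bits and join them into one binary string per original row
yourcol=pd.Series(df1.values[np.arange(len(df1)),s]).groupby(df1.index).apply(''.join)

# Convert the binary strings back to integers
df['Sampled']=yourcol.map(lambda x : int(x,2))

df
Out[268]: 
   c1  c2  cn  Sampled
0   4   8  13       12
1   0  32  16       16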
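If the string round trip turns out to be slow on a large dataframe, a purely NumPy-based variant may be worth trying. This is only a sketch of a different technique (bit masks instead of binary strings), not the method above. The name `sample_bits_vectorized` is hypothetical, and it assumes all columns hold values that fit in 8 bits and that each bit is drawn uniformly from the columns:

```python
import numpy as np
import pandas as pd

def sample_bits_vectorized(df, n_bits=8, rng=None):
    """For every row and bit position, take that bit from a randomly chosen column."""
    rng = np.random.default_rng() if rng is None else rng
    values = df.to_numpy(dtype=np.uint8)                  # shape (rows, cols)
    n_rows, n_cols = values.shape
    # src[i, b] = index of the column that supplies bit b of row i
    src = rng.integers(0, n_cols, size=(n_rows, n_bits))
    # picked[i, b] = values[i, src[i, b]]
    picked = np.take_along_axis(values, src, axis=1)
    # keep only bit b of the value picked for position b, then combine the bits
    masks = (1 << np.arange(n_bits)).astype(np.uint8)
    return np.bitwise_or.reduce(picked & masks, axis=1)

df = pd.DataFrame({'c1': [4, 0], 'c2': [8, 32], 'cn': [13, 16]})
df['Sampled'] = sample_bits_vectorized(df[['c1', 'c2', 'cn']])
print(df)
```

This draws one random column index per row and bit position and never builds Python strings, so memory use grows with rows × 8 rather than with the number of characters.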
BENY