Populating an even distribution of values across multiple axes?

Question

Basic Example:

# Given params such as:
params = {
    'cols': 8,
    'rows': 4, 
    'n': 4
}
# I'd like to produce (or equivalent):
       col0  col1  col2  col3  col4  col5  col6  col7
row_0     0     1     2     3     0     1     2     3
row_1     1     2     3     0     1     2     3     0
row_2     2     3     0     1     2     3     0     1
row_3     3     0     1     2     3     0     1     2

Axis Value Counts:

Where both axes have an equal distribution of values:

df.apply(lambda x: x.value_counts(), axis=1)

       0  1  2  3
row_0  2  2  2  2
row_1  2  2  2  2
row_2  2  2  2  2
row_3  2  2  2  2

df.apply(lambda x: x.value_counts())

   col0  col1  col2  col3  col4  col5  col6  col7
0     1     1     1     1     1     1     1     1
1     1     1     1     1     1     1     1     1
2     1     1     1     1     1     1     1     1
3     1     1     1     1     1     1     1     1

My attempt thus far:

import itertools
import numpy as np
import pandas as pd

def create_df(cols, rows, n):
    x = itertools.cycle(list(itertools.permutations(range(n))))
    df = pd.DataFrame(index=range(rows), columns=range(cols))
    df[:] = np.reshape([next(x) for _ in range((rows*cols)//n)], (rows, cols))
    #df = df.T.add_prefix('row_').T
    #df = df.add_prefix('col_')
    return df 

params = {
    'cols': 8,
    'rows': 4, 
    'n': 4
}
df = create_df(**params)

Output:

   0  1  2  3  4  5  6  7
0  0  1  2  3  0  1  3  2
1  0  2  1  3  0  2  3  1
2  0  3  1  2  0  3  2  1
3  1  0  2  3  1  0  3  2

# Correct on this Axis:
>>> df.apply(lambda x: x.value_counts(), axis=1)
   0  1  2  3
0  2  2  2  2
1  2  2  2  2
2  2  2  2  2
3  2  2  2  2

# Incorrect on this Axis:
>>> df.apply(lambda x: x.value_counts())
     0  1    2    3    4  5    6    7
0  3.0  1  NaN  NaN  3.0  1  NaN  NaN
1  1.0  1  2.0  NaN  1.0  1  NaN  2.0
2  NaN  1  2.0  1.0  NaN  1  1.0  2.0
3  NaN  1  NaN  3.0  NaN  1  3.0  NaN

So, I have the conditions I need on one axis, but not on the other.

How can I update my method/create a method to meet both conditions?

Hugh Mungus · Answer 1 · 2022-07-14T08:49:22.170

You can use numpy.roll:

def create_df(cols, rows, n):
    x = itertools.cycle(range(n))
    arr = [np.roll([next(x) for _ in range(cols)], -i) for i in range(rows)]
    return pd.DataFrame(arr)

Output (with given test case):

   0  1  2  3  4  5  6  7
0  0  1  2  3  0  1  2  3
1  1  2  3  0  1  2  3  0
2  2  3  0  1  2  3  0  1
3  3  0  1  2  3  0  1  2
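
To confirm that a result meets both conditions from the question, a small check along these lines may help. This is only a sketch: `is_balanced` is a hypothetical helper name, it assumes the values are the integers 0..n-1, and `create_df` here is a slight variant of the function above that builds the base row once.

```python
import itertools

import numpy as np
import pandas as pd


def create_df(cols, rows, n):
    # Variant of the approach above: build one cycled row, then roll it
    # left by i positions for row i.
    x = itertools.cycle(range(n))
    base = [next(x) for _ in range(cols)]
    return pd.DataFrame([np.roll(base, -i) for i in range(rows)])


def is_balanced(df, n):
    # Hypothetical helper: True when every row and every column contains
    # each of the values 0..n-1 equally often.
    values = df.to_numpy()
    for axis in (0, 1):
        # counts[k] holds, per column (axis=0) or per row (axis=1),
        # how many times the value k occurs.
        counts = np.stack([(values == k).sum(axis=axis) for k in range(n)])
        if not (counts == counts[0]).all():
            return False
    return True


df = create_df(cols=8, rows=4, n=4)
print(is_balanced(df, n=4))
```

The same helper can be pointed at the output of the original attempt in the question to see which axis fails.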

Edit: In Python 3.8+ you can use the := operator (which is significantly faster than my answer above):

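def create_df(cols, rows, n):
    x = itertools.cycle(range(n))
    row = [next(x) for _ in range(cols)]  # separate name, so the parameter n is not shadowed
    # Rotate left by one on each iteration; the first row emitted is already shifted.
    arr = [row := row[1:] + row[:1] for _ in range(rows)]
    return pd.DataFrame(arr)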

Output (again with given test case):

   0  1  2  3  4  5  6  7
0  1  2  3  0  1  2  3  0
1  2  3  0  1  2  3  0  1
2  3  0  1  2  3  0  1  2
3  0  1  2  3  0  1  2  3

score 1 · Accepted Answer · answered Jul 12 '22 at 19:54

You can tile your input and use a custom roll to shift each row independently:

c = params['cols']
r = params['rows']
n = params['n']
a = np.arange(params['n']) # or any input

b = np.tile(a, (r, c//n))
# array([[0, 1, 2, 3, 0, 1, 2, 3],
#        [0, 1, 2, 3, 0, 1, 2, 3],
#        [0, 1, 2, 3, 0, 1, 2, 3],
#        [0, 1, 2, 3, 0, 1, 2, 3]])

idx = np.arange(r)[:, None]
shift = (np.tile(np.arange(c), (r, 1)) - np.arange(r)[:, None])

df = pd.DataFrame(b[idx, shift])

Output:

   0  1  2  3  4  5  6  7
0  0  1  2  3  0  1  2  3
1  3  0  1  2  3  0  1  2
2  2  3  0  1  2  3  0  1
3  1  2  3  0  1  2  3  0

Alternative order:

idx = np.arange(r)[:, None]
shift = (np.tile(np.arange(c), (r, 1)) + np.arange(r)[:, None]) % c

df = pd.DataFrame(b[idx, shift])

Output:

   0  1  2  3  4  5  6  7
0  0  1  2  3  0  1  2  3
1  1  2  3  0  1  2  3  0
2  2  3  0  1  2  3  0  1
3  3  0  1  2  3  0  1  2

Other alternative: use a custom strided_indexing_roll function.
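
For the specific case where the input is `np.arange(n)`, the "alternative order" grid can also be written directly with broadcasting, since entry (i, j) is just (i + j) % n. The following is a minimal sketch under that assumption, not the `strided_indexing_roll` approach mentioned above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rows, cols, n = 4, 8, 4  # the question's test case

# A column of row indices plus a row of column indices broadcasts to a
# (rows, cols) grid of i + j; taking it modulo n gives the shifted pattern.
grid = (np.arange(rows)[:, None] + np.arange(cols)) % n
df = pd.DataFrame(grid)
print(df)
```

For arbitrary input values `a` of length `n`, indexing with `a[grid]` maps the pattern onto those values.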

answered Jul 12 '22 at 19:54

mozway


There are cases where this doesn't appear to work. For example, with `c=20, r=10, n=6` I get `IndexError: index 18 is out of bounds for axis 1 with size 18` – BeRT2me Jul 13 '22 at 15:10
@BeRT2me it should be expected, 20 is not a multiple of 6. If you have a non multiple and you repeat the values, you won't have equal counts. (Maybe your example is misleading?) – mozway Jul 13 '22 at 15:12
Oh, interesting. I guess it's an issue of (me) not fully understanding what `np.tile` does. But, having a check for actual validity of output is certainly a desirable attribute! – BeRT2me Jul 13 '22 at 15:16
Yes, checking validity is important, you can add a quick check that `c%n == 0` ;) – mozway Jul 13 '22 at 15:18
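
The check suggested in the comments could look roughly like this. It is a sketch only: `create_balanced_df` is a hypothetical name, and the extra `rows % n` test is an assumption added here because equal counts down each column need the row count to be a multiple of `n` as well.

```python
import numpy as np
import pandas as pd


def create_balanced_df(cols, rows, n):
    # Hypothetical wrapper, not taken from the answers above.
    # Each row can hold the n values equally often only if cols is a
    # multiple of n; the same holds for columns and rows.
    if cols % n or rows % n:
        raise ValueError(
            f"cols={cols} and rows={rows} must both be multiples of n={n}"
        )
    base = np.tile(np.arange(n), cols // n)
    # Row i is the base row rolled left by i positions.
    return pd.DataFrame([np.roll(base, -i) for i in range(rows)])


print(create_balanced_df(cols=8, rows=4, n=4))

try:
    create_balanced_df(cols=20, rows=10, n=6)
except ValueError as exc:
    print(exc)
```

Failing early with a `ValueError` avoids the less informative `IndexError` reported above for `c=20, r=10, n=6`.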
