0

I have a dataframe mentioned below, which is the result of pivoting a datarame using pd.pivot_table passing parameter agg_func as sum:

                                          counts           
RACE                   BLACK OR AFRICAN AMERICAN WHITE  All
ETHNIC                                                     
HISPANIC OR LATINO                            11    41   52
NOT HISPANIC OR LATINO                        15    71   86
All                                           26   112  138

You can run the code below to load above dataframe into a variable df:

df = pd.DataFrame.from_dict({('counts', 'BLACK OR AFRICAN AMERICAN'): {'HISPANIC OR LATINO': 11, 'NOT HISPANIC OR LATINO': 15, 'All': 26}, ('counts', 'WHITE'): {'HISPANIC OR LATINO': 41, 'NOT HISPANIC OR LATINO': 71, 'All': 112}, ('counts', 'All'): {'HISPANIC OR LATINO': 52, 'NOT HISPANIC OR LATINO': 86, 'All': 138}}).rename_axis((None, 'RACE'), axis=1).rename_axis((['ETHNIC']), axis=0)

I'm trying to transform this dataframe as mentioned below:

Expected Output:

  level varName                      value counts
0     2  ETHNIC         HISPANIC OR LATINO    52
1     1    RACE  BLACK OR AFRICAN AMERICAN    11
2     1    RACE                      WHITE    41
3     2  ETHNIC     NOT HISPANIC OR LATINO    86
4     1    RACE  BLACK OR AFRICAN AMERICAN    15
5     1    RACE                      WHITE    71

The varName field above represents the axis names for columns and rows, level is 1 for column axis and 2 for row axis, the row with index value All is optional, so it doesn't matter whether it is there in the resulting dataframe or not (Later I thought it'd be better if it's there)

I already looked into following SO threads but I didn't find any of them much relevant to my problem.

One way I'm able to transform is by manually picking up the values and creating the required dataframe by hand like this:

df.index.names
# output: FrozenList(['ETHNIC'])
df.columns.names
#output: FrozenList([None, 'RACE'])
[y for x,y in df][:-1]
#output: ['BLACK OR AFRICAN AMERICAN', 'WHITE']
[x for x in df.index][:-1]
#output: ['HISPANIC OR LATINO', 'NOT HISPANIC OR LATINO']

EDIT:

Here is the data before creating pivot table:

data = pd.DataFrame.from_dict({'ETHNIC': {0: 'NOT HISPANIC OR LATINO', 1: 'NOT HISPANIC OR LATINO', 2: 'NOT HISPANIC OR LATINO', 3: 'NOT HISPANIC OR LATINO', 4: 'NOT HISPANIC OR LATINO', 5: 'NOT HISPANIC OR LATINO', 6: 'NOT HISPANIC OR LATINO', 7: 'NOT HISPANIC OR LATINO', 8: 'NOT HISPANIC OR LATINO', 9: 'HISPANIC OR LATINO', 10: 'HISPANIC OR LATINO', 11: 'HISPANIC OR LATINO', 12: 'HISPANIC OR LATINO', 13: 'HISPANIC OR LATINO', 14: 'NOT HISPANIC OR LATINO', 15: 'NOT HISPANIC OR LATINO', 16: 'NOT HISPANIC OR LATINO', 17: 'NOT HISPANIC OR LATINO', 18: 'NOT HISPANIC OR LATINO', 19: 'NOT HISPANIC OR LATINO', 20: 'NOT HISPANIC OR LATINO', 21: 'HISPANIC OR LATINO', 22: 'HISPANIC OR LATINO', 23: 'NOT HISPANIC OR LATINO', 24: 'NOT HISPANIC OR LATINO', 25: 'NOT HISPANIC OR LATINO', 26: 'HISPANIC OR LATINO', 27: 'HISPANIC OR LATINO', 28: 'HISPANIC OR LATINO', 29: 'HISPANIC OR LATINO', 30: 'HISPANIC OR LATINO', 31: 'HISPANIC OR LATINO', 32: 'NOT HISPANIC OR LATINO', 33: 'HISPANIC OR LATINO', 34: 'NOT HISPANIC OR LATINO', 35: 'NOT HISPANIC OR LATINO', 36: 'NOT HISPANIC OR LATINO', 37: 'NOT HISPANIC OR LATINO', 38: 'NOT HISPANIC OR LATINO', 39: 'NOT HISPANIC OR LATINO', 40: 'NOT HISPANIC OR LATINO', 41: 'NOT HISPANIC OR LATINO', 42: 'HISPANIC OR LATINO', 43: 'NOT HISPANIC OR LATINO', 44: 'NOT HISPANIC OR LATINO', 45: 'NOT HISPANIC OR LATINO', 46: 'HISPANIC OR LATINO', 47: 'HISPANIC OR LATINO', 48: 'HISPANIC OR LATINO', 49: 'HISPANIC OR LATINO', 50: 'NOT HISPANIC OR LATINO', 51: 'NOT HISPANIC OR LATINO', 52: 'NOT HISPANIC OR LATINO', 53: 'HISPANIC OR LATINO', 54: 'HISPANIC OR LATINO', 55: 'HISPANIC OR LATINO', 56: 'NOT HISPANIC OR LATINO', 57: 'HISPANIC OR LATINO', 58: 'HISPANIC OR LATINO', 59: 'NOT HISPANIC OR LATINO', 60: 'NOT HISPANIC OR LATINO', 61: 'HISPANIC OR LATINO', 62: 'HISPANIC OR LATINO', 63: 'HISPANIC OR LATINO', 64: 'HISPANIC OR LATINO', 65: 'NOT HISPANIC OR LATINO', 66: 'NOT HISPANIC OR LATINO', 67: 'NOT HISPANIC OR LATINO', 68: 'NOT HISPANIC OR LATINO', 69: 'HISPANIC OR LATINO', 70: 'NOT HISPANIC OR LATINO', 71: 'NOT HISPANIC OR LATINO', 72: 'HISPANIC OR LATINO', 73: 'HISPANIC OR LATINO', 74: 'HISPANIC OR LATINO', 75: 'NOT HISPANIC OR LATINO', 76: 'NOT HISPANIC OR LATINO', 77: 'NOT HISPANIC OR LATINO', 78: 'NOT HISPANIC OR LATINO', 79: 'NOT HISPANIC OR LATINO', 80: 'NOT HISPANIC OR LATINO', 81: 'NOT HISPANIC OR LATINO', 82: 'HISPANIC OR LATINO', 83: 'HISPANIC OR LATINO', 84: 'HISPANIC OR LATINO', 85: 'NOT HISPANIC OR LATINO', 86: 'HISPANIC OR LATINO', 87: 'HISPANIC OR LATINO', 88: 'HISPANIC OR LATINO', 89: 'NOT HISPANIC OR LATINO', 90: 'NOT HISPANIC OR LATINO', 91: 'NOT HISPANIC OR LATINO', 92: 'NOT HISPANIC OR LATINO', 93: 'NOT HISPANIC OR LATINO', 94: 'NOT HISPANIC OR LATINO', 95: 'HISPANIC OR LATINO', 96: 'HISPANIC OR LATINO', 97: 'HISPANIC OR LATINO', 98: 'NOT HISPANIC OR LATINO', 99: 'NOT HISPANIC OR LATINO', 100: 'NOT HISPANIC OR LATINO', 101: 'NOT HISPANIC OR LATINO', 102: 'NOT HISPANIC OR LATINO', 103: 'NOT HISPANIC OR LATINO', 104: 'NOT HISPANIC OR LATINO', 105: 'NOT HISPANIC OR LATINO', 106: 'NOT HISPANIC OR LATINO', 107: 'NOT HISPANIC OR LATINO', 108: 'NOT HISPANIC OR LATINO', 109: 'HISPANIC OR LATINO', 110: 'HISPANIC OR LATINO', 111: 'NOT HISPANIC OR LATINO', 112: 'NOT HISPANIC OR LATINO', 113: 'NOT HISPANIC OR LATINO', 114: 'NOT HISPANIC OR LATINO', 115: 'HISPANIC OR LATINO', 116: 'HISPANIC OR LATINO', 117: 'NOT HISPANIC OR LATINO', 118: 'HISPANIC OR LATINO', 119: 'HISPANIC OR LATINO', 120: 'NOT HISPANIC OR LATINO', 121: 'HISPANIC OR LATINO', 122: 'HISPANIC OR LATINO', 123: 'HISPANIC OR LATINO', 124: 'HISPANIC OR LATINO', 125: 'HISPANIC OR LATINO', 126: 'NOT HISPANIC OR LATINO', 127: 'NOT HISPANIC OR LATINO', 128: 'NOT HISPANIC OR LATINO', 129: 'NOT HISPANIC OR LATINO', 130: 'NOT HISPANIC OR LATINO', 131: 'NOT HISPANIC OR LATINO', 132: 'NOT HISPANIC OR LATINO', 133: 'NOT HISPANIC OR LATINO', 134: 'NOT HISPANIC OR LATINO', 135: 'NOT HISPANIC OR LATINO', 136: 'NOT HISPANIC OR LATINO', 137: 'NOT HISPANIC OR LATINO'}, 'RACE': {0: 'WHITE', 1: 'WHITE', 2: 'WHITE', 3: 'WHITE', 4: 'WHITE', 5: 'WHITE', 6: 'WHITE', 7: 'WHITE', 8: 'WHITE', 9: 'BLACK OR AFRICAN AMERICAN', 10: 'BLACK OR AFRICAN AMERICAN', 11: 'BLACK OR AFRICAN AMERICAN', 12: 'BLACK OR AFRICAN AMERICAN', 13: 'BLACK OR AFRICAN AMERICAN', 14: 'WHITE', 15: 'WHITE', 16: 'WHITE', 17: 'WHITE', 18: 'WHITE', 19: 'WHITE', 20: 'BLACK OR AFRICAN AMERICAN', 21: 'WHITE', 22: 'WHITE', 23: 'WHITE', 24: 'BLACK OR AFRICAN AMERICAN', 25: 'BLACK OR AFRICAN AMERICAN', 26: 'WHITE', 27: 'WHITE', 28: 'WHITE', 29: 'WHITE', 30: 'WHITE', 31: 'WHITE', 32: 'WHITE', 33: 'WHITE', 34: 'WHITE', 35: 'WHITE', 36: 'WHITE', 37: 'WHITE', 38: 'WHITE', 39: 'WHITE', 40: 'BLACK OR AFRICAN AMERICAN', 41: 'BLACK OR AFRICAN AMERICAN', 42: 'WHITE', 43: 'WHITE', 44: 'WHITE', 45: 'WHITE', 46: 'WHITE', 47: 'WHITE', 48: 'WHITE', 49: 'WHITE', 50: 'WHITE', 51: 'BLACK OR AFRICAN AMERICAN', 52: 'BLACK OR AFRICAN AMERICAN', 53: 'WHITE', 54: 'WHITE', 55: 'WHITE', 56: 'WHITE', 57: 'WHITE', 58: 'WHITE', 59: 'WHITE', 60: 'WHITE', 61: 'WHITE', 62: 'WHITE', 63: 'WHITE', 64: 'WHITE', 65: 'BLACK OR AFRICAN AMERICAN', 66: 'BLACK OR AFRICAN AMERICAN', 67: 'BLACK OR AFRICAN AMERICAN', 68: 'BLACK OR AFRICAN AMERICAN', 69: 'WHITE', 70: 'WHITE', 71: 'WHITE', 72: 'WHITE', 73: 'WHITE', 74: 'BLACK OR AFRICAN AMERICAN', 75: 'WHITE', 76: 'WHITE', 77: 'WHITE', 78: 'WHITE', 79: 'WHITE', 80: 'BLACK OR AFRICAN AMERICAN', 81: 'BLACK OR AFRICAN AMERICAN', 82: 'BLACK OR AFRICAN AMERICAN', 83: 'BLACK OR AFRICAN AMERICAN', 84: 'BLACK OR AFRICAN AMERICAN', 85: 'BLACK OR AFRICAN AMERICAN', 86: 'WHITE', 87: 'WHITE', 88: 'WHITE', 89: 'WHITE', 90: 'WHITE', 91: 'WHITE', 92: 'WHITE', 93: 'WHITE', 94: 'WHITE', 95: 'WHITE', 96: 'WHITE', 97: 'WHITE', 98: 'WHITE', 99: 'WHITE', 100: 'WHITE', 101: 'WHITE', 102: 'WHITE', 103: 'WHITE', 104: 'WHITE', 105: 'WHITE', 106: 'WHITE', 107: 'WHITE', 108: 'BLACK OR AFRICAN AMERICAN', 109: 'WHITE', 110: 'WHITE', 111: 'WHITE', 112: 'WHITE', 113: 'WHITE', 114: 'WHITE', 115: 'BLACK OR AFRICAN AMERICAN', 116: 'BLACK OR AFRICAN AMERICAN', 117: 'WHITE', 118: 'WHITE', 119: 'WHITE', 120: 'WHITE', 121: 'WHITE', 122: 'WHITE', 123: 'WHITE', 124: 'WHITE', 125: 'WHITE', 126: 'WHITE', 127: 'WHITE', 128: 'WHITE', 129: 'WHITE', 130: 'WHITE', 131: 'WHITE', 132: 'WHITE', 133: 'WHITE', 134: 'WHITE', 135: 'WHITE', 136: 'WHITE', 137: 'WHITE'}})

And here is the code for pivoting:

df = (data.groupby(['ETHNIC', 'RACE'])
      .size()
      .to_frame('counts')
      .reset_index(level=['ETHNIC', 'RACE'])
      .pivot_table(index='ETHNIC', columns='RACE', aggfunc='sum', margins=True, dropna=False)
      )

PS: Please be noted that the order of the rows in the expected dataframe matters.

UPDATE:

As suggested in comment, I tried with pd.crosstab, and found that it is almost 2X slower to create the same aggregate df that I'm creating using pd.pivot_table (tested on a dataframe with 200K rows)

ThePyGuy
  • 17,779
  • 5
  • 18
  • 45
  • 1
    @AnuragDabas, the order is not preseved, order of the values matter here, and also `level` column is not present. – ThePyGuy Jun 11 '21 at 03:04
  • @HenryEcker, it's not un-aggregation actually, it's just the transformation of aggregated data. And the transformed data is also some sort of aggregated data. – ThePyGuy Jun 11 '21 at 03:37
  • 2
    As a aside you can do `df = pd.crosstab(data['ETHNIC'], data['RACE'], margins=True)` to get to your current aggregate more simply. – Henry Ecker Jun 11 '21 at 03:43
  • 1
    @Don'tAccept I got a solution but that is not in order...i.e order is messed – Anurag Dabas Jun 11 '21 at 04:02

3 Answers3

1

One way and the only way I can come up with right now is this:

df_flat = pd.crosstab(data['ETHNIC'], data['RACE'])
l = []
for n,g in df_flat.stack().groupby(level=0):
    l.append(g.sum(level=0).rename('count').to_frame().assign(level=2, varname=g.index.names[0]))
    l.append(g.droplevel(level=0).rename('count').to_frame().assign(level=1, varname=g.index.names[1]))
df_out = pd.concat(l).reset_index()
df_out

Output:

                       index  count  level varname
0         HISPANIC OR LATINO     52      2  ETHNIC
1  BLACK OR AFRICAN AMERICAN     11      1    RACE
2                      WHITE     41      1    RACE
3     NOT HISPANIC OR LATINO     86      2  ETHNIC
4  BLACK OR AFRICAN AMERICAN     15      1    RACE
5                      WHITE     71      1    RACE

We could be able to fetch those level names ETHNIC and RACE from the multiIndex too.

Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • 1
    @Don'tAccept Yeah, thinking though this you are are flattening a multilevel index into separate rows per level. Not an easy 'transformation'. However, maybe I am missing something. – Scott Boston Jun 11 '21 at 04:25
  • I also got stuck due to multi level, I'm not used to it so often. – ThePyGuy Jun 11 '21 at 04:26
1

I managed to get it to work by using a helper function (with the raw data):

def agg_data(g):
    df_race = (
        g.groupby('RACE').size().to_frame('count')
        .rename_axis(index='value').reset_index()
        .assign(level=1, varName='RACE')
        [['level', 'varName', 'value', 'count']]
    )
    
    df_ethnic = (
        pd.DataFrame([[2, 'ETHNIC', g.ETHNIC.iloc[0], len(g)]], columns=df_race.columns)
    )
    
    return pd.concat([df_ethnic, df_race])

df.groupby(['ETHNIC']).apply(agg_data).reset_index(drop=True)


    level   varName value                       count
0   2       ETHNIC  HISPANIC OR LATINO          52
1   1       RACE    BLACK OR AFRICAN AMERICAN   11
2   1       RACE    WHITE                       41
3   2       ETHNIC  NOT HISPANIC OR LATINO      86
4   1       RACE    BLACK OR AFRICAN AMERICAN   15
5   1       RACE    WHITE                       71
Allen Qin
  • 19,507
  • 8
  • 51
  • 67
1

The way I thought of this was:

df = (
    df.iloc[:-1][['All', *df.columns.difference(['All'])]]
        .stack()
        .reset_index(name='count')
        .rename(columns={'RACE': 'value', 'ETHNIC': 'varName'})
)

m = df['varName'].ne(df['varName'].shift())
df['value'] = np.where(m, df['varName'], df['value'])
df['varName'] = np.where(m, 'ETHNIC', 'RACE')
df['level'] = m + 1

df = df[['level', 'varName', 'value', 'count']]

df:

   level varName                      value  count
0      2  ETHNIC         HISPANIC OR LATINO     52
1      1    RACE  BLACK OR AFRICAN AMERICAN     11
2      1    RACE                      WHITE     41
3      2  ETHNIC     NOT HISPANIC OR LATINO     86
4      1    RACE  BLACK OR AFRICAN AMERICAN     15
5      1    RACE                      WHITE     71

First strip off bottom margin and re-order columns:

df.iloc[:-1][['All', *df.columns.difference(['All'])]]
RACE                    All  BLACK OR AFRICAN AMERICAN  WHITE
ETHNIC                                                       
HISPANIC OR LATINO       52                         11     41
NOT HISPANIC OR LATINO   86                         15     71

Then stack and rename:

df.iloc[:-1][['All', *df.columns.difference(['All'])]]
        .stack()
        .reset_index(name='count')
        .rename(columns={'RACE': 'value', 'ETHNIC': 'varName'})
                  varName                      value  count
0      HISPANIC OR LATINO                        All     52
1      HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     11
2      HISPANIC OR LATINO                      WHITE     41
3  NOT HISPANIC OR LATINO                        All     86
4  NOT HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     15
5  NOT HISPANIC OR LATINO                      WHITE     71

Then the rest based on a Boolean Index of varName:

m = df['varName'].ne(df['varName'].shift())
0     True
1    False
2    False
3     True
4    False
5    False
Name: varName, dtype: bool

Move varName over to value:

df['value'] = np.where(m, df['varName'], df['value'])
                  varName                      value  count
0      HISPANIC OR LATINO         HISPANIC OR LATINO     52
1      HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     11
2      HISPANIC OR LATINO                      WHITE     41
3  NOT HISPANIC OR LATINO     NOT HISPANIC OR LATINO     86
4  NOT HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     15
5  NOT HISPANIC OR LATINO                      WHITE     71

Assign level and varName based on m:

df['varName'] = np.where(m, 'ETHNIC', 'RACE')
df['level'] = m + 1
  varName                      value  count  level
0  ETHNIC         HISPANIC OR LATINO     52      2
1    RACE  BLACK OR AFRICAN AMERICAN     11      1
2    RACE                      WHITE     41      1
3  ETHNIC     NOT HISPANIC OR LATINO     86      2
4    RACE  BLACK OR AFRICAN AMERICAN     15      1
5    RACE                      WHITE     71      1

Lastly, re-order columns:

df = df[['level', 'varName', 'value', 'count']]
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • 1
    Actually I was away from my keyboard so I just got to sit down to take a crack at it properly. But it was a fun project to work through. – Henry Ecker Jun 11 '21 at 04:49