2

I have this dataframe (with all 50 states and more categories, but if we can get this to work, I can apply it to the whole dataset):

                   US     US_bp   US_bp%       AL   AL_bp  AL_bp%
total_pop   324173084  41393176     12.8  4920613  794326    16.1
white       198511109  19510415      9.8  3218517  378269    11.8
black        38526055   8402643     21.8  1260356  327284    26.0

And need this:

    place  total_pop        bp    bp%  white_pop  white_bp  white_bp%  black_pop  black_bp  black_bp%
0      US  324173084  41393176   12.8  198511109  19510415        9.8    8402643    840263       21.4
1      AL    4920613    794326   16.1    3218517    378269       11.8    1260356    327284       26.0

How can I do this with Pandas? I tried melt and pivot, but can't wrap my head around what might work.

neendoniss
  • 23
  • 4
  • Welcome to Stack Overflow! Check out the [tour]. Writing a more specific title might help get more eyes on this. Check out [How to ask a good question](/help/how-to-ask). – wjandrea Jul 17 '23 at 20:56

3 Answers3

3

Here's a partial solution: Convert the columns into a MultiIndex, then stack and unstack.

I'm using a regex for the first step: Extract two uppercase characters optionally followed by an underscore plus the other part.

col_pairs = df.columns.str.extract(r'^([A-Z]{2})(?:_(.*))?$').fillna('pop')
df.columns = pd.MultiIndex.from_arrays(col_pairs.T.values)

(df
    .stack(level=0)
    .unstack(level=0)
    .swaplevel(axis=1)
    # Fix order
    .reindex(df.index, axis=1, level=0)
    .reindex(df.columns.get_level_values(1).unique(), axis=1, level=1)
    .reindex(df.columns.get_level_values(0).unique())
    )
   total_pop                      white                     black               
         pop        bp   bp%        pop        bp   bp%       pop       bp   bp%
US 324173084  41393176  12.8  198511109  19510415   9.8  38526055  8402643  21.8
AL   4920613    794326  16.1    3218517    378269  11.8   1260356   327284  26.0

I'd prefer if stacking/unstacking didn't sort, but at least you can fix that by reindexing. I'm also not bothering to get the labels exactly as you want them, but most of that's easy.

Due credit to sammywemmy for the stack/unstack technique.

wjandrea
  • 28,235
  • 9
  • 60
  • 81
1

Another solution, using pd.wide_to_long:

stubnames = sorted(set(c.split('_')[0] for c in df.columns))
df.columns = (c if '_' in c else c + '_total_pop' for c in df.columns)
df = df.rename(index={'total_pop': ''})

df = pd.wide_to_long(df.reset_index(), stubnames, i='index', j='xxx', sep='_', suffix=r'.*').T
df.columns = (f'{a}_{b}'.strip('_') for a, b in df.columns)

print(df)

Prints:

      total_pop  white_total_pop  black_total_pop          bp    white_bp   black_bp   bp%  white_bp%  black_bp%
AL    4920613.0        3218517.0        1260356.0    794326.0    378269.0   327284.0  16.1       11.8       26.0
US  324173084.0      198511109.0       38526055.0  41393176.0  19510415.0  8402643.0  12.8        9.8       21.8
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
1
  • rename the columns to cater for those that dont have an ending with _:
df.columns = [f"{col}_pop" if "bp" not in col else col for col in df]
  • create a MultiIndex from the columns:
df.columns = df.columns.str.split("_", expand=True)
  • reshape the dataframe with a combination of stack, unstack, swaplevel:
df = df.stack(level=0).unstack(level=0).swaplevel(axis=1)
  • run a list comprehension to build the final column to your expected output:
df.columns = [first if last in first 
              else last if first == "total_pop" 
              else f"{first}_{last}" 
              for first, last in df]

df.index.name = 'place'

       black_bp        bp  white_bp  black_bp%   bp%  white_bp%  black_pop  total_pop  white_pop
place                                                                                           
AL       327284    794326    378269       26.0  16.1       11.8    1260356    4920613    3218517
US      8402643  41393176  19510415       21.8  12.8        9.8   38526055  324173084  198511109

sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • 1
    You can add `.sort_index(axis=1, level=0)` after `swaplevel` to get the order closer to what OP wants – wjandrea Jul 18 '23 at 00:06