Pandas: shifting columns in grouped dataframe if NaN

Question

I have a grouped dataframe like so:

          │          product          │          category           
          │   spot1   spot2    spot3  │   spot1    spot2    spot3 
──────────┼───────────────────────────┼─────────────────────────────
 basket 1 │   NaN     apple    banana │    NaN     fruits   fruits
 basket 2 │   almond  carrot   NaN    │    nuts    veggies  NaN

One row represents a "basket" containing different food products (vegtables, fruits, nuts).

Each basket has 3 spots that can either contain a food product or not (=NaN).

I would like the first column of group product to be as populated as possible. That means if there is a NaN value in the first column of the product group and some value in the 2nd or n-th column if should shift to the left for each group.

Categories are related: in the example above a baskets' spot1 of group product and spot1 of group category belong together. Every data combination must have a value for product. If product is NaN then all the related items will be NaN as well.

The output should look something like:

          │          product          │          category           
          │   spot1   spot2    spot3  │   spot1    spot2    spot3 
──────────┼───────────────────────────┼─────────────────────────────
 basket 1 │   apple   banana   NaN    │    fruits   fruits   NaN  <-- this row shifted to left to "fill" first spot of product group
 basket 2 │   almond  carrot   NaN    │    nuts    veggies  NaN

jezrael's answer here was a good starting point for me:

#for each row remove NaNs and create new Series - rows in final df 
df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
#if possible different number of columns like original df is necessary reindex
df1 = df1.reindex(columns=range(len(df.columns)))
#assign original columns names
df1.columns = df.columns
print (df1)

However, this solution ignores grouping. I only want values to shift left based on the specific group product.

edit / minimal reproducible example

please use this code to get to the "starting point" of problem. The way I get to this point in my production code is more complex but this should do fine.

# Import pandas library
import pandas as pd
  
# initialize list of lists
data = [[1, 'NaN','NaN'], [1, 'apple','fruits'], [1,'banana', 'fruits'], [2, 'carrot','veggies'], [2, 'almond','nuts']]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['basket','product', 'category'])
  
# print dataframe.
df

dfg = df.groupby(['basket', df.groupby(['basket']).cumcount() + 1]).first().unstack().reset_index()

print(dfg)

hello please add minimal reproductible example https://stackoverflow.com/help/minimal-reproducible-example that would be helpful — grymlin, Sep 22 '22 at 11:39

ouroboros1 · Accepted Answer · 2022-09-23T09:57:41.923

I trust there is an easier way to accomplish this, but the following should work.

Setup

import pandas as pd
import numpy as np

# make sure that `NaNs` are *actual* `NaNs`, e.g. not "NaN" (this is a string 
# like any other); or use `[1, None, None]`
data = [[1, np.nan,np.nan], 
        [1, 'apple','fruits'], 
        [1,'banana', 'fruits'], 
        [2, 'carrot','veggies'], 
        [2, 'almond','nuts']]
  
df = pd.DataFrame(data, columns=['basket','product', 'category'])
dfg = df.groupby(['basket', df.groupby(['basket']).cumcount() + 
                  1]).first().unstack().reset_index()

print(dfg)

  basket product                 category                
               1       2       3        1       2       3
0      1    None   apple  banana     None  fruits  fruits
1      2  carrot  almond     NaN  veggies    nuts     NaN

Code

dfg.set_index('basket',drop=True, inplace=True)
out = dfg.unstack().dropna()

out.index = pd.MultiIndex.from_arrays([
    out.index.get_level_values(0),
    out.groupby(level=[2,0]).cumcount().add(1).to_numpy(),
    out.index.get_level_values(2)])

out = out.reset_index(drop=False).pivot(index='basket', 
                                        columns=['level_0','level_1'],
                                        values=0)\
    .reindex(dfg.columns, axis='columns').reset_index(drop=False)

print(out)

  basket product             category            
               1       2   3        1       2   3
0      1   apple  banana NaN   fruits  fruits NaN
1      2  carrot  almond NaN  veggies    nuts NaN

Explanation

First, we use df.unstack() with Series.dropna to get a Series with a MultiIndex that consists of col level 0, col level 1, index. I.e.:

out = df.unstack().dropna()
print(out.head(4))

            basket
product  1  2         carrot
         2  1          apple
            2         almond
         3  1         banana

Next, we use df.groupby on levels 0,2 (i.e. original col level 0 and basket), and we use cumcount to get consecutive numbers for the items in each group (adding 1: add(1). I.e. we are doing:

print(out.groupby(level=[2,0]).cumcount().add(1).to_numpy())

[1 1 2 2 1 1 2 2]

We use this result inside pd.MultiIndex.from_arrays to overwrite the MultiIndex (specifically level 1) with a new index. I.e. we now have:

print(out.head(4))

            basket
product  1  2         carrot
            1          apple
         2  2         almond
            1         banana

Now, finally, we can reset the index and use df.pivot to change the shape of out so that it matches the shape of the original df. Chaining df.reindex applied to the columns will both reset the order of the columns and add all the missing columns (e.g. 3 at col level 1 for both values in col level 0, and they will be automatically filled with NaNs.

+1 for the explanations. your code works with your MRE, but if I use it on the one provided by me it results in all NaNs... — linus, Sep 23 '22 at 08:57
Edited the answer to make it work on your sample. Let me know if you still experience problems. — ouroboros1, Sep 23 '22 at 09:58
one more question @ouroboros1. It seems like your solution is tailored to my MRE with exactly 3 "spots". Lets say I have n possible spots per group, how could that be handled? — linus, Sep 23 '22 at 12:14
Adding more columns shouldn't matter. Do make sure that `level 1` of your columns contains *only* consecutive digits. Check: `print(dfg.columns.get_level_values(1))`. For this example: `Index(['', 1, 2, 3, 1, 2, 3], dtype='object')`. The first empty string is OK, that's `basket`, but all the other ones should be consecutive digits per main category in `col level 0` (i.e. `product`, `category`). I.e. what we are trying to do, is change `[1, 2, 2, 3, 1, 2, 2, 3]` (idx level 1 for `out`) into `[1, 1, 2, 2, 1, 1, 2, 2]`. So these last values should *always* be a subset of the previous set. — ouroboros1, Sep 23 '22 at 12:43

Pandas: shifting columns in grouped dataframe if NaN

edit / minimal reproducible example

1 Answers1