How to merge dataframe rows based on empty cells in a column

Question

If I have the following dataframe:

Index Col1 Col2 Col3
1     10   x    40
2          y    50
3          z    60
4     20   a    30

I would like to merge rows that have a blank Col1 with the previous row that is not blank in Col1.

Expected output:

Index Col1 Col2  Col3
1     10   x,y,z 40,50,60
4     20   a     30

Is this possible?

Thanks

Umar.H · Accepted Answer · 2021-11-04T19:23:06.200

2

Quite possible, what you need to do is create a unique group value that increments at every value that is not null.

in one go

df.drop('Col1',axis=1).groupby((df['Col1'].isna()==False).cumsum()).agg(list)
#p.s if really want strings use
#df.drop('Col1',axis=1).groupby((df['Col1'].isna()==False
#                      ).cumsum()).astype(str).agg(','.join)


           Col2          Col3
Col1
1     [x, y, z]  [40, 50, 60]
2           [a]          [30]

the key here is the condition:

df[['Col1']].assign(con=df['Col1'].isna()==False)

   Col1    con #for condition
0  10.0   True <-- first group
1   NaN  False
2   NaN  False
3  20.0   True <-- second group

now, creating a cumulative sum allows you to create your grouper object.

df[['Col1']].assign(con=(df['Col1'].isna()==False).cumsum())


   Col1  con
0  10.0    1
1   NaN    1
2   NaN    1
3  20.0    2

edited Nov 04 '21 at 19:23

answered Nov 04 '21 at 19:06

Umar.H

22,559
7
39
74

This one give me the same error as the other answer: 'Reindexing only valid with uniquely valued Index objects' on my dataset and I cannot figure out why. My col1 is alphanumeric IDs if that makes a difference? Thank – Pearl Nov 05 '21 at 11:01
@Pearl looks like you have a non unique index try resetting it first `df = df.reset_index(drop=True)` – Umar.H Nov 05 '21 at 15:37
I had tried that. I also printed it back to a csv file and the index looks normal so I am unsure what is happening – Pearl Nov 05 '21 at 16:10

score 2 · Answer 2 · answered Nov 04 '21 at 19:32

2

We can do

out = df.drop(labels = 'Col1',axis = 1).astype(str).groupby(df['Col1'].mask(df['Col1']=='').ffill()).agg(','.join).reset_index()
Out[85]: 
   Col1   Col2      Col3
0  10.0  x,y,z  40,50,60
1  20.0      a        30

answered Nov 04 '21 at 19:32

BENY

317,841
20
164
234

I like this one but I get an error 'Reindexing only valid with uniquely valued Index objects' on my dataset and cannot figure out why. My col1 is alphanumeric IDs if that makes a difference? Thanks – Pearl Nov 05 '21 at 10:48

score 0 · Answer 3 · answered Nov 05 '21 at 20:17

The answers posted answered my 'dumbed down' dataset question but I could not get them to work in my situation with my real world dataset. I had posted another question prior to this one concerning fixing my problem while pulling the data in not manipulating the data once in and an answer was formed from that answering this question as well.

It is here

The answer is:

last_valid = None

        check_cols = ['Col1']  # if only need to check a subset of cols for validity, do it here
       

        df = df.astype(str) #convert all columns to strings as I have to combine numbers in the same cell

        df = df.replace('nan','') #get rid of the nan created back to a blank string

        for i, s in df.iterrows():  # This is slow, but probably necessary in this case
           

            """ If all the rows are valid, we want to keep it as a reference in case

            the following rows are not """

            if all(s[check_cols] != ''):

                lvi, last_valid = i, s

                # need to store index and series so we can go back and replace it

                continue

            else:  # here is the critical part

                extra_vals = s[s != '']  # find cells in row that have actual values

                for col in extra_vals.index:

                    """ I'm creating a list and appending here since I don't know

                    your values or how they need to be handled exactly"""

                    last_valid[col] = last_valid[col] + "," + extra_vals[col] #separate by whatever you wish, list was causing issues

                # replace that row in the dataframe

                df.iloc[lvi, :] = last_valid

 

        # drop extra rows:

        df = df[df['Col1'] != ''].reset_index(drop=True)

How to merge dataframe rows based on empty cells in a column

3 Answers3