
I am trying to group identical columns in a single dataframe, similar to this question: Grouping on identical column names in pandas

However, that answer is not working for me. When I apply the accepted answer from that question, the second occurrence of each duplicated column in my dataframe has '.1' appended to its name. My duplicated columns do not contain duplicated data, which may be the problem.

Here is my table:

Timepoint  Col1  Col2  Col3  Col1  Col2  Col3
    1        1     2     3
    2        4     5     6
    3                          7     8     9
    4                         10    11    12

I would like the table to look like this:

Timepoint  Col1  Col2  Col3
    1        1     2     3
    2        4     5     6
    3        7     8     9
    4       10    11    12

But the table looks like this when I apply the linked code:

Timepoint  Col1  Col2  Col3  Col1.1  Col2.1  Col3.1
    1        1     2     3
    2        4     5     6
    3                          7       8       9
    4                         10      11      12

My dataframe has hundreds of columns so I need a solution that doesn't specify the columns that need to be grouped.

Note that this is not a duplicate of this question: Shift NaNs to the end of their respective rows, because that question does not have duplicated column names and it shifts data to a differently labeled column.
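For reference, the '.1' suffixes are what pandas itself appends to duplicate column names when it reads them in, for example from a CSV file with repeated headers. A minimal sketch, assuming that is where the frame comes from:

import io

import pandas as pd

# Minimal reproduction (assumes the frame is read from a CSV with repeated headers):
csv_text = """Timepoint,Col1,Col2,Col3,Col1,Col2,Col3
1,1,2,3,,,
2,4,5,6,,,
3,,,,7,8,9
4,,,,10,11,12
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())
# ['Timepoint', 'Col1', 'Col2', 'Col3', 'Col1.1', 'Col2.1', 'Col3.1']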

DataScope

2 Answers


An example of a solution:

import io

import pandas as pd

data = """
Timepoint,Col1,Col2,Col3,Col1,Col2,Col3
1,1,2,3,,,
2,4,5,6,,,
3,,,,7,8,9
4,,,,10,11,12
"""
# read_csv mangles the duplicate headers to 'Col1.1', 'Col2.1', ...
df = pd.read_csv(io.StringIO(data), sep=',')
# strip the '.1' suffix so the duplicated columns share a name again
df.rename(columns=lambda x: x.split('.')[0], inplace=True)
print(df)

output:

   Timepoint  Col1  Col2  Col3  Col1  Col2  Col3
0          1   1.0   2.0   3.0   NaN   NaN   NaN
1          2   4.0   5.0   6.0   NaN   NaN   NaN
2          3   NaN   NaN   NaN   7.0   8.0   9.0
3          4   NaN   NaN   NaN  10.0  11.0  12.0

df = df.fillna(0)
print(df)

output:

   Timepoint  Col1  Col2  Col3  Col1  Col2  Col3
0          1   1.0   2.0   3.0   0.0   0.0   0.0
1          2   4.0   5.0   6.0   0.0   0.0   0.0
2          3   0.0   0.0   0.0   7.0   8.0   9.0
3          4   0.0   0.0   0.0  10.0  11.0  12.0

df = df.groupby(level=0, axis=1).sum()
print(df)

output:

   Col1  Col2  Col3  Timepoint
0   1.0   2.0   3.0        1.0
1   4.0   5.0   6.0        2.0
2   7.0   8.0   9.0        3.0
3  10.0  11.0  12.0        4.0
Frenchy
  • I can't use .sum() because there could be places where data has been duplicated. I tried df = df.groupby(level=0, axis=1).mean(), which is the solution in the original problem I linked to. Unfortunately that code isn't working for me. – DataScope Mar 20 '19 at 13:31
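As a side note (a sketch, not from either answer): if summing risks double-counting duplicated values, one option is to skip the fillna step and take the first non-missing value in each group of identically named columns; GroupBy.first ignores NaN. That also sidesteps a possible problem with mean(): once the NaNs have been replaced with 0, mean() averages those zeros in with the real values.

# A sketch, assuming df is the renamed frame from above with the NaNs still
# in place (i.e. before fillna(0)):
result = (df.set_index('Timepoint')
            .groupby(level=0, axis=1)
            .first()              # first non-NaN value in each duplicate-column group
            .reset_index())
print(result)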

Create an index from Timepoint with DataFrame.set_index, then use groupby with a lambda function that splits off the '.1' suffix, and aggregate with sum, max, mean or first. The aggregation method depends on the data, but if there are always missing values like in the sample data, the output is the same either way:

df = (df.set_index('Timepoint')
       .groupby(lambda x: x.split('.')[0], axis=1).sum()
       .reset_index())
print (df)
   Timepoint  Col1  Col2  Col3
0          1   1.0   2.0   3.0
1          2   4.0   5.0   6.0
2          3   7.0   8.0   9.0
3          4  10.0  11.0  12.0
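As an aside (not part of the answer above): in newer pandas releases the axis=1 argument to groupby is deprecated. Assuming a recent pandas, and starting again from the raw df with the '.1'-suffixed column names, a transpose-based sketch of the same idea:

out = (df.set_index('Timepoint')
         .T                                            # duplicate names become index labels
         .groupby(lambda x: x.split('.')[0]).first()   # first non-NaN value per group
         .T                                            # back to the original orientation
         .reset_index())
print(out)

This should give the same result as above, just without relying on groupby(axis=1).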
jezrael