I have a CSV file like the one below, but with 1,300 columns:
id | Date | A | A | B | B | B | B | C | C | C |
---|---|---|---|---|---|---|---|---|---|---|
A | 2009 | Y | Y | |||||||
B | 2009 | Y | Y | Y | Y | |||||
C | 2011 | Y | Y | Y | Y | |||||
D | 2013 | Y | Y | Y | ||||||
E | 2015 | Y | Y | Y | ||||||
F | 2017 | Y | Y | Y |
I want to combine the values of all columns that share the same column name, so the result looks like this:
id | Date | A | B | C |
---|---|---|---|---|
A | 2009 | Y | Y | |
B | 2009 | Y,Y | Y,Y,Y | |
C | 2011 | Y,Y | Y,Y | |
D | 2013 | Y | Y,Y | |
E | 2015 | Y | Y,Y | |
F | 2017 | Y | Y,Y |
I've been searching for how to combine columns with the same name in Excel, R, and Python, but most of the examples either sum numerical values instead of concatenating strings, or deal with duplicate rows instead of duplicate columns.
I've tried the following code that I found on Stack Overflow:
pd.concat(x for _, x in df.groupby(df.columns.duplicated(), axis=1))
But for some reason, I get this output:
id | Date | A | A.1 | B | B.1 | B.2 | B.3 | C | C.1 | C.2 |
---|---|---|---|---|---|---|---|---|---|---|
A | 2009 | Y | Y | |||||||
B | 2009 | Y | Y | Y | Y | |||||
C | 2011 | Y | Y | Y | Y | |||||
D | 2013 | Y | Y | Y | ||||||
E | 2015 | Y | Y | Y | ||||||
F | 2017 | Y | Y | Y |
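The `.1`, `.2` suffixes in this output match how `pd.read_csv` renames repeated header names when it loads a file. The sketch below reproduces that on a small made-up sample; it assumes the file was loaded with `pd.read_csv` (the question does not say so), and `csv_text` is hypothetical data, not the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample with repeated header names (not the real file).
csv_text = "id,Date,A,A,B,B\nA,2009,Y,,Y,\nB,2009,Y,Y,Y,Y\n"

df = pd.read_csv(io.StringIO(csv_text))

# read_csv makes repeated header names unique by appending ".1", ".2", ...
columns = list(df.columns)
print(columns)

# After that renaming, no column name counts as a duplicate any more.
has_duplicates = bool(df.columns.duplicated().any())
print(has_duplicates)
```

If that assumption holds, `df.columns.duplicated()` is `False` for every column, so grouping on it has nothing to merge.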
I feel like this should be easy, but I can't seem to find a solution.
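One possible approach, as a minimal sketch rather than a definitive answer: strip the numeric suffixes back off the column names, then join the non-empty values of each group of same-named columns row by row. It assumes the data is loaded with `pd.read_csv`, that blank cells are read as missing values, and that no genuine column name ends in a dot followed by digits. `csv_text`, `base_names`, `join_non_empty`, and `combined` are illustrative names, and the sample data is made up.

```python
import io

import pandas as pd

# Hypothetical sample with repeated header names (not the real file).
csv_text = (
    "id,Date,A,A,B,B,B\n"
    "A,2009,Y,,Y,,\n"
    "B,2009,Y,Y,Y,Y,Y\n"
)

# dtype=str keeps every cell as text; blank cells are read as missing values.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Remove the ".1", ".2", ... suffixes that read_csv added to repeated names.
# Assumption: no real column name ends in ".<digits>".
base_names = df.columns.str.replace(r"\.\d+$", "", regex=True)


def join_non_empty(row):
    """Join the non-missing values of one row with commas."""
    return ",".join(row.dropna())


# Build one output column per original name, keeping the original order.
combined = pd.DataFrame(
    {
        name: df.loc[:, base_names == name].apply(join_non_empty, axis=1)
        for name in base_names.unique()
    }
)

print(combined)
```

Columns that appear only once, such as `id` and `Date`, pass through unchanged because joining a single value returns that value. Rows where every column in a group is blank end up with an empty string.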