
When using pd.read_csv('myfile.csv', delimiter=';') on a CSV with duplicated column names, pandas mangles the duplicated names by appending .1, .2, ..., .# (where # counts the repeated occurrences of the column).

My example csv looks like this:

Data1;Data2;A;B;B;C;C
abc;;text1;text2;text3;text4;text5
def;456;text2;text4;text3;text5;text1

which, read as a table, represents:

Data1 Data2      A      B      B      C      C
abc     NaN  text1  text2  text3  text4  text5
def     456  text2  text4  text3  text5  text1

After import to dataframe, the duplicated columns get mangled:

  Data1  Data2      A      B    B.1      C    C.1
0   abc    NaN  text1  text2  text3  text4  text5
1   def  456.0  text2  text4  text3  text5  text1

This output is expected.
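
For reference, a minimal reproduction (assuming the file is saved as example.csv):

import pandas as pd

# duplicated headers are de-duplicated on import (mangle_dupe_cols defaults to True)
df = pd.read_csv('example.csv', delimiter=';')
print(df.columns.tolist())
# ['Data1', 'Data2', 'A', 'B', 'B.1', 'C', 'C.1']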

But I wish to combine these duplicated columns, joining their row values into comma-separated strings.

So the desired output would look like this (order of columns is not important):

Data1 Data2      A            B            C
abc     NaN  text1  text2,text3  text4,text5
def     456  text2  text4,text3  text5,text1

How can I achieve that with pandas in python?

I found the following question when searching for the problem:

Concatenate cells into a string with separator pandas python

But I don't know how to apply the answer from that question to only those columns which are mangled.

HackXIt
  • I've adjusted the original question to include numbers in the given dataset. The duplicated columns can always be seen as strings, since they'll end up as strings eventually. – HackXIt Sep 01 '21 at 10:12

3 Answers


You can rename your columns, stack, join, unstack:

import pandas as pd

df = pd.read_csv('filename.csv', sep=';')

# strip the '.1', '.2', ... suffixes from the mangled columns
df.columns = df.columns.map(lambda x: x.split('.')[0])

# reshaping
(df.set_index(['Data1', 'Data2']) # set those columns aside
   .stack()                       # columns to rows
   .groupby(level=[0,1,2])        # group by all
   .apply(','.join)               # join duplicates
   .unstack()                     # A/B/C back to columns
)

output:

                 A            B            C
Data1 Data2                                 
abc   def    text1  text2,text3  text4,text5
asd   fgh    text2  text4,text3  text5,text1
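
A more generic variant of the same idea (a sketch, not part of the original answer, assuming pandas >= 1.1): set aside every non-duplicated column automatically instead of naming each one, and pass dropna=False to groupby so rows with NaN keys (such as the empty Data2 cell) are kept:

import pandas as pd

df = pd.read_csv('filename.csv', sep=';')
df.columns = df.columns.map(lambda x: x.split('.')[0])

# column names occurring more than once hold the values to be joined;
# every remaining column is set aside as part of the index
dup_names = set(df.columns[df.columns.duplicated()])
index_cols = [c for c in df.columns if c not in dup_names]

res = (df.set_index(index_cols)
         .astype(str)   # make numeric cells joinable (a NaN here would become 'nan')
         .stack()
         .groupby(level=list(range(len(index_cols) + 1)), dropna=False)
         .apply(','.join)
         .unstack()
         .reset_index())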
mozway
  • That looks to be working, could you explain what you're doing here? – HackXIt Sep 01 '21 at 09:28
  • @HackXIt I commented the code, let me know if you have questions – mozway Sep 01 '21 at 09:36
  • Thanks for the comments. Is there a way to set aside columns which aren't duplicated? Inserting each column manually that isn't duplicated seems cumbersome with larger datasets. – HackXIt Sep 01 '21 at 09:43
  • Is the set_index even necessary? Column A isn't duplicated and left untouched I just noticed. – HackXIt Sep 01 '21 at 09:45
  • By the way, if there's more than 3 rows, using `groupby(level=list(range(len(df.index))))` instead works more generically. There might be a more elegant way to do this though. – HackXIt Sep 01 '21 at 09:54
  • Your solution drops rows with NaN values. Is there a way to consider `NaN` and numbers? – HackXIt Sep 01 '21 at 10:19

Taking inspiration from this SO thread:

import pandas as pd

df = pd.read_csv(r'./example.csv', delimiter=';')

# join the non-null values of a row into one ';'-separated string
def sjoin(x):
    return ';'.join(x[x.notnull()].astype(str))

# group the columns by their base name (the part before the '.'),
# then collapse each group row by row
df = df.groupby(lambda col: col.split('.')[0], axis=1).apply(lambda x: x.apply(sjoin, axis=1))

Which results in:

       A            B            C Data1 Data2
0  text1  text2;text3  text4;text5   abc   def
1  text2  text4;text3  text5;text1   asd   fgh
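
The row-wise apply gets expensive on large frames (see the comments below). A rough sketch for timing the two approaches yourself, assuming the example CSV is saved as example.csv and a pandas version where groupby(axis=1) is still available (numbers will vary by machine and pandas version):

import pandas as pd
from timeit import timeit

df = pd.read_csv('example.csv', delimiter=';')

# blow the two example rows up to 60,000 rows; make the keys unique
# so the stack-based approach does not merge repeated rows
big = pd.concat([df] * 30000, ignore_index=True)
big['Data1'] = big['Data1'] + big.index.astype(str)

def stack_approach(d):
    d = d.copy()
    d.columns = d.columns.map(lambda x: x.split('.')[0])
    # note: as discussed in the comments, rows whose index keys are NaN are dropped here
    return (d.set_index(['Data1', 'Data2'])
             .stack()
             .groupby(level=[0, 1, 2])
             .apply(','.join)
             .unstack())

def rowwise_apply_approach(d):
    def sjoin(x):
        return ';'.join(x[x.notnull()].astype(str))
    return d.groupby(lambda col: col.split('.')[0], axis=1).apply(
        lambda x: x.apply(sjoin, axis=1))

print(timeit(lambda: stack_approach(big), number=1))
print(timeit(lambda: rowwise_apply_approach(big), number=1))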
ChrisOram
  • I thought of that for a moment, but this involves a loop within a loop, probably not efficient on large datasets – mozway Sep 01 '21 at 09:37
  • Indeed, the decision on which method to use is dependent on the size of the dataset & how much you need to prioritize efficiency. – ChrisOram Sep 01 '21 at 09:40
  • For those interested, this method is 300 times slower than @mozway for the example csv given repeated for 60,000 rows – ChrisOram Sep 01 '21 at 09:50

Here's another solution, grouping the mangled columns with a regular expression:

# (.*?): capture the original column name at the start of the string,
# (?:\.\d+)?$: optionally followed by a dot and at least one digit at the end
shared_groupname = r"(.*?)(?:\.\d+)?$"

Let's see that in action:

>>> df.columns.str.extract(shared_groupname) 
       0
0  Data1
1  Data2
2      A
3      B
4      B
5      C
6      C

Then groupby this original column name, and apply the join:

grouped = df.groupby(df.columns.str.extract(shared_groupname, expand=False), axis=1)

res = grouped.apply(lambda x: x.dropna().astype(str).apply(', '.join, axis=1))
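
Note that DataFrame.groupby(..., axis=1) is deprecated in recent pandas. Under that assumption, a sketch of the same grouping done on the transpose instead:

import pandas as pd

df = pd.read_csv('example.csv', delimiter=';')

# map each (possibly mangled) column name back to its base name
base = df.columns.str.extract(r"(.*?)(?:\.\d+)?$", expand=False)

# transpose so columns become rows, group those rows by base name,
# join the non-null values per original row, then transpose back
# (an all-NaN cell, like the empty Data2, ends up as an empty string)
res = (df.T
         .groupby(base)
         .agg(lambda col: ', '.join(col.dropna().astype(str)))
         .T)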
w-m
  • This throws `TypeError: sequence item 0: expected str instance, float found` if the dataset contains something else than strings. _(floats for example)_ Works for string-only datasets though. – HackXIt Sep 01 '21 at 10:02
  • Ah. That's fixed with an additional `.dropna().astype(str)` in the apply, from the answer you cite above. – w-m Sep 01 '21 at 10:06
  • I've adjusted the original question to include numbers. – HackXIt Sep 01 '21 at 10:10
  • I'm getting `ValueError: No axis named 1 for object type Series` when applied in the larger dataset. I'm unsure where this comes from. – HackXIt Sep 01 '21 at 10:14