
When using pd.read_csv('myfile.csv', delimiter=';') on a CSV with duplicated column names, pandas mangles the duplicated names by appending .1, .2, ..., .# (where # counts the repeated occurrences of the column).

My example csv looks like this:

Data1;Data2;A;B;B;C;C
abc;;text1;text2;text3;text4;text5
def;456;text2;text4;text3;text5;text1

which, read as a table, represents:

Data1 Data2      A      B      B      C      C
abc     NaN  text1  text2  text3  text4  text5
def     456  text2  text4  text3  text5  text1

After import to dataframe, the duplicated columns get mangled:

  Data1  Data2      A      B    B.1      C    C.1
0   abc    NaN  text1  text2  text3  text4  text5
1   def  456.0  text2  text4  text3  text5  text1

This output is expected.
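
For reference, a minimal reproduction (assuming the file is saved as example.csv):

import pandas as pd

# duplicated headers are de-duplicated on import (mangle_dupe_cols defaults to True)
df = pd.read_csv('example.csv', delimiter=';')
print(df.columns.tolist())
# ['Data1', 'Data2', 'A', 'B', 'B.1', 'C', 'C.1']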

But I wish to combine these duplicated columns, joining their row values into comma-separated strings.

So the desired output would look like this (order of columns is not important):

Data1 Data2      A            B            C
abc     NaN  text1  text2,text3  text4,text5
def     456  text2  text4,text3  text5,text1

How can I achieve that with pandas in python?

I found the following question when searching for the problem:

Concatenate cells into a string with separator pandas python

But I don't know how to apply the answer from that question to only those columns which are mangled.

HackXIt
  • I've adjusted the original question to include numbers in the given dataset. The duplicated columns can always be seen as strings, since they'll end up as strings eventually. – HackXIt Sep 01 '21 at 10:12

3 Answers


You can rename your columns, stack, join, unstack:

import pandas as pd

df = pd.read_csv('filename.csv', sep=';')

# strip the '.1', '.2', ... suffixes from the mangled columns
df.columns = df.columns.map(lambda x: x.split('.')[0])

# reshaping
(df.set_index(['Data1', 'Data2']) # set those columns aside
   .stack()                       # columns to rows
   .groupby(level=[0,1,2])        # group by all
   .apply(','.join)               # join duplicates
   .unstack()                     # A/B/C back to columns
)

output:

                 A            B            C
Data1 Data2                                 
abc   def    text1  text2,text3  text4,text5
asd   fgh    text2  text4,text3  text5,text1
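
A more generic variant of the same idea (a sketch, not part of the original answer, assuming pandas >= 1.1): set aside every non-duplicated column automatically instead of naming each one, and pass dropna=False to groupby so rows with NaN keys (such as the empty Data2 cell) are kept:

import pandas as pd

df = pd.read_csv('filename.csv', sep=';')
df.columns = df.columns.map(lambda x: x.split('.')[0])

# column names occurring more than once hold the values to be joined;
# every remaining column is set aside as part of the index
dup_names = set(df.columns[df.columns.duplicated()])
index_cols = [c for c in df.columns if c not in dup_names]

res = (df.set_index(index_cols)
         .astype(str)   # make numeric cells joinable (a NaN here would become 'nan')
         .stack()
         .groupby(level=list(range(len(index_cols) + 1)), dropna=False)
         .apply(','.join)
         .unstack()
         .reset_index())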
mozway
  • That looks to be working, could you explain what you're doing here? – HackXIt Sep 01 '21 at 09:28
  • @HackXIt I commented the code, let me know if you have questions – mozway Sep 01 '21 at 09:36
  • Thanks for the comments. Is there a way to set aside columns which aren't duplicated? Inserting each column manually that isn't duplicated seems cumbersome with larger datasets. – HackXIt Sep 01 '21 at 09:43
  • Is the set_index even necessary? Column A isn't duplicated and left untouched I just noticed. – HackXIt Sep 01 '21 at 09:45
  • By the way, if there's more than 3 rows, using `groupby(level=list(range(len(df.index))))` instead works more generically. There might be a more elegant way to do this though. – HackXIt Sep 01 '21 at 09:54
  • Your solution drops rows with NaN values. Is there a way to consider `NaN` and numbers? – HackXIt Sep 01 '21 at 10:19

Taking inspiration from this SO thread:

import pandas as pd

df = pd.read_csv(r'./example.csv', delimiter=';')

# join the non-null values of a row into one ';'-separated string
def sjoin(x):
    return ';'.join(x[x.notnull()].astype(str))

# group the columns by their base name (the part before the '.'),
# then collapse each group row by row
df = df.groupby(lambda col: col.split('.')[0], axis=1).apply(lambda x: x.apply(sjoin, axis=1))

Which results in:

       A            B            C Data1 Data2
0  text1  text2;text3  text4;text5   abc   def
1  text2  text4;text3  text5;text1   asd   fgh
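
The row-wise apply gets expensive on large frames (see the comments below). A rough sketch for timing the two approaches yourself, assuming the example CSV is saved as example.csv and a pandas version where groupby(axis=1) is still available (numbers will vary by machine and pandas version):

import pandas as pd
from timeit import timeit

df = pd.read_csv('example.csv', delimiter=';')

# blow the two example rows up to 60,000 rows; make the keys unique
# so the stack-based approach does not merge repeated rows
big = pd.concat([df] * 30000, ignore_index=True)
big['Data1'] = big['Data1'] + big.index.astype(str)

def stack_approach(d):
    d = d.copy()
    d.columns = d.columns.map(lambda x: x.split('.')[0])
    # note: as discussed in the comments, rows whose index keys are NaN are dropped here
    return (d.set_index(['Data1', 'Data2'])
             .stack()
             .groupby(level=[0, 1, 2])
             .apply(','.join)
             .unstack())

def rowwise_apply_approach(d):
    def sjoin(x):
        return ';'.join(x[x.notnull()].astype(str))
    return d.groupby(lambda col: col.split('.')[0], axis=1).apply(
        lambda x: x.apply(sjoin, axis=1))

print(timeit(lambda: stack_approach(big), number=1))
print(timeit(lambda: rowwise_apply_approach(big), number=1))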
ChrisOram
  • I thought of that for a moment, but this involves a loop within a loop, probably not efficient on large datasets – mozway Sep 01 '21 at 09:37
  • Indeed, the decision on which method to use is dependent on the size of the dataset & how much you need to prioritize efficiency. – ChrisOram Sep 01 '21 at 09:40
  • For those interested, this method is 300 times slower than @mozway for the example csv given repeated for 60,000 rows – ChrisOram Sep 01 '21 at 09:50

Here's another solution, grouping the mangled columns with a regular expression:

# (.*?): capture the original column name at the start of the string,
# (?:\.\d+)?$: optionally followed by a dot and at least one digit at the end
shared_groupname = r"(.*?)(?:\.\d+)?$"

Let's see that in action:

>>> df.columns.str.extract(shared_groupname) 
       0
0  Data1
1  Data2
2      A
3      B
4      B
5      C
6      C

Then groupby this original column name, and apply the join:

grouped = df.groupby(df.columns.str.extract(shared_groupname, expand=False), axis=1)

res = grouped.apply(lambda x: x.dropna().astype(str).apply(', '.join, axis=1))
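
Note that DataFrame.groupby(..., axis=1) is deprecated in recent pandas. Under that assumption, a sketch of the same grouping done on the transpose instead:

import pandas as pd

df = pd.read_csv('example.csv', delimiter=';')

# map each (possibly mangled) column name back to its base name
base = df.columns.str.extract(r"(.*?)(?:\.\d+)?$", expand=False)

# transpose so columns become rows, group those rows by base name,
# join the non-null values per original row, then transpose back
# (an all-NaN cell, like the empty Data2, ends up as an empty string)
res = (df.T
         .groupby(base)
         .agg(lambda col: ', '.join(col.dropna().astype(str)))
         .T)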
w-m
  • This throws `TypeError: sequence item 0: expected str instance, float found` if the dataset contains something else than strings. _(floats for example)_ Works for string-only datasets though. – HackXIt Sep 01 '21 at 10:02
  • Ah. That's fixed with an additional `.dropna().astype(str)` in the apply, from the answer you cite above. – w-m Sep 01 '21 at 10:06
  • I've adjusted the original question to include numbers. – HackXIt Sep 01 '21 at 10:10
  • I'm getting `ValueError: No axis named 1 for object type Series` when applied in the larger dataset. I'm unsure where this comes from. – HackXIt Sep 01 '21 at 10:14