0

I have a csv which looks like below

AB22,AD34,GQ22,BQ77a1,BQ77a2,BQ77a3,CA33,LA21,MO22c1,MO22c4
"ab,vd","va,ca","aa","ba,po,la","ma,na,qa","la,oo,aa","ca","na,la","re,te","ka,lo"
"vb,zr","ra,oa","na","oa,yo,sa","xa,ia,ga","lk,po,za","ja","ka,la","rv,gh","xa,jk"

The above csv is just a shorter version of the bigger csv I have. It has more rows and more columns. But this example is good enough for my question.

Now I have a list of column names which looks like this

columns = ["BQ77", "MO22"]

Now I need to look up the columns in the csv which looks like each of column names I have in the list and collapse such columns into one where I make the values comma separated.

For example for the column BQ77, the columns that look like it in the csv are BQ77a1,BQ77a2,BQ77a3 and for the column MO22, the columns in the csv are MO22c1,MO22c4

Now such columns need to be collapsed and the values need to joined together (comma separated) and the column name should be the column from the columns list.

So my csv should look like this

AB22,AD34,GQ22,BQ77,CA33,LA21,MO22
"ab,vd","va,ca","aa","ba,po,la,ma,na,qa,la,oo,aa","ca","na,la","re,te,ka,lo"
"vb,zr","ra,oa","na","oa,yo,sa,xa,ia,ga,lk,po,za","ja","ka,la","rv,gh,xa,jk"

I created a mapping of columns given in the list with the columns in the csv which match them. So this is what I did

import pandas as pd

columns = ["BQ77", "MO22"]
df = pd.read_excel(io="/Users/souvikray/Downloads/test.xlsx", sheet_name="A1") // file originally is an excel file
headers = df.columns.tolist()
col_map = {}
for column1 in columns:
    for column2 in headers:
        if column1 in column2:
            if col_map.get(column1):
                col_map[column1].append(column2)
            else:
                col_map[column1] = [column2]

So I get a mapping

col_map = {"BQ77": ["BQ77a1", "BQ77a2", "BQ77a3"], "MO22": ["MO22c1","MO22c4"]}

Now I am not sure how can I use this information to do a collapse of similar looking columns. I also looked up online and found this question Merge multiple column values into one column in python pandas but here the columns are continuous but in my case, the required columns occur at certain places

Is there any way this can be done?

Note: Since I didn't post the entire csv, so one thing to keep in mind is the column values may have int and float too.

Souvik Ray
  • 2,899
  • 5
  • 38
  • 70

2 Answers2

1

Try this,df is your main dataframe

A=[["ab,vd","va,ca","aa","ba,po,la","ma,na,qa","la,oo,aa","ca","na,la","re,te","ka,lo"],
   ["vb,zr","ra,oa","na","oa,yo,sa","xa,ia,ga","lk,po,za","ja","ka,la","rv,gh","xa,jk"]]
df=pd.DataFrame(A ,columns = ["AB22","AD34","GQ22","BQ77a1","BQ77a2","BQ77a3","CA33","LA21","MO22c1","MO22c4"])   
columns = ["BQ77", "MO22"]
for c in columns:
    cc = [i for i in df.columns if i.startswith(c)]
    df[c] = df[cc].apply(','.join, axis=1)
    df=df.drop(columns=cc,axis=1)

output

AB22   AD34 GQ22 CA33   LA21                        BQ77         MO22
0  ab,vd  va,ca   aa   ca  na,la  ba,po,la,ma,na,qa,la,oo,aa  re,te,ka,lo
1  vb,zr  ra,oa   na   ja  ka,la  oa,yo,sa,xa,ia,ga,lk,po,za  rv,gh,xa,jk
Rajith Thennakoon
  • 3,975
  • 2
  • 14
  • 24
  • hi, your solution maybe correct .But because I didn't post the entire csv, so you are not aware of this. Actually the column values might have `float` and `int` too. So it throws error as `TypeError: ('sequence item 0: expected string, float found', u'occurred at index 0')`. Can there be a check where the `join` happens only for string values. Sorry I didn't' make this obvious. – Souvik Ray May 26 '20 at 08:37
  • @SouvikRay use `df[cc].astype(str).apply(','.join, axis=1)` – Shubham Sharma May 26 '20 at 08:42
  • @ShubhamSharma thanks. I moved one step further. But now it throws a `UnicodeEncodeError`. Guess I have to do something similar like this https://stackoverflow.com/a/52484350/5550284 – Souvik Ray May 26 '20 at 08:59
  • @ShubhamSharma I figured it out. On using `as type('unicode')` it doesn't throw any errors. – Souvik Ray May 26 '20 at 09:10
  • @SouvikRay Good for you. – Shubham Sharma May 26 '20 at 09:22
  • thank you @ShubhamSharma for helping while i am off... Souvik happy coding!! – Rajith Thennakoon May 26 '20 at 10:24
1

IIUC, using join stack and filter

s = df.filter(regex='|'.join(columns)).stack().reset_index()

df1 = df.join(
    s.groupby([s["level_0"], s["level_1"].str[:2]])[0].agg(",".join).unstack(1)
).drop(s["level_1"].unique(), axis=1).rename(columns={'BQ' : 'BQ77', 'MO' : 'MO22' })

print(df1)

    AB22   AD34 GQ22 CA33   LA21                        BQ77         MO22
0  ab,vd  va,ca   aa   ca  na,la  ba,po,la,ma,na,qa,la,oo,aa  re,te,ka,lo
1  vb,zr  ra,oa   na   ja  ka,la  oa,yo,sa,xa,ia,ga,lk,po,za  rv,gh,xa,jk
Umar.H
  • 22,559
  • 7
  • 39
  • 74