I have a csv which looks like this:
AB22,AD34,GQ22,BQ77a1,BQ77a2,BQ77a3,CA33,LA21,MO22c1,MO22c4
"ab,vd","va,ca","aa","ba,po,la","ma,na,qa","la,oo,aa","ca","na,la","re,te","ka,lo"
"vb,zr","ra,oa","na","oa,yo,sa","xa,ia,ga","lk,po,za","ja","ka,la","rv,gh","xa,jk"
The csv above is just a shortened version of the bigger csv I have, which has more rows and more columns, but this example is good enough for my question.
Now I have a list of column names which looks like this
columns = ["BQ77", "MO22"]
Now I need to find the columns in the csv whose names look like each of the names in the list, and collapse each such group of columns into a single column with the values joined by commas.
For example, for the column BQ77, the matching columns in the csv are BQ77a1,BQ77a2,BQ77a3, and for the column MO22, the matching columns in the csv are MO22c1,MO22c4.
These columns need to be collapsed, their values need to be joined together (comma separated), and the resulting column should take its name from the columns list.
So my csv should look like this:
AB22,AD34,GQ22,BQ77,CA33,LA21,MO22
"ab,vd","va,ca","aa","ba,po,la,ma,na,qa,la,oo,aa","ca","na,la","re,te,ka,lo"
"vb,zr","ra,oa","na","oa,yo,sa,xa,ia,ga,lk,po,za","ja","ka,la","rv,gh,xa,jk"
I created a mapping from each column name in the list to the columns in the csv which match it. This is what I did:
import pandas as pd

columns = ["BQ77", "MO22"]
# The file is originally an Excel file
df = pd.read_excel(io="/Users/souvikray/Downloads/test.xlsx", sheet_name="A1")
headers = df.columns.tolist()

# Map each name in the list to the headers that contain it
col_map = {}
for column1 in columns:
    for column2 in headers:
        if column1 in column2:
            col_map.setdefault(column1, []).append(column2)
So I get a mapping
col_map = {"BQ77": ["BQ77a1", "BQ77a2", "BQ77a3"], "MO22": ["MO22c1","MO22c4"]}
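For reference, the same mapping can probably be built more compactly. This is only a sketch: it assumes a header belongs to a group when it starts with the name from the list (a stricter test than the `in` check above), and the `headers` list is typed in by hand here instead of being read from the real file.

```python
# Header names copied from the example csv (stand-in for df.columns.tolist()).
headers = ["AB22", "AD34", "GQ22", "BQ77a1", "BQ77a2", "BQ77a3",
           "CA33", "LA21", "MO22c1", "MO22c4"]
columns = ["BQ77", "MO22"]

# Assumption: a header matches a name when it starts with that name.
col_map = {
    prefix: [h for h in headers if h.startswith(prefix)]
    for prefix in columns
}

# Drop names from the list that matched no header at all.
col_map = {name: members for name, members in col_map.items() if members}
```

Because the inner list is built by walking `headers` in order, the members of each group keep their original left-to-right order.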
Now I am not sure how I can use this information to collapse the similar-looking columns. I also looked online and found the question "Merge multiple column values into one column in python pandas", but there the columns are contiguous, whereas in my case the required columns occur at certain places.
Is there any way this can be done?
Note: Since I didn't post the entire csv, one thing to keep in mind is that the column values may include int and float values too.
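To make the goal concrete, here is a rough sketch of the kind of collapse I am after, not a tested solution for the real file. The helper name `collapse_columns` is made up for illustration, the DataFrame is typed in from the example above instead of being read from the Excel file, and it assumes the members of each group are listed in header order and that every column name is unique. Values are cast to `str` before joining so that int and float cells do not break the join.

```python
import pandas as pd

def collapse_columns(df, col_map):
    """Return a copy of df with each group in col_map merged into one column.

    collapse_columns is a hypothetical helper, not a pandas function.
    """
    out = df.copy()
    for new_name, members in col_map.items():
        # Cast to str first so int/float cells can be joined with text cells.
        joined = out[members].astype(str).apply(",".join, axis=1)
        # Remember where the first member sat, drop the group, then put the
        # merged column back at that position.
        position = out.columns.get_loc(members[0])
        out = out.drop(columns=members)
        out.insert(position, new_name, joined)
    return out

# Example data copied from the csv in the question.
df = pd.DataFrame({
    "AB22": ["ab,vd", "vb,zr"],
    "AD34": ["va,ca", "ra,oa"],
    "GQ22": ["aa", "na"],
    "BQ77a1": ["ba,po,la", "oa,yo,sa"],
    "BQ77a2": ["ma,na,qa", "xa,ia,ga"],
    "BQ77a3": ["la,oo,aa", "lk,po,za"],
    "CA33": ["ca", "ja"],
    "LA21": ["na,la", "ka,la"],
    "MO22c1": ["re,te", "rv,gh"],
    "MO22c4": ["ka,lo", "xa,jk"],
})
col_map = {"BQ77": ["BQ77a1", "BQ77a2", "BQ77a3"], "MO22": ["MO22c1", "MO22c4"]}

result = collapse_columns(df, col_map)
print(result.columns.tolist())
```

One caveat with this sketch: missing cells would show up as the text "nan" after the cast, and a float such as 1.0 would be written as "1.0", so the real data may need extra handling before joining.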