
I'm new to using Pandas and I'm currently trying to clean up one of my dataframes so I can merge/join with another dataframe based on a shared ID. The problem is that one of my dataframes has multiple ID columns that may both include lists of IDs, all of which are valid.

I am trying to put each ID on its own row to make the merging process easier. Here is what my data may look like:

import pandas as pd
import numpy as np
data = pd.DataFrame({"Name": ["Bob", "Mary"], "Age": [50, 55], "ID1": ["1,2", np.nan], "ID2": ["1,3", "4,5,6"]})
   Age  ID1    ID2  Name
0   50  1,2    1,3   Bob
1   55  NaN  4,5,6  Mary

This is what I would like the output to be (unless there is an easier way to set this dataframe up for merging with another dataframe based on ID).

dataFixed = pd.DataFrame({"Name": ["Bob", "Bob", "Bob", "Mary", "Mary", "Mary"], "Age": [50, 50, 50, 55, 55, 55], "ID": ["1", "2", "3", "4", "5", "6"]})
   Age ID  Name
0   50  1   Bob
1   50  2   Bob
2   50  3   Bob
3   55  4  Mary
4   55  5  Mary
5   55  6  Mary

My initial thought was to split each of the ID strings into lists of IDs rather than comma-separated strings, and then to build a set of IDs in a single column as my starting point. But I can't even get that far, as I am getting an "unhashable type: 'list'" error. This is some of the code from my initial efforts:

data["ID1"] = data["ID1"].str.split(",")
data["ID2"] = data["ID2"].str.split(",")
data["ID"] = data[["ID1", "ID2"]].apply(lambda x: ",".join(sorted(set(x), key=str)))

2 Answers


IIUC, this is string unnesting:

data = data.fillna('')
data['ID'] = data.ID1.str.split(',') + data.ID2.str.split(',')
(data.set_index(['Age', 'Name']).ID.apply(pd.Series)
     .replace('', np.nan).stack().drop_duplicates()
     .reset_index().drop(columns='level_2'))
Out[560]: 
   Age  Name  0
0   50   Bob  1
1   50   Bob  2
2   50   Bob  3
3   55  Mary  4
4   55  Mary  5
5   55  Mary  6
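On newer pandas (0.25+), the same unnesting can be written more directly with DataFrame.explode; a sketch along the same lines, not the code above:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({"Name": ["Bob", "Mary"], "Age": [50, 55],
                     "ID1": ["1,2", np.nan], "ID2": ["1,3", "4,5,6"]})

# Join both ID columns into one comma-separated string per row, split,
# deduplicate, then give each ID its own row with explode.
combined = (data["ID1"].fillna("") + "," + data["ID2"].fillna("")).str.split(",")
data["ID"] = combined.apply(lambda ids: sorted({i for i in ids if i}, key=int))
result = (data.drop(columns=["ID1", "ID2"])
              .explode("ID")
              .reset_index(drop=True))
```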
BENY

Here is my solution. A lot was borrowed from this question.

An issue I had with a lot of the solutions I found was that they required manually indexing the remaining columns when rebuilding a new dataframe. I wanted to avoid this because my actual frame is much wider than three columns.

data[["ID1","ID2"]] = data[["ID1","ID2"]].fillna("")
data["ID"] = data["ID1"] + "," + data["ID2"]
data["ID"] = data["ID"].apply(lambda x: list(filter(None,list(set(x.split(","))))))
data = data.drop(["ID1", "ID2"], axis=1)

rows = []
_ = data.apply(lambda row: [rows.append(row.tolist()[0:-1] + [i])
                               for i in row.ID], axis=1)
data_fixed = pd.DataFrame(rows, columns=data.columns[0:])
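With one ID per row, the merge the question is building toward becomes a plain merge on the shared column. The second frame and its Score column here are made up for illustration:

```python
import pandas as pd

data_fixed = pd.DataFrame({"Name": ["Bob"] * 3 + ["Mary"] * 3,
                           "Age": [50] * 3 + [55] * 3,
                           "ID": ["1", "2", "3", "4", "5", "6"]})

# Hypothetical second frame keyed by the same IDs.
other = pd.DataFrame({"ID": ["1", "2", "3", "4", "5", "6"],
                      "Score": [10, 20, 30, 40, 50, 60]})

# One ID per row makes the join a one-liner; how="left" keeps every
# row of data_fixed even if an ID has no match in other.
merged = data_fixed.merge(other, on="ID", how="left")
```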