I'm new to using Pandas and I'm currently trying to clean up one of my dataframes so I can merge/join with another dataframe based on a shared ID. The problem is that one of my dataframes has multiple ID columns that may both include lists of IDs, all of which are valid.
I am trying to put each ID on a unique row line to make the merging process easier. Here is what my data may look like:
import pandas as pd
import numpy as np
data = pd.DataFrame({"Name":["Bob", "Mary"], "Age":[50,55],"ID1":["1,2",np.NaN],"ID2":["1,3","4,5,6"]})
Age ID1 ID2 Name
0 50 1,2 1,3 Bob
1 55 NaN 4,5,6 Mary
This is what I would like the output to be (unless there is an easier way to set this dataframe up for merging with another dataframe based on ID).
dataFixed = pd.DataFrame({"Name":["Bob", "Bob", "Bob","Mary","Mary","Mary"], "Age":[50,50,50,55,55,55],"ID":["1,2","3","4","5","6"]})
Age ID Name
0 50 1 Bob
1 50 2 Bob
2 50 3 Bob
3 55 4 Mary
4 55 5 Mary
5 55 6 Mary
My initial thought was to combine each of the ID strings by making them into lists of IDs rather than comma separated strings, and creating a set of IDs as a single column as my starting point. But I can't even get to that point as I am getting a "unhashable type: 'list'" error. This is some of the code from my initial efforts:
data["ID1"] = data["ID1"].str.split(",")
data["ID2"] = data["ID2"].str.split(",")
data["ID"] = data[["ID1", "ID2"]].apply(lambda x: ",".join(sorted(set(x), key=str)))