I have a dataframe that includes personal data, including a column called First Name
. Due to the way the data was entered, some of the rows in this column contain two individuals, i.e. "Dan and Sally". I am able to filter these rows out based on the other criteria I need. What I would like to do is duplicate everything in those rows and split them so I get one for "Dan" and one for "Sally". Simplified, something like:
df1 = pd.DataFrame({'First Name':['Steve','Jim','Bob','Dan and Sally',
'Mary','Jane','Bill and Kathy'],
'Last Name':['Jobs','Croce','Dylan','Blapper',
'Carpenter','Goodall','Johnson']})
This results in:
>>> df1
First Name Last Name
0 Steve Jobs
1 Jim Croce
2 Bob Dylan
3 Dan and Sally Blapper
4 Mary Carpenter
5 Jane Goodall
6 Bill and Kathy Johnson
Then, I can filter down using:
doubles = df1[df1['First Name'].str.contains(' and ')]
First Name Last Name
3 Dan and Sally Blapper
6 Bill and Kathy Johnson
What I now want to do is duplicate the entire row for each of these, but have the First Name
field in the first row be everything before ' and ' and the second one be everything after it:
First Name Last Name
Dan Blapper
Sally Blapper
Bill Johnson
Kathy Johnson
These entries would replace the existing "and" entries in the original dataframe, so I have only true individuals (recognizing that I'm going to be duplicating contact info between two members of a household).
I thought I could use repeat
and then filter them pairwise, but it seems like there's probably an obvious solution I'm missing.