0

I have a dataframe that includes personal data, including a column called First Name. Due to the way the data was entered, some of the rows in this column contain two individuals, i.e. "Dan and Sally". I am able to filter these rows out based on the other criteria I need. What I would like to do is duplicate everything in those rows and split them so I get one for "Dan" and one for "Sally". Simplified, something like:

df1 = pd.DataFrame({'First Name':['Steve','Jim','Bob','Dan and Sally',
                                  'Mary','Jane','Bill and Kathy'],
                    'Last Name':['Jobs','Croce','Dylan','Blapper',
                                 'Carpenter','Goodall','Johnson']})

This results in:

>>> df1
       First Name  Last Name
0           Steve       Jobs
1             Jim      Croce
2             Bob      Dylan
3   Dan and Sally    Blapper
4            Mary  Carpenter
5            Jane    Goodall
6  Bill and Kathy    Johnson

Then, I can filter down using:

doubles = df1[df1['First Name'].str.contains(' and ')]

       First Name Last Name
3   Dan and Sally   Blapper
6  Bill and Kathy   Johnson

What I now want to do is duplicate the entire row for each of these, but have the First Name field in the first row be everything before ' and ' and the second one be everything after it:

First Name   Last Name
       Dan     Blapper
     Sally     Blapper
      Bill     Johnson
     Kathy     Johnson

These entries would replace the existing "and" entries in the original dataframe, so I have only true individuals (recognizing that I'm going to be duplicating contact info between two members of a household).

I thought I could use repeat and then filter them pairwise, but it seems like there's probably an obvious solution I'm missing.

markf
  • 127
  • 7

2 Answers2

2

Do split with explode

df1['First Name']=df1['First Name'].str.split(' and ')
df1=df1.explode('First Name')
df1
  First Name  Last Name
0      Steve       Jobs
1        Jim      Croce
2        Bob      Dylan
3        Dan    Blapper
3      Sally    Blapper
4       Mary  Carpenter
5       Jane    Goodall
6       Bill    Johnson
6      Kathy    Johnson
BENY
  • 317,841
  • 20
  • 164
  • 234
1

Adapted from this amazing answer:

df1 = pd.DataFrame(df1["First Name"].str.split(' and ').tolist(), index=df1["Last Name"]).stack().reset_index()[[0, 'Last Name']]
df1.columns = ["First Name", "Last Name"]
print(df1)
#  First Name  Last Name
#0      Steve       Jobs
#1        Jim      Croce
#2        Bob      Dylan
#3        Dan    Blapper
#4      Sally    Blapper
#5       Mary  Carpenter
#6       Jane    Goodall
#7       Bill    Johnson
#8      Kathy    Johnson
Anwarvic
  • 12,156
  • 4
  • 49
  • 69