How to join two dataframes in Python?

Question

I need to complete fields of some columns from two dataframes in order to create a final one.

First dataframe is this. It contains all the kids that are in a group. They have a name and a city to which they belong. There will never be NaN in their name or in their city. But two kids could be named the same, but they will always belong to a different city. (e.g.: there is a Betty from Miami and a Betty from Madrid - two different kids.)

data = {'Kids': ['Ronnie', 'Sofia', 'Manny', 'Claire Ann', 'Joe', 'Betty', 'Cheryl', 'Archie', 'Joe', 'Phil' ,'Luke', 'Alex', 'Haley', 'Joe', 'Betty'],
       'City': ['New York City', 'Boston', 'Los Angeles', 'San Francisco', 'Detroit', 'Miami', 'London', 'Bruxelles' ,'Paris', 'Berlin', 'Stockholm', 'Amsterdam', 'Massachusetts', 'Barcelona','Madrid']}

all_kids_df = pd.DataFrame (data, columns = ['Kids', 'City'])

And it looks like this:

          Kids           City
0       Ronnie  New York City
1        Sofia         Boston
2        Manny    Los Angeles
3   Claire Ann  San Francisco
4          Joe        Detroit
5        Betty          Miami
6       Cheryl         London
7       Archie      Bruxelles
8          Joe          Paris
9         Phil         Berlin
10        Luke      Stockholm
11        Alex      Amsterdam
12       Haley  Massachusetts
13         Joe      Barcelona
14       Betty         Madrid

Some of these kids will go on a trip to the MET museum in NYC. Some wrote their name and their city, some wrote only their name. But none of them only wrote just their city, without writing their names. If two kids have the same name (e.g. Betty), but she only wrote her name, then both girls named Betty will go on this trip (not fair, but they will enjoy it). The difference between them will be the city to where they belong. (look at the example)

The second dataframe is this. It represents the list of the kids that will go on this trip, with their names, the cities to where they belong, and their preferred transport method, from the home to the MIT museum in NYC.

data = {'Kids': ['Sofia', 'Claire Ann', 'Joe', 'Betty', 'Archie', 'Joe', 'Phil' ,'Luke',],
       'City': ['Boston', np.nan, 'Detroit', np.nan, 'Bruxelles' , np.nan, 'Berlin', np.nan],
       'Preferred Transport': ['Bus', np.nan, 'Train', 'Car', np.nan, 'Airplane', 'Ship', 'Airplane']}

nyc_trip_df = pd.DataFrame (data, columns = ['Kids', 'City', 'Preferred Transport'])

And it looks like this:

         Kids       City Preferred Transport
0       Sofia     Boston                 Bus
1  Claire Ann        NaN                 NaN
2         Joe    Detroit               Train
3       Betty        NaN                 Car
4      Archie  Bruxelles                 NaN
5         Joe        NaN            Airplane
6        Phil     Berlin                Ship
7        Luke        NaN            Airplane

So as the above list is not written very clear, it should be corrected. In addition to the mentioned conditions, there is another one:

If a child wrote all the needed info (name, city and preferred transport), then his/her associated field from the Preferred Transport will remain the same.
If a child wrote only some of the needed info (name, city), then the associated field from the Preferred Transport will be NaN, as it is missing.
If a child wrote only his/her name, then all the kids that have the same name will be put on the list, with their cities associated (e.g. only the name Betty appears --> then it will be Betty | Madrid and Betty | Miami) and NaN for both at the Preferred Transport section.
If for example Joe | Detroit appears and another Joe without his city, but with his preferred transport - Airplane, then Joe | Detroit will remain with his preferred transport (NaN or whatever he chose), while the other Joe will be duplicated into Joe | Paris and Joe | Barcelona, both having Airplane as their Preferred Transport.

The final resulted list should be like this one:

         Kids           City Preferred Transport
0       Sofia         Boston                 Bus
1  Claire Ann  San Francisco                 NaN
2         Joe        Detroit               Train
3       Betty          Miami                 Car
4       Betty         Madrid                 Car
5      Archie      Bruxelles                 NaN
6         Joe          Paris            Airplane
7         Joe      Barcelona            Airplane
8        Phil         Berlin                Ship
9        Luke      Stockholm            Airplane

I tried writing this code, but I feel like I lost at some point data. And it contains so much verbosity.

n = len(kids_list)
aux_kids_list = []
aux_city_list = []
for i in range(0, n):
    if (city_list[i] != np.nan & transport_list[i] == np.nan):
        transport_list[i] = df.loc[df['Kids'] == kids_list[i] & df['City'] == city_list[i], 'Preferred Transport']
    if (city_list[i] == np.nan):
        here = -1
        for name in kids_list:
            here += 1
            if name == kids_list[i]:
                aux_kids_list.append(kids_list[i])
                aux_city_list.append(city_list[here])
                aux_transport_list.append(df.loc[df['Kids'] == kids_list[i] & df['City'] == city_list[i], 'Preferred Transport'])
                

final_kids_list = kids_list + aux_kids_list
final_city_list = city_list + aux_city_list
final_transport_list = transport_list + aux_transport_list


data = {'Kids': final_kids_list,
       'City': final_city_list,
       'Preferred Transport': final_transport_list }

final_list_nyc_trip_df = pd.DataFrame (data, columns = ['Kids', 'City', 'Preferred Transport'])
final_list_nyc_trip_df = final_list_nyc_trip_df.drop_duplicates(keep='first')

I am at the very beginning, and I highly appreciate all the help you could possibly give to me! Thank you very very much in advanced!

I would suggest trying to break this problem into discrete steps and asking a separate question for each — anon01, Apr 07 '21 at 19:10
Thank you, @anon01! I started doing this, hopefully by breaking the problem into smaller steps I will be able to properly resolve it! — Elisa L., Apr 07 '21 at 19:12

score 0 · Answer 1 · edited Jun 18 '21 at 02:16

I think that you could take the following approach:

Construct a new, shared key between these two dataframes that is shared between the same people (even when data is missing)
Use the df.merge function to merge these two dataframes together.

The hard part in the above is making sure that your shared key makes sense. I think that concatenating the name to the city should work - although it might be a little wacky when folks didn't put their city. In this case, I think you can just construct another shared key (just with names) and merge on that as well.

How to join two dataframes in Python?

1 Answers1