I need to complete fields of some columns from two dataframes in order to create a final one.
First dataframe is this. It contains all the kids that are in a group. They have a name and a city to which they belong. There will never be NaN in their name or in their city. But two kids could be named the same, but they will always belong to a different city. (e.g.: there is a Betty from Miami and a Betty from Madrid - two different kids.)
data = {'Kids': ['Ronnie', 'Sofia', 'Manny', 'Claire Ann', 'Joe', 'Betty', 'Cheryl', 'Archie', 'Joe', 'Phil' ,'Luke', 'Alex', 'Haley', 'Joe', 'Betty'],
'City': ['New York City', 'Boston', 'Los Angeles', 'San Francisco', 'Detroit', 'Miami', 'London', 'Bruxelles' ,'Paris', 'Berlin', 'Stockholm', 'Amsterdam', 'Massachusetts', 'Barcelona','Madrid']}
all_kids_df = pd.DataFrame (data, columns = ['Kids', 'City'])
And it looks like this:
Kids City
0 Ronnie New York City
1 Sofia Boston
2 Manny Los Angeles
3 Claire Ann San Francisco
4 Joe Detroit
5 Betty Miami
6 Cheryl London
7 Archie Bruxelles
8 Joe Paris
9 Phil Berlin
10 Luke Stockholm
11 Alex Amsterdam
12 Haley Massachusetts
13 Joe Barcelona
14 Betty Madrid
Some of these kids will go on a trip to the MET museum in NYC. Some wrote their name and their city, some wrote only their name. But none of them only wrote just their city, without writing their names. If two kids have the same name (e.g. Betty), but she only wrote her name, then both girls named Betty will go on this trip (not fair, but they will enjoy it). The difference between them will be the city to where they belong. (look at the example)
The second dataframe is this. It represents the list of the kids that will go on this trip, with their names, the cities to where they belong, and their preferred transport method, from the home to the MIT museum in NYC.
data = {'Kids': ['Sofia', 'Claire Ann', 'Joe', 'Betty', 'Archie', 'Joe', 'Phil' ,'Luke',],
'City': ['Boston', np.nan, 'Detroit', np.nan, 'Bruxelles' , np.nan, 'Berlin', np.nan],
'Preferred Transport': ['Bus', np.nan, 'Train', 'Car', np.nan, 'Airplane', 'Ship', 'Airplane']}
nyc_trip_df = pd.DataFrame (data, columns = ['Kids', 'City', 'Preferred Transport'])
And it looks like this:
Kids City Preferred Transport
0 Sofia Boston Bus
1 Claire Ann NaN NaN
2 Joe Detroit Train
3 Betty NaN Car
4 Archie Bruxelles NaN
5 Joe NaN Airplane
6 Phil Berlin Ship
7 Luke NaN Airplane
So as the above list is not written very clear, it should be corrected. In addition to the mentioned conditions, there is another one:
- If a child wrote all the needed info (name, city and preferred transport), then his/her associated field from the Preferred Transport will remain the same.
- If a child wrote only some of the needed info (name, city), then the associated field from the Preferred Transport will be NaN, as it is missing.
- If a child wrote only his/her name, then all the kids that have the same name will be put on the list, with their cities associated (e.g. only the name Betty appears --> then it will be Betty | Madrid and Betty | Miami) and NaN for both at the Preferred Transport section.
- If for example Joe | Detroit appears and another Joe without his city, but with his preferred transport - Airplane, then Joe | Detroit will remain with his preferred transport (NaN or whatever he chose), while the other Joe will be duplicated into Joe | Paris and Joe | Barcelona, both having Airplane as their Preferred Transport.
The final resulted list should be like this one:
Kids City Preferred Transport
0 Sofia Boston Bus
1 Claire Ann San Francisco NaN
2 Joe Detroit Train
3 Betty Miami Car
4 Betty Madrid Car
5 Archie Bruxelles NaN
6 Joe Paris Airplane
7 Joe Barcelona Airplane
8 Phil Berlin Ship
9 Luke Stockholm Airplane
I tried writing this code, but I feel like I lost at some point data. And it contains so much verbosity.
n = len(kids_list)
aux_kids_list = []
aux_city_list = []
for i in range(0, n):
if (city_list[i] != np.nan & transport_list[i] == np.nan):
transport_list[i] = df.loc[df['Kids'] == kids_list[i] & df['City'] == city_list[i], 'Preferred Transport']
if (city_list[i] == np.nan):
here = -1
for name in kids_list:
here += 1
if name == kids_list[i]:
aux_kids_list.append(kids_list[i])
aux_city_list.append(city_list[here])
aux_transport_list.append(df.loc[df['Kids'] == kids_list[i] & df['City'] == city_list[i], 'Preferred Transport'])
final_kids_list = kids_list + aux_kids_list
final_city_list = city_list + aux_city_list
final_transport_list = transport_list + aux_transport_list
data = {'Kids': final_kids_list,
'City': final_city_list,
'Preferred Transport': final_transport_list }
final_list_nyc_trip_df = pd.DataFrame (data, columns = ['Kids', 'City', 'Preferred Transport'])
final_list_nyc_trip_df = final_list_nyc_trip_df.drop_duplicates(keep='first')
I am at the very beginning, and I highly appreciate all the help you could possibly give to me! Thank you very very much in advanced!