Notes:
If one wants to do it with Pandas dataframes, there are various ways to achieve that.
One approach would creating a dataframe to store the dataframes that have the highest number of rows per sample_id
, and then apply a custom made function. Let's invest a bit more time in that approach.
We will first create a dataframe to store the number of rows that each dataframe has per sample_id
as follows
df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()})
df_count['df_a'] = df_count['sample_id'].map(df_a.groupby('sample_id').size())
df_count['df_b'] = df_count['sample_id'].map(df_b.groupby('sample_id').size())
As it will be helpful, let us create a column df_max
that will store the dataframe that has more rows per sample_id
df_count['df_max'] = df_count[['df_a', 'df_b']].idxmax(axis=1)
[Out]:
sample_id df_a df_b df_max
0 1 2 1 df_a
1 2 4 2 df_a
2 3 1 3 df_b
3 4 1 1 df_a
4 5 6 3 df_a
A one-liner to create the desired df_count would look like the following
df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()}).assign(df_a=lambda x: x['sample_id'].map(df_a.groupby('sample_id').size()), df_b=lambda x: x['sample_id'].map(df_b.groupby('sample_id').size()), df_max=lambda x: x[['df_a', 'df_b']].idxmax(axis=1))
Now, given df_a
, df_b
, and df_count
, one will want a function to merge the dataframes based on a specific condition:
If df_max
is df_a
, then take the rows from df_a
.
If df_max
is df_b
, then take the rows from df_b
.
One can create a function merge_df
that takes df_a
, df_b
, and df_count
and returns the merged dataframe
def merge_df(df_a, df_b, df_count):
# Create a list to store the dataframes
df_list = []
# Iterate over the rows in df_count
for index, row in df_count.iterrows():
# If df_max is df_a, then take the rows from df_a
if row['df_max'] == 'df_a':
df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
# If df_max is df_b, then take the rows from df_b
elif row['df_max'] == 'df_b':
df_list.append(df_b[df_b['sample_id'] == row['sample_id']])
# If df_max is neither df_a nor df_b, then use the first dataframe
else:
df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
# Concatenate the dataframes in df_list and return the result. Also, reset the index.
return pd.concat(df_list).reset_index(drop=True)
Then one can apply the function
df_merged = merge_df(df_a, df_b, df_count)
[Out]:
sample_id
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 4
10 5
11 5
12 5
13 5
14 5
15 5