
I have two dataframes, df1 and df2, with sizes [9000000 rows x 11 columns] and [2700000 rows x 18 columns] respectively.

`df1` looks like this:

| Index   | Frame  | Chunk_Ids | ... | P_filter |
|:--------|:------:|:---------:|:---:|:--------:|
| 1       | 1      | 20001     | ... | 0.142    |
| 2       | 1      | 20001     | ... | 0.146    |
| 3       | 1      | 20001     | ... | 0.147    |
| ...     | ...    | ...       | ... | ...      |
| 9000000 | 260000 | 200020    | ... | 0.142    |

`df2` looks like this:

| Index   | Frame  | Chunk_Ids | ... | P_filter |
|:--------|:------:|:---------:|:---:|:--------:|
| 3       | 1      | 19999     | ... | 0.142    |
| 6       | 1      | 20001     | ... | 0.146    |
| 7       | 1      | 20001     | ... | 0.147    |
| ...     | ...    | ...       | ... | ...      |
| 2700000 | 260000 | 200019    | ... | 0.142    |

I have two questions: how do I merge the two dataframes under the conditions below?

  1. Merge only on the chunk IDs common to both dataframes, without missing any data (i.e. drop rows whose chunk ID appears in only one dataframe).
  2. Compare the df1 index with the df2 index, and if a particular index is missing from df2, replace the whole respective df2 row with -1.

I have burnt my head out over the last two days trying many methods, but I can't satisfy both conditions. Any leads will be appreciated.
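Here is a minimal toy version of the two dataframes for reproduction (the values are illustrative, not my real data):

```python
import pandas as pd

# Small stand-ins for df1 and df2 with the columns shown above.
df1 = pd.DataFrame({'Index': [1, 2, 3],
                    'Frame': [1, 1, 1],
                    'Chunk_Ids': [20001, 20001, 20001],
                    'P_filter': [0.142, 0.146, 0.147]})
df2 = pd.DataFrame({'Index': [3, 6, 7],
                    'Frame': [1, 1, 1],
                    'Chunk_Ids': [19999, 20001, 20001],
                    'P_filter': [0.142, 0.146, 0.147]})
```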


1 Answer


I didn't test this on your data, but something like this should work:

```python
import pandas as pd

# Merge on the shared chunk-ID column; overlapping column names get suffixes.
df_merged = df1.merge(df2, on='Chunk_Ids', suffixes=['_df1', '_df2'])

# Flag rows where either side's Index is missing (row-wise check).
df_merged['index_missing'] = df_merged[['Index_df1', 'Index_df2']].isnull().any(axis=1)

# df2 columns shared with df1 were suffixed; df2-only columns were not.
df2_columns = [c + '_df2' if c in df1.columns else c
               for c in df2.columns if c != 'Chunk_Ids']
df_merged.loc[df_merged.index_missing, df2_columns] = -1
```

I'm not entirely sure what you meant by "not missing any data", but you might prefer an outer join: `.merge(how='outer')`. By default an inner join is performed, meaning rows without a pair are lost.
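For illustration, the same merge as an outer join (same assumed column names as above):

```python
# An outer join keeps rows from both frames even when a Chunk_Ids value
# appears on only one side; the missing side comes back as NaN.
df_merged = df1.merge(df2, on='Chunk_Ids', how='outer', suffixes=['_df1', '_df2'])
```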

Patryk Kowalski
  • Thanks, but this method gives me a memory error while merging. – iamniki Jan 19 '22 at 12:19
  • In that case you might be interested in [merging dataframes larger than memory in dask](https://stackoverflow.com/questions/37756991/best-way-to-join-two-large-datasets-in-pandas) (see the dask sketch after this thread). – Patryk Kowalski Jan 19 '22 at 12:22
  • `missing any data` means: in df1 the chunk IDs start from 20000, whereas in df2 they start from 19999. I'm not interested in 19999 here, so ignore all rows with chunk 19999, and vice versa. The chunk ID should be common when you merge; later, compare the indexes and put -1 in df2 wherever an index is missing. – iamniki Jan 19 '22 at 12:23
  • Looks like you want to do an outer join on Index and Chunk_Ids, and use the merge parameter `indicator=True` to sort out where a join occurred successfully. Later you could fill missing values with -1. Joining on the chunk ID alone produces many possible matches, thus causing the memory error, if I understand your dataset correctly. – Patryk Kowalski Jan 19 '22 at 12:25
  • So basically, let me explain: I have two dataframes with a large dataset. Before merging, I have to make them the same size per the conditions mentioned above (no need to merge before filtering). First, kick out the rows whose chunk ID is not common to both dataframes. Second, fill rows of df2 with -1 by comparing against the df1 index (comparison condition: an index missing from df2, as above). After that, both dataframes are the same size, so then merge or concatenate them. – iamniki Jan 19 '22 at 12:26
  • Yes, at the end there is no need to merge with any inner or outer condition; both will be the same size, so I can easily concatenate on one index value. – iamniki Jan 19 '22 at 12:28
  • You just need a left join then; the filtering will occur during joining (see the sketch after this thread). – Patryk Kowalski Jan 19 '22 at 12:29
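Following the thread's conclusion, here is a hedged sketch of the filter-then-align approach described above. Column names are taken from the question, it assumes the `Index` values are unique within each frame, and it is untested at the stated scale:

```python
import pandas as pd

# 1. Keep only rows whose chunk ID appears in both frames.
common = set(df1['Chunk_Ids']) & set(df2['Chunk_Ids'])
df1_f = df1[df1['Chunk_Ids'].isin(common)]
df2_f = df2[df2['Chunk_Ids'].isin(common)]

# 2. Align df2 to df1's Index values; rows missing from df2 become -1.
#    (reindex requires the Index values in df2 to be unique.)
df2_aligned = (df2_f.set_index('Index')
                    .reindex(df1_f['Index'], fill_value=-1)
                    .reset_index())

# 3. Both frames now have the same length, so concatenate side by side.
result = pd.concat([df1_f.reset_index(drop=True),
                    df2_aligned.add_suffix('_df2')], axis=1)
```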
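And if a plain pandas merge still exhausts memory, a sketch of the dask route from the linked answer (the partition count of 16 is an arbitrary choice; tune it to your machine):

```python
import dask.dataframe as dd

# Split each frame into partitions so the merge can run out-of-core
# instead of holding everything in RAM at once.
ddf1 = dd.from_pandas(df1, npartitions=16)
ddf2 = dd.from_pandas(df2, npartitions=16)

merged = ddf1.merge(ddf2, on='Chunk_Ids', suffixes=['_df1', '_df2'])
result = merged.compute()  # materialise the result as a single pandas frame
```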