0

I am trying to do the same as this answer, but with the difference that I'd like to ignore NaN in some cases. For instance:

#df1
     c1    c2    c3
0    a     b     1
1    a     c     2
2    a     nan   1
3    b     nan   3
4    c     d     1
5    d     e     3

#df2
     c1    c2    c4
0    a     nan   1
1    a     c     2
2    a     x     1
3    b     nan   3
4    z     y     2

#merged output based on [c1, c2], dropping instances 
#with `NaN` unless both dataframes have `NaN`.

     c1    c2    c3    c4
0    a     b     1     1   #c1,c2 from df1 because df2 has a nan in c2
1    a     c     2     2   #in both
2    a     x     1     1   #c1,c2 from df2 because df1 has a nan in c2
3    b     nan   3     3   #c1,c2 as found in both
4    c     d     1     nan #from df1
5    d     e     3     nan #from df1
6    z     y     nan   2   #from df2

NaNs may come from either c1 or c2, but for this example I kept it simpler.

I'm not sure what's the cleanest way to do this. I was thinking to merge based on [c1,c2], and then loop by rows with nan, but this will not be so direct. Do you see a better way to do it?

Edit - clarifying conditions
1. No duplicates are found anywhere.
2. No combination is performed between two rows if they both have values. c1 may not be combined with c2, so order must be respected.
3. For the cases where one of the 2 dfs has a nan in either c1 or c2, find the rows in the other dataframe that don't have a full match on both c1+c2, and use it. For instance:

  • (a,c) has a match in both so it is no longer discussed.
  • (a,b) is only in df1. No b is found in df2.c2. The only row in df2 with a known key and a nan is row 0 so it is combined with this one. Note that order must be respected this is why (a,b) #df1 cannot be combined with any other row of df2 that also contains a b.
  • (a,x) is only in df2. No x is found in df1.c2. The only row in df1 with one of the known keys with a nan is row with index 2.
Sos
  • 1,783
  • 2
  • 20
  • 46
  • `df1.combine_first(df2)` – anky Jul 25 '19 at 11:47
  • If your data is alligned as your example data, you can simply use: `df1['c2'] = df1['c2'].fillna(df2['c2'])` – Erfan Jul 25 '19 at 11:47
  • Let me know if your data is alligned as you show in your question. If this is not the case neither mine or anky's answer will work. I will vote for reopen. – Erfan Jul 25 '19 at 11:49
  • @Erfan in my case, both `c1` and `c2` contain `nan`. I assume in this case anky's answer would work? The rows may be sorted differently (but this would be fixed by `sort_values(['c1','c2'])` – Sos Jul 25 '19 at 11:56
  • 1
    You have to test yourself, the rows have to be aligned as you show in your example dataset. If a sort will fix this, that means your data has the same amount of rows, plus the amount of keysL `a`, `b` etc. – Erfan Jul 25 '19 at 11:59
  • @Erfan I've updated my working example. I think my suggestion won''t work after all. They have very different sizes (`df2` is significantly smaller in number or rows), and keys found in one do not necessarily appear at all in the other. – Sos Jul 25 '19 at 12:09
  • My appologies to both you and @anky_91 for not being clear from the outset. – Sos Jul 25 '19 at 12:17
  • 2
    your edit to the question introduces some major additional problems. Your expected output no longer makes sense in terms of a "merge" context, because if you were using merge, both tables have a ('a', nan) row, and shouldn't that get paired up? I think you really need to clarify for yourself some restrictions on how this would go, otherwise I don't see you asking for "one solution". There's consistency issues. Simply put, why should the 3rd row of df2 get combined with the 3rd row of df1? The first row of df2 matches much better with df1 – Paritosh Singh Jul 25 '19 at 12:39
  • Hi @ParitoshSingh, well noticed, thanks for your comments. I have added some points to clarify, hope it is clearer now – Sos Jul 25 '19 at 13:32
  • One more point: I supposed your saying `The first row of df2 matches much better with df1` is because of `c3` and `c4`, but these cannot be used as keys to merge. Long-story short: we want to merge rows that have a `nan` with the best (only) match in the other df that doesn't have a `nan` and no correspondence in both columns `c1,c2` – Sos Jul 25 '19 at 13:34

0 Answers0