0

I have two pandas.DataFrames with overlapping columns and indices, like

X = pandas.DataFrame({"A": ["A0", "A1", "A2"], "B": ["B0", None, "B2"]},
                     index=[0, 1, 2])
Y = pandas.DataFrame({"A": [V, "A3"], "B": ["B1", "B3"], "C": ["C1", "C3"]},
                     index=[1, 3])

I would like to extend X by the values in Y, whereever data is missing, keeping the same columns. That is

  1. if V=="A1" or pandas.isnull(V), I'd like to obtain

    >>> X.fill_from(Y)
        A     B
    0  A0    B0
    1  A1    B1
    2  A2    B2
    3  A3    B3
    

    The value B1 has been filled from Y because the previous value, None, is a null value in pandas. Row 3 has been added because all values in that row were not given in X, because X had no such row.

  2. If V!="A1", I want to get an exception raised concerning the fact that the data frames contain incompatible data.

If I was sure my data had no missing data, pandas.concat((X, Y), join_axes=[X.columns]) would do the extension, and DataFrame.index.get_duplicates() would tell me if there were mis-matching rows.

The hard part is making sure that missing data is not taken to be different from present data, but can be filled in, and I don't see how to do it without iterating over every possible pair in get_duplicates() and copying data manually.

This question with a similar title is not really related. Using X[X.isnull()] = Y, as in this other question, does not work with the get_duplicates() mis-matching check.

Community
  • 1
  • 1
Anaphory
  • 6,045
  • 4
  • 37
  • 68
  • 1
    Just skim-read, but have you tried `combine_first`? – IanS Jun 13 '16 at 14:07
  • `combine_first` is looking very good for matching data, thank you. Now I need to see how to get errors thrown for non-matching data (because comparing data frames is not obvious either). – Anaphory Jun 13 '16 at 14:07
  • As @IanS suggested you can call `combine_first` `X.combine_first(Y)[['A','B']]` to achieve what you want – EdChum Jun 13 '16 at 14:10
  • If you `merge` with `indicator=True` then you can see whether rows are left/right/both `X.merge(Y, how='outer', indicator=True)` – EdChum Jun 13 '16 at 14:12

1 Answers1

0

The combine_first method is half the deal, thanks to @IanS for pointing it out.

>>> X.combine_first(Y)[list(X.columns)]
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3

Now, if V is nice, we should get the same result when combine_firsting in the other direction, otherwise we will get something different. And because NaNs do not compare nicely, the whole function is

def combine_first_if_matching(X, Y):
    filled = X.combine_first(Y)[list(X.columns)]
    reverse_filled = Y.combine_first(X)[list(X.columns)]
    if ((filled == reverse_filled) | (filled.isnull())).all().all():
        return filled
    else:
        raise ValueError("Overlap of data frames did not match")
Anaphory
  • 6,045
  • 4
  • 37
  • 68