1

I have two data frames:

import pandas as pd

# Column contains column name
df1 = pd.DataFrame({"Column": pd.Series(['a', 'b', 'b', 'c']),
                    "Item": pd.Series(['x', 'y', 'z', 'x']),
                    "Result": pd.Series([3, 4, 5, 6])})

df2 = pd.DataFrame({"a": pd.Series(['x', 'n', 'n']),
                    "b": pd.Series(['x', 'y', 'n']),
                    "c": pd.Series(['x', 'z', 'n'])})

How can I add "Result" to df2 based on the "Item" in the "Column"? Expected dataframe df2 is:

a  b  c  Result
-  -  -  ------
x  x  x  3
n  y  z  4
n  n  n  null  

How can the above question be a duplicate of 3 questions, 2 of which are marked with an 'or' by @smci?

reservoirinvest
  • 1,463
  • 2
  • 16
  • 32
  • This is called 'merge` or 'join'. Specifically a left-join on df2 of df1's `Result` column. And you are trying to join on `df1.Column` == `df2.index` – smci Apr 19 '18 at 06:15
  • 1
    Possible duplicate of [Python Pandas merge only certain columns](https://stackoverflow.com/questions/17978133/python-pandas-merge-only-certain-columns) – smci Apr 19 '18 at 06:17
  • @iDrwish please repost your comment and edit a clarification into the question title and text. To make it clear why this isn't a duplicate question. – smci Apr 19 '18 at 06:19
  • So actually this is a **duplicate of [Pandas merge using dfA column == dfB index](https://stackoverflow.com/questions/42802023/pandas-merge-using-dfa-column-and-dfb-index)** or [Pandas merge on index column](https://stackoverflow.com/questions/45889486/pandas-merge-on-index-column) or similar. I can't change my vote now, but please vote for those. (The only nuance is OP only wants to merge the `Result` column, but that's trivial.) – smci Apr 19 '18 at 06:21
  • @smci can you give the command to do this merge / join pls? – reservoirinvest Apr 19 '18 at 07:32
  • I already did in the questions I cited above (of which this is a duplicate). Did you read them? – smci Apr 19 '18 at 07:41
  • I read but could not understand. There are 3 links given and each has a different dataset. – reservoirinvest Apr 19 '18 at 07:48
  • reservoirinvest, doesn't matter what the indidivual dataset looks like, really. Each of those has a) two dataframes: `df1,df2` (or `dfA,dfB`) b) the column you want to merge on, e.g. `df1.merge(right_index = True, left_on='Column'))` and c) the column you want to slice e.g. `'Result'`. Ok? Please read the answers to each, try them in your Python session and it should become clear. This really is a duplicate and we are obligated to close duplicates to prevent the site being flooded with the same questions over and over again. If you still don't understand, edit your new code into your question – smci Apr 19 '18 at 07:52
  • @smci df1.merge(right_index = True, left_on = 'Column') gives _positional argument error_. df1.merge(df2, right_index = True, left_on = 'Column') gives an empty dataframe. – reservoirinvest Apr 19 '18 at 08:08
  • Ah, sorry, I missed the twists; this is a lot more complicated than at first glance. `df1` is in long-form, it has two entries for 'b'. so it needs to be stacked/unstacked/pivoted into a 3x4 table where 'Column' becomes the index, and the values from 'Item' = 'x'/'y'/'z' give a full 3x3 matrix with NaN for missing values, and the 4th column will be 'Result'. ... – smci Apr 19 '18 at 09:34
  • ...then somehow we use the columns of `df2` (how exactly?) to join `Result` column from the full pivoted version of `df1`. It seems like we use the leftmost value of `df1['x','y','z']` which specifies a valid column name in `df2` to join the corresponding row from `df2` to get its `Result`. Does that sound right? – smci Apr 19 '18 at 09:38
  • It's still unclear which column of df2 are we supposed to use as column-name into df1: the diagonal of df2? the first valid leftmost name in each row of df2? and is 'n' an explicit invalid sentinel (like NaN?), or just a column name which doesn't happen to exist in df1 (that's going to be a pain to code, we would have to try multiple joins and combine their output)? – smci Apr 19 '18 at 10:37

1 Answers1

1

This is a lot more complicated than at first glance. df1 is in long-form, it has two entries for 'b'. So first it needs to be stacked/unstacked/pivoted into a 3x3 table of 'Result' where 'Column' becomes the index, and the values from 'Item' = 'x'/'y'/'z' are expanded to a full 3x3 matrix with NaN for missing values:

>>> df1_full = df1.pivot(index='Column', columns='Item', values='Result')
Item      x    y    z
Column               
a       3.0  NaN  NaN
b       NaN  4.0  5.0
c       6.0  NaN  NaN

(Note the unwanted type-conversion to float, this is because numpy doesn't have NaN for integers, see Issue 17013 in pre-pandas-0.22.0 versions. No problem, we'll just cast back to int at the end.)

Now we want to do df1_full.merge(df2, left_index=True, right_on=??)

But first we need another trick/intermediate column to find the leftmost valid value in df2 which corresponds to a valid column-name from df1; the value n is invalid, maybe we replace it with NaN to make life easier:

>>> df2.replace('n', np.NaN)
     a    b    c
0    x    x    x
1  NaN    y    z
2  NaN  NaN  NaN

>>> df2_nan.columns = [0,1,2]

     0    1    2
0    x    x    x
1  NaN    y    z
2  NaN  NaN  NaN

And we want to successively test df2's columns from L-to-R as to whether their value is in df1_full.columns, similar to Computing the first non-missing value from each column in a DataFrame , except testing successive columns (axis=1). Then store that intermediate column-name into a new column, 'join_col' :

>>> df2['join_col'] = df2.replace('n', np.NaN).apply(pd.Series.first_valid_index, axis=1)

   a  b  c join_col
0  x  x  x        a
1  n  y  z        b
2  n  n  n     None

Actually we want to index into the column-names of df1, but it blows up on the NaN:

>>> df1.columns[ df2_nan.apply(pd.Series.first_valid_index, axis=1) ]

(Well that's not exactly working, but you get the idea.)

Finally we do the merge df1_full.merge(df2, left_index=True, right_on='join_col'). And maybe take the desired column slice ['a','b','c','Result']. And cast Result back to int, or map 'Nan' -> 'null'.

smci
  • 32,567
  • 20
  • 113
  • 146