1

The follows on from: Pandas - creating 2 new columns based on 2 columns and a separate test column

But it's a different question in it's own right. It should be simpler!

In the referenced question the following one-liner is discussed for data-filling 2 new columns from 2 other columns, and dependent on the value of a third column:

df['Buyer ID'], df['Seller ID'] = zip(
    *np.where(df.buy_sell == 'Buy',
             (df.buyer_name,df.seller_name), 
             (df.seller_name,df.buyer_name)).T)

This works well - but when I try to simplify this to use fixed scalar values rather than corresponding values in other columns, it doesn't work.

For example, if I only have one possible buyer, John, and one possible Seller, Maggie, then the follow simpler construct should suffice:

df['Buyer ID'], df['Seller ID'] = zip(
    *np.where(df.buy_sell == 'Buy',
             ("John","Maggie"), 
             ("Maggie","John")).T)

This is failing on the inner np.where() call with:

operands could not be broadcast together with shapes

I've tried a few variations like wrapping the tuples in zip(), which changes the shape but I still get the error. I think the problem is that ("John","Maggie") is not returned as the contents of a single column. The tuple is expanded to mean >1 column?

This link also showed some promise: Changing certain values in multiple columns of a pandas DataFrame at once

But I think the solution assumes the columns you wish to edit already exist and that you only want the same single value placed in every column.

I can get around the problem by making several passes, but it's not ideal:

np.where(df.buy_sell == 'Buy', 'John', 'Maggie') 

Ideally for each row, I want a single-pass solution extendible to N new columns being filled with different, fixed, default values, but all depending on a single (boolean) value in another column.

Any pointers on what I'm missing?

Phil
  • 592
  • 6
  • 15

2 Answers2

3

I think you need expand mask to 2d array, because need 2 new columns by numpy.column_stack:

df = pd.DataFrame({'buy_sell': ['Buy','Buy','Buy','Sell','Sell']})

m = df.buy_sell == 'Buy'
mask = np.column_stack([m] * 2)
df1 = pd.DataFrame(np.where(mask, ("John","Maggie"), ("Maggie","John")))
df[['Buyer ID', 'Seller ID']] = df1
print (df)
  buy_sell Buyer ID Seller ID
0      Buy     John    Maggie
1      Buy     John    Maggie
2      Buy     John    Maggie
3     Sell   Maggie      John
4     Sell   Maggie      John

EDIT:

After investigate original solution is possible broadcast boolean mask, only need [:, None] for N x 1 array:

m = df.buy_sell == 'Buy'
df1 = pd.DataFrame(np.where(np.array(m)[:, None], ("John","Maggie"), ("Maggie","John")))
df[['Buyer ID', 'Seller ID']] = df1
print (df)
  buy_sell Buyer ID Seller ID
0      Buy     John    Maggie
1      Buy     John    Maggie
2      Buy     John    Maggie
3     Sell   Maggie      John
4     Sell   Maggie      John

Detail:

print (np.array(m)[:, None])

[[ True]
 [ True]
 [ True]
 [False]
 [False]]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks - this works and makes sense - the shape of the boolean-index must match the items returned. But... I'm still not sure why we don't need to stack in the original case i.e. why is (df.buyer_name,df.seller_name) broadcastable, when ("John","Maggie") is not? Why is .T used to transpose the original answer - I suspect this is part of the magic? – Phil Dec 19 '17 at 18:14
  • 1
    I investigate it and find better solution with `N, 1` array of boolean. Please check it. – jezrael Dec 20 '17 at 08:47
0

The answer from jezrael gives a perfectly good way of doing this. But to explain why only the first example works in the original question, I found the links below a useful reference:

https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html and https://eli.thegreenplace.net/2015/broadcasting-arrays-in-numpy/

I've applied the references to the case in hand.

To recap:

First case - this works:

np.where(df.buy_sell == 'Buy',(df.buyer_name,df.seller_name),(df.seller_name,df.buyer_name))

Second case - this doesn't work:

np.where(df.buy_sell == 'Buy',("John","Maggie"), ("Maggie","John"))

Third case - this does work:

np.where(df.buy_sell == 'Buy', 'John', 'Maggie') 

What (I think!) is happening in the first case is an attempt to broadcast across:

(n,) (n,) (n,) - this is fine because all non-zero dimensions equal

In the second case it is

(n,) (2,) (2,) - this is not OK because non-equal dimensions eg n<>2 - the nature of tuples is that that they are (2,) and clash with the (n,) of the buy_sell.

In the final case it is

(n,) (1,) (1,) - this is the same as the above, however this works because you can stretch a 1 over n, so it doesn't clash.

So to construct something that works for the scalar case we need to alter the tuples:

(n,) (2,) (2,)

To avoid a mismatch we changed it to:

(n,) (2,1) (2,1)

Now this isn't obvious, but what numpy will automagically do to broadcast this is left-pad the (n,) to (1,n), giving us:

(1,n) (2,1) (2,1)

This way there are no mismatched dimensions >1, giving a broadcast object of (2,n) - 2 rows of n columns each. You can see this by manually applying the np.broadcast() to the 3 arrays and calling shape on the result.

It's important to understand the difference between (x,) and (x,1) to get how this works. Basically - (x,) only has 1 dimension, (x,1) has 2 dimensions where the 2nd dimension is restricted to a single value. See here for details: Difference between numpy.array shape (R, 1) and (R,)

So the desired result can be achieved using the construct below:

np.where(df.buy_sell == 'Buy', (["John"],["Maggie"]), (["Maggie"],["John"]))

The result is then transposed to get n rows of 2 columns, so each row can be passed as parameters into zip() to allow multiple assignments.

I'm pretty sure jezrael's solution effectively does the same thing, but in this case the buy_sell is given the extra dimension rather than the text outputs - but the same objective is achieved - keeping mismatched >1 dimensions on different axis.

In this case the buy_sell becomes (n,1) so we have

(n,1) (2,) (2,)

Which is left padded to

(n,1) (1,2) (1,2)

Giving a broadcast object of (n,2).

The nice thing about this solution is that no transpose is needed before applying zip().

Phil
  • 592
  • 6
  • 15