Vectorized column comparision questions in pandas

Question

I'm having some trouble understanding pandas behavior when comparing columns in a dataframe. What I want to do is to apply logical operations to different columns and produce a result column that evaluates to True or False based on the result of the logic. (Logic may also apply to lagged column values produced with .shift(), but I don't think that is necessary for this question.)

The problem is that I understand the comparison df.A < df.B is vectorized (so it's very fast) and should produce an element by element result. This works when I assign to a series, but when I try to assign it to a new column it errors. Here is an example:

df = pd.DataFrame(np.random.randn(10,2),index=(np.arange(10)),columns=['A','B'])
df['C'] = False  # must add column with [] notation rather than .C
a = df.A < df.B
df.C = A
df

This produces the expected output:

          A         B      C
0  1.222631  0.568988  False
1 -0.719666  0.733197   True
2 -2.434720 -0.131745   True
3  0.653228  0.428794  False
4  0.862103  0.402158  False
5 -0.256027 -0.819937  False
6 -1.728418  1.463709   True
7 -1.110928 -2.173016  False
8  0.656576 -1.218179  False
9  0.014519 -0.854039  False

So, continuing and attempting without going through the intermediate step of assigning to a series:

df['C'] = False  # not necessary but a reset
if df.A < df.B: df.C = True

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So the question here is why does this work when I first go through the series, but cannot assign directly to the column? I suspect there is more about this behavior I don't understand.

Let me continue with a related example. I understand that np.where() is probably a cleaner way to do operations like this than with the if statement, but I still run into an issue that points to a lack of understanding. Here are three lines that I would think should be equivalent:

df['C'] = np.where((df.A < 0 & df.B > df.A), True, False)  #1 Errors
df['C'] = np.where((df.A < 0) and (df.B > df.A), True, False)  #2 Errors
df['C'] = np.where((df.A < 0) & (df.B > df.A), True, False)  #3 Works

The difference between #2 and #3 is the and vs &... I suspect there is something bit-wise going on here behind the scenes I do not fully understand. But why does Ex 1 error? The extra parenthesis are not needed, correct? Condition1 & Condition 2 (Condition1) & (Condition2)

Why do those produce different results? More importantly, where is this documented? Just trying to expand my own learning and understanding and learn how to deal with issues like this myself.

Thank you!

score 2 · Accepted Answer · edited May 23 '17 at 11:59

To answer you questions,

Why if df.A < df.B: df.C = True fails?

Ans: I think this is caused by "if". You could consider "if" as a function, which only accepts True/False or something with known True/False evaluation(i.e. if 1, if -1, if 0, if None). You could refer to https://docs.python.org/2/library/stdtypes.html. Basically everything in python is expanding object class. If the class has implementation for nonzero or len, "if" should be working. You could try the following:

>>> x = 1
>>> x.__nonzero__()
True
>>> x = df.C
>>> x.__nonzero__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/generic.py", line 731, in __nonzero__
     .format(self.__class__.__name__))
 ValueError: The truth value of a Series is ambiguous. Use a.empty,      a.bool(), a.item(), a.any() or a.all().

If you still want to know have more details, I will refer you to the pandas source code as https://github.com/pydata/pandas/blob/master/pandas/core/generic.py

Why and does not work comparing with &?

I think it is answered Difference between 'and' (boolean) vs. '&' (bitwise) in python. Why difference in behavior with lists vs numpy arrays?

Why do we need to have bracket?

If you examine the error message,

>>> df.B>df.A & df.A <0
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/Library/Python/2.7/site-packages/pandas/core/ops.py", line 786, in wrapper
    return filler(self._constructor(na_op(self.values, other.values),
   File "/Library/Python/2.7/site-packages/pandas/core/ops.py", line 758, in na_op
    result = lib.vec_binop(x, y, op)
   File "pandas/lib.pyx", line 914, in pandas.lib.vec_binop (pandas/lib.c:16248)
   File "pandas/lib.pyx", line 907, in pandas.lib.vec_binop (pandas/lib.c:16122)
 TypeError: unsupported operand type(s) for &: 'float' and 'bool'

It means in fact the & operator is trying to be used between float and bool. Where is the float and where is the bool? The bool is df.B>df.A, the float is df.A. What does this imply? It means & does not have operator precedence over (<,>), check this https://www.ibiblio.org/swaroopch/byteofpython/read/operator-precedence.html

On the other hand, "and" has operator precedence over (<,>), thus and working.

To dig it further by yourself, I believe checking the source code will be a good way. Hope it answers your questions.

Ok thank you. That's helpful. I guess I assumed that C = A < B and if A < B: C = True were equivalent statements, but I see I need to work a bit more to understand the distinctions. Thank you. — user5747140, Jan 10 '16 at 18:16

Vectorized column comparision questions in pandas

1 Answers1