I'm having some trouble understanding pandas behavior when comparing columns in a dataframe. What I want to do is to apply logical operations to different columns and produce a result column that evaluates to True or False based on the result of the logic. (Logic may also apply to lagged column values produced with .shift(), but I don't think that is necessary for this question.)
The problem is that I understand the comparison df.A < df.B is vectorized (so it's very fast) and should produce an element by element result. This works when I assign to a series, but when I try to assign it to a new column it errors. Here is an example:
df = pd.DataFrame(np.random.randn(10,2),index=(np.arange(10)),columns=['A','B'])
df['C'] = False # must add column with [] notation rather than .C
a = df.A < df.B
df.C = A
df
This produces the expected output:
A B C
0 1.222631 0.568988 False
1 -0.719666 0.733197 True
2 -2.434720 -0.131745 True
3 0.653228 0.428794 False
4 0.862103 0.402158 False
5 -0.256027 -0.819937 False
6 -1.728418 1.463709 True
7 -1.110928 -2.173016 False
8 0.656576 -1.218179 False
9 0.014519 -0.854039 False
So, continuing and attempting without going through the intermediate step of assigning to a series:
df['C'] = False # not necessary but a reset
if df.A < df.B: df.C = True
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So the question here is why does this work when I first go through the series, but cannot assign directly to the column? I suspect there is more about this behavior I don't understand.
Let me continue with a related example. I understand that np.where() is probably a cleaner way to do operations like this than with the if statement, but I still run into an issue that points to a lack of understanding. Here are three lines that I would think should be equivalent:
df['C'] = np.where((df.A < 0 & df.B > df.A), True, False) #1 Errors
df['C'] = np.where((df.A < 0) and (df.B > df.A), True, False) #2 Errors
df['C'] = np.where((df.A < 0) & (df.B > df.A), True, False) #3 Works
The difference between #2 and #3 is the and vs &... I suspect there is something bit-wise going on here behind the scenes I do not fully understand. But why does Ex 1 error? The extra parenthesis are not needed, correct? Condition1 & Condition 2 (Condition1) & (Condition2)
Why do those produce different results? More importantly, where is this documented? Just trying to expand my own learning and understanding and learn how to deal with issues like this myself.
Thank you!