I have a DataFrame df, and a dict d, like so:

>>> df
   a   b
0  5  10
1  6  11
2  7  12
3  8  13
4  9  14
>>> d = {6: 22, 8: 26}

For every (key, val) in the dictionary, I'd like to find the row where column a matches the key, and overwrite its b column with the value. For example, in this particular case, the value of b in row 1 will change to 22, and its value in row 3 will change to 26.

How should I do that?

Dun Peal

1 Answer

Assuming it would be OK to propagate the new values to all rows where column a matches (in the event there were duplicates in column a) then:

for a_val, b_val in d.items():
    df['b'][df.a==a_val] = b_val

or to avoid chaining assignment operations:

for a_val, b_val in d.items():
    df.loc[df.a==a_val, 'b'] = b_val
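
As a self-contained sketch of the `loc` version, the snippet below rebuilds the `df` and `d` from the question (the constructor call is an assumption about how the frame was created) and applies the loop. It uses `d.items()`, which exists in both Python 2 and Python 3, whereas `iteritems()` exists only in Python 2.

```python
import pandas as pd

# Rebuild the example data from the question (construction is assumed).
df = pd.DataFrame({"a": [5, 6, 7, 8, 9], "b": [10, 11, 12, 13, 14]})
d = {6: 22, 8: 26}

# For each key, select the rows where column "a" equals the key and
# write the mapped value into column "b" in a single indexing step.
for a_val, b_val in d.items():
    df.loc[df["a"] == a_val, "b"] = b_val

print(df)
```

Because the row mask and the column label are passed to one `loc` call, there is no intermediate object whose view-or-copy status matters.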

Note that to use loc you must be working with Pandas 0.11 or newer. For older versions, you may be able to use .ix to prevent the chained assignment.

@Jeff pointed to this link which discusses a phenomenon that I had already mentioned in this comment. Note that this is not an issue of correctness, since reversing the order of access has a predictable effect. You can see this easily, e.g. below:

In [102]: id(df[df.a==5]['b'])
Out[102]: 113795992

In [103]: id(df['b'][df.a==5])
Out[103]: 113725760

If you get the column first and then assign based on indexes into that column, the changes affect that column. And since the column is part of the DataFrame, the changes affect the DataFrame. If you index a set of rows first, you're no longer talking about the same DataFrame, so getting the column from the filtered object won't give you a view of the original column.
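
A minimal sketch of the rows-first case, assuming the example frame from the question: assigning through a boolean-filtered frame leaves the original untouched, while a single `loc` call persists. The column-first form is deliberately not asserted here because its behavior depends on the pandas version (with Copy-on-Write enabled, chained assignment does not modify the original frame either). Warnings are silenced only to keep the output quiet.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [5, 6, 7, 8, 9], "b": [10, 11, 12, 13, 14]})

with warnings.catch_warnings():
    # pandas warns about chained assignment; ignore that for this demo.
    warnings.simplefilter("ignore")
    # Rows first: df[df.a == 6] is a new, filtered DataFrame, so the
    # assignment lands on that temporary object, not on df.
    df[df.a == 6]["b"] = 22

rows_first_result = df["b"].tolist()  # df is unchanged at this point

# One indexing step on df itself: the assignment persists.
df.loc[df.a == 6, "b"] = 22
loc_result = df["b"].tolist()

print(rows_first_result)
print(loc_result)
```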

@Jeff suggests that this makes it "incorrect" whereas my view is that this is the obvious and expected behavior. In the special case when you have a mixed data type column and there is some type promotion/demotion going on that would prevent Pandas from writing a value into the column, then you might have a correctness issue with this. But given that loc is not available until Pandas 0.11, I think it's still fair to point out how to do it with chained assignment, rather than pretending like loc is the only thing that could possibly ever be the correct choice.

If anyone can provide more definitive reasons to think it is "incorrect" (as opposed to just not preferring this stylistically), please contribute and I will try to make a more thorough write-up about the various pitfalls.

ely
  • chaining assignment is not the correct way to do this. `df.loc[df.a==a_val,'b'] = b_val` better – Jeff Oct 01 '13 at 20:58
  • Thank you for the insightful comment. In general, this is one of my biggest gripes with Pandas. Using alternate functions with assignment syntax might be suboptimal design. At least with `[]`, there's no confusion. That operation is purely for getting and setting. I dislike it that an entirely additional function, `loc` (or even `ix` honestly), subsumes that functionality. It hides the fact that it's a function entirely predicated on a side-effect, whereas in most of the rest of Python, `__getitem__` and `__setitem__` are the standards for absorbing that side-effect impurity. – ely Oct 01 '13 at 21:05
  • Thanks EMS and Jeff. Can either of you explain what's the problem with "chained assignment"? – Dun Peal Oct 01 '13 at 21:12
  • There are many reasons why chaining together assignments or accesses is bad. One reason is that it obfuscates the intent of the code. The person reading my first answer must understand that `['b']` obtains the column of `df` essentially by reference, so that the next thing, `[df.a==a_val]` (getting some part of that and assigning into it) actually modifies the data that `df` looks at. It adds that extra layer of inferential distance between the result of the operation and its readability. – ely Oct 01 '13 at 21:15
  • The problem with the way Pandas chooses to address this is that it mixes up the `[]` notation (which has come to mean *to access*) with function call notation `()`. The obfuscation saved by not needing to reason step-wise about each iterative get or set is replaced by obfuscation that `loc` is not itself a data object but rather a function that pretends not to behave like a callable and acts like it is gettable or settable. – ely Oct 01 '13 at 21:17
  • @EMS: so, essentially, this is about readability? There won't be any performance or correctness implications? If so, I fail to see much of a difference between the first and second snippets. Both use non-standard slicing, and neither would make sense to a Python programmer unfamiliar with the very special and creative way Pandas uses `__getitem__`. – Dun Peal Oct 01 '13 at 21:23
  • There's no direct issue of correctness, but one pain about the chained assignment approach is that if you reverse the order, and instead do: `df[df.a==a_val]['b']` the assignment won't work (since the `b` column of that filtered data frame is not the same object as the `b` column of the unfiltered data frame). If the context of your code makes it easy to flip the order of those operations around, you might make it easy to have bugs. But otherwise, no there is no correctness issue. In fact, the first method was standard before some newer releases of Pandas. – ely Oct 01 '13 at 21:28
  • @EMS thanks for the detailed explanation! Final question: in some cases `d` will be very large, and I'd rather avoid iterating over it in Python. Is there a way to vectorize this? I think I can use `Series.update()` for this, but how do I efficiently convert `d` to a Series with an index corresponding to `df`? – Dun Peal Oct 01 '13 at 21:32
  • See [here](http://pandas.pydata.org/pandas-docs/dev/indexing.html#returning-a-view-versus-a-copy) for why chained indexing can work but in general is not guaranteed. @EMS You can have your viewpoint, but you have made several errors: the behavior of chained setting has not changed at all since as far back as I can remember. It has always been at the mercy of numpy views and python syntax. Secondly, the reason to avoid chained indexing is that it *sometimes* will fail silently, most notably, but not exclusively, in mixed-dtype structures. – Jeff Oct 01 '13 at 22:39
  • @EMS that's not entirely correct. For example, there's no way to express the `:` (slice) in a clear way in a function call. Moreover, the point of allowing operators to be overridden is to be able to allow expressive and high level code. Fundamentally everything with the getitem syntax in pandas is about setting or getting values. However, when you have a multidimensional object with labeled axes and rows, you necessarily trade some complexity for ease of manipulating data – Jeff Tratner Oct 01 '13 at 23:26
  • @EMS: would you mind removing the first snippet from your answer? From Jeff's comment, it seems like the snippets aren't actually equivalent. – Dun Peal Oct 01 '13 at 23:40
  • Please see my revisions. I had already mentioned @Jeff's point in an earlier comment and it is unrelated to correctness since the result of the chained assignment is entirely predictable. Reversing the order has a well-defined meaning that would lead one to expect the assignment would not persist if you filtered rows first and then assigned into a column. I'm not sure I understand why this is being called a "correctness" issue. – ely Oct 02 '13 at 13:12
  • Also, @Jeff, I did not claim that the behavior of chained assignments has ever changed in Pandas. I claimed that the use of `.loc` for this is new, and that in fact the chained assignment (which has been available all the way back) used to be the 'right' way to do it in older versions. – ely Oct 02 '13 at 13:13
  • Lastly, either way does not matter to me. If there's more definitive examples for mixed datatype columns, I'm happy to endorse `loc` as the preferred method. But, for example, I work in a production environment where we are still using Pandas 0.8. And we will be using it for at least another year because of the rules about how we can modify our production system. Therefore, regardless of what is the official, newest-release-endorsed, 'correct' method, it's still useful to point out other ways to solve something. Using `loc` is not even possible for me except on our research system. – ely Oct 02 '13 at 13:21
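
The vectorization question raised in the comments (avoiding a Python-level loop when `d` is very large) is left unanswered in the thread. One possible approach, offered here as a sketch rather than as the answerer's method, is to build a boolean mask with `Series.isin` and look up the replacements with `Series.map`. The frame construction is an assumption that mirrors the question's example.

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 6, 7, 8, 9], "b": [10, 11, 12, 13, 14]})
d = {6: 22, 8: 26}

# Boolean mask of the rows whose "a" value is a key of d.
mask = df["a"].isin(list(d))

# Map each matching "a" value to its replacement and write the results
# back into "b" with a single loc assignment (aligned on the index).
df.loc[mask, "b"] = df.loc[mask, "a"].map(d)

print(df)
```

Restricting the `map` call to the masked rows means every looked-up key is present in `d`, so no missing values are produced and rows without a matching key keep their original `b`. Like the loop, this updates every row whose `a` matches, including duplicates.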