
I frequently go back and forth between using DataFrame attributes to refer to columns and using the bracket method.

I am wondering which format is considered "best practice," and whether there are any performance differences between the two (or whether this varies by circumstance). I am not finding many resources on this subject.

Here's a simplistic example of what I mean: creating a column "green" whose rows are True if columns "blue" and "yellow" are both True, and False otherwise.

# using brackets.
df['green'] = np.where((df['blue']==True) & (df['yellow']==True), True, False)

vs.

# using periods.
df['green'] = np.where((df.blue==True) & (df.yellow == True), True, False)

I often find myself using the latter as it looks cleaner, is shorter, and is easier to type. However, I often see pandas examples here and in other sources using both methods.

  • Is there a performance difference in using either format?
  • Which format is considered best practice?
Nate
  • The most common issue I've seen here: consider a dataframe `df` with columns `df['name']` and `df['shape']`. Saving 4 characters of typing will _really_ screw you up when you start setting/changing the dataframe's `name` attribute instead of making changes to the `'name'` column, or throwing errors when trying to change the default behavior of `df.shape` (see the sketch below these comments). IMO always, always use the brackets for column access – G. Anderson Mar 22 '22 at 22:24
  • See also [What is the difference between using squared brackets or dot to access a column?](https://stackoverflow.com/questions/41130255/what-is-the-difference-between-using-squared-brackets-or-dot-to-access-a-column) – G. Anderson Mar 22 '22 at 22:25
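A minimal sketch of the pitfall the first comment describes (the column values here are just for illustration): reading df.shape returns the DataFrame's dimensions rather than the 'shape' column, and attribute assignment cannot create a new column at all.

import pandas as pd

df = pd.DataFrame({'blue': [True, False], 'yellow': [True, True],
                   'shape': ['circle', 'square']})

# Reading: df.shape is the (rows, columns) tuple, not the 'shape' column.
print(df.shape)              # (2, 3)
print(df['shape'].tolist())  # ['circle', 'square']

# Writing: attribute assignment cannot create a new column; it only sets a
# plain Python attribute on the object (pandas emits a UserWarning here).
df.green = df['blue'] & df['yellow']
print('green' in df.columns)  # False

# Bracket assignment always targets the column.
df['green'] = df['blue'] & df['yellow']
print('green' in df.columns)  # True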

2 Answers


There is no performance difference between the two notations:

  • df.blue uses __getattr__ to look up the right column
  • df['blue'] uses __getitem__ to look up the right column (or index)

The column name needs to be a valid Python identifier if you want to use the first form, and you can't use it with column names that collide with existing attributes or methods like shape, size, values, and so on.

The second form is more explicit, and it is used by the LocIndexer. It allows column names like 2022 or Energy (KWH). I clearly prefer this notation.
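For example (a minimal sketch using the column names mentioned above), bracket notation is the only way to reach columns whose names aren't valid identifiers or that collide with DataFrame attributes:

import pandas as pd

df = pd.DataFrame({2022: [1, 2], 'Energy (KWH)': [3.5, 4.2], 'size': [10, 20]})

# Not valid Python identifiers, so attribute access is a SyntaxError:
print(df[2022])
print(df['Energy (KWH)'])

# 'size' collides with the DataFrame.size attribute:
print(df.size)      # 6 -- the number of cells, not the column
print(df['size'])   # the actual column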

Corralien

If performance matters, don't use where or similarly costly functions; a plain boolean mask will do the job. Using timeit gives you an idea of the time spent:

import pandas as pd
import numpy as np
n = 100
df = pd.DataFrame({'yellow': np.random.randint(0, 2, n),
                   'blue': np.random.randint(0, 2, n)}, dtype=bool)

%timeit np.where((df['blue']==True) & (df['yellow']==True), True, False)
252 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit np.where((df.blue==True) & (df.yellow == True), True, False)
245 µs ± 3.06 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df['blue'] & df['yellow']
72.1 µs ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit df.blue & df.yellow
77.1 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In terms of performance the two are essentially equivalent, and statistically you cannot tell the two approaches apart. In fact, in a costly operation (such as where), the real bottleneck is not how you access the elements.

Regarding the syntax, I prefer using .loc or .iloc to access elements since I find them more "pandas-ic", but that's totally up to you.
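For completeness, a minimal sketch of the original example rewritten with a plain boolean mask (no np.where), assigned through brackets and, equivalently, through .loc:

import numpy as np
import pandas as pd

n = 100
df = pd.DataFrame({'yellow': np.random.randint(0, 2, n),
                   'blue': np.random.randint(0, 2, n)}, dtype=bool)

# The boolean columns are already masks; combining them with & is enough.
df['green'] = df['blue'] & df['yellow']

# Equivalent assignment through .loc:
df.loc[:, 'green'] = df['blue'] & df['yellow']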

Zelemist