Is there a significant difference between data['column_name'] vs data.column_name

Question

For example, I'm studying an example like this:

train['Datetime'] = pd.to_datetime(train.Datetime,format='%d-%m-%Y %H:%M')

If I run train['Datetime'].head() and train.Datetime.head(), the results are identical. So why use one over the other? Or why use both?

Dot notation is a shortcut and is less reliable than [' '] notation. Dot notation will not work if your column headers contain whitespace or special characters. For example, if you have a dataframe with a column 'Date Time', you CANNOT use `df.Date Time.head()`; you must use `df['Date Time'].head()`. — Scott Boston, Jul 14 '18 at 01:39
Thank you for the response. For some reason I had a hard time googling this. — NimbleTortoise, Jul 14 '18 at 01:41
However, there are some advantages to dot notation. One is that in some development environments, such as Jupyter notebook, dot notation lets the code helper pop up all available methods that can be called on a dataframe column. At least in Jupyter notebook, this is not available when using [' '] notation. A second could be readability, if you are used to programming languages built around dot notation. — Scott Boston, Jul 14 '18 at 01:48
I am looking at this from jupyter notebook so maybe that's why they used the dot notation. By "code helper", I'm guessing this is something I have to install on jupyter? — NimbleTortoise, Jul 14 '18 at 02:07
Also, if you have a column whose name collides with a built-in method or attribute, such as `index`, `values`..., you will have to use dictionary notation — user3483203, Jul 14 '18 at 02:19
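
The name-collision point in the last comment can be checked with a small sketch. The frame and its column names are made up for illustration; the behaviour shown is ordinary attribute lookup, where an existing DataFrame attribute takes precedence over a column with the same name.

```python
import pandas as pd

# Made-up frame: the column "index" shares its name with a DataFrame attribute
df = pd.DataFrame({"index": [10, 20], "x": [1.0, 2.0]})

# Bracket notation always returns the column
index_column = df["index"]

# Dot notation returns the DataFrame's row index, not the column
index_attribute = df.index

# For a plain, non-colliding name, both spellings return the same data
same_x = df.x.equals(df["x"])
```

Here `index_column` holds the values 10 and 20, while `index_attribute` is the row index of the frame.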

score 2 · Answer 1 · answered Jul 14 '18 at 02:22

I have used both. I think the most important consideration is about how sustainable and flexible you want your code to be. For quick checks and "imperative programming" (like Jupyter Notebooks), you could use the minimal shorthand:

train.Datetime.head()

However, you will soon find that when you want to pass around variables that may come from a UI or some other source, or debug code efficiently, the full notation like this:

train['Datetime'].head()

has two main benefits, and it is good to make it a habit early on when programming.

First, in integrated development environments (IDEs) used for editing code, the string 'Datetime' will be highlighted to remind you that it is a "hard dependency" in your code, whereas Datetime (no quotes, just a .) will not be highlighted.

This may not sound like a big deal, but when you are looking at hundreds of lines of code (or more), seeing where you have "hardcoded" a variable name is important.

The other main advantage of [] notation is that you can pass string variables to it.

import pandas as pd
import numpy as np

# make some data
n=100
df = pd.DataFrame({
    'Fruit': np.random.choice(['Apple', 'Orange', 'Grape'], n),
    'Animal': np.random.choice(['Cat', 'Dog', 'Fish'], n),
    'x1': np.random.randn(n)})

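# some name from a user interface; it could be "Fruit" or "Animal"
group = "Animal"

# use that string variable in an expression (in this case, as a group by);
# aggregate only the numeric column, since 'mean' and 'std' of a string
# column raise an error in recent pandas versions
df.groupby(group)['x1'].agg(['count', 'mean', 'std'])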

Here, even on Stack Overflow, you can see that the df.groupby() call contains no hardcoded strings (shown in red text). This separation of user inputs from the code that acts on them is subtle, but extremely important.
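
As a minimal sketch of the same idea, column names can be passed into a function as ordinary string arguments. The helper `summarize` and its parameter names are hypothetical, not part of pandas, and the data is made up. The body relies on bracket notation, because dot notation cannot look up a name held in a variable.

```python
import pandas as pd

def summarize(frame, group_col, value_col):
    """Hypothetical helper: group by one column and summarize another."""
    # Both names arrive as strings, so bracket notation is needed here
    return frame.groupby(group_col)[value_col].agg(["count", "mean"])

# Small made-up data set for illustration
df = pd.DataFrame({
    "Fruit": ["Apple", "Orange", "Apple", "Grape"],
    "Animal": ["Cat", "Dog", "Cat", "Dog"],
    "x1": [1.0, 2.0, 3.0, 4.0],
})

# The caller decides which columns to use; the function body never changes
by_animal = summarize(df, "Animal", "x1")
by_fruit = summarize(df, "Fruit", "x1")
```

Writing `frame.group_col` inside the function would look for a column literally named "group_col", which is why the string has to go through `[]` or a method argument.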

Good luck!

score 1 · Answer 2 · answered Jul 14 '18 at 01:41

There will be an issue when the column name contains blank spaces; in that case, indexing is a must.

SudipM
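
A short sketch of the whitespace case, using a made-up frame. It assumes nothing beyond standard Python and pandas: a label can be written after a dot only if it is a valid Python identifier, and `str.isidentifier()` is one way to check that.

```python
import pandas as pd

# Made-up frame with a space in one column label
df = pd.DataFrame({"Date Time": ["14-07-2018 01:41"], "value": [1]})

# Which labels could be written after a dot?
dot_ok = {label: label.isidentifier() for label in df.columns}

# Bracket indexing handles the label with a space
first = df["Date Time"].iloc[0]

# The dot form is not even valid Python syntax, so it fails before pandas runs
try:
    compile("df.Date Time.head()", "<example>", "eval")
    dot_form_parses = True
except SyntaxError:
    dot_form_parses = False
```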
