Creating a column in Python

Question

I'm trying to create categories for a column Age with the following Python code:

for x in range(len(Ttrain.Age)):
       if Ttrain.Age(x) == 0:
             Ttrain.Age_Group(x) == 'Missing'
        elif 0 < Ttrain.Age(x) <= 18:
             Ttrain.Age_Group(x) == 'Young'
        elif 19 < Ttrain.Age(x) <= 40:
             Ttrain.Age_Group(x) == 'Adult'
        elif 41 < Ttrain.Age(x) <= 60:
             Ttrain.Age_Group(x) == 'Middle_Aged'
        elif Ttrain.Age(x) > 60:
             Ttrain.Age_Group(x) == 'Old'
 % (x)

The goal is for the Age_Group column to contain the categories, but I'm getting the following error:

TypeError: 'Series' object is not callable

Please note that I have replaced all the missing values in the Age column with 0.

It eludes me why you would try and tackle a learning problem when the basics of the language are not understood. That is bound to keep running into problems. As for two hints I can give you right off the bat: the error tells you what is wrong. You invoke the `__call__` method by using `( )` parentheses, which is not valid for a `Series` object. Using square brackets will let you access indices. A second hint is that you are not setting anything. All of the `if` clauses result in another comparison which will not change anything in your `Ttrain.Age_Group` Series. — Uvar, Jan 12 '18 at 15:18

score 1 · Accepted Answer · answered Jan 12 '18 at 17:18

Your code has several problems which, as @Uvar mentioned in the comments, suggest that you do not yet have a good grasp of the fundamentals of Python. I highly recommend that you read one of the many free online tutorials for both Python and pandas.

Let me explain some of your errors, fix your code, and then offer a better solution.

Function Calls (`__call__`) vs. Getting Items (`__getitem__`)

Parentheses () in Python denote a function call. When you write if Ttrain.Age(x) == 0:, what you want is to access the x-th element of the Series Ttrain.Age, but the interpreter thinks you want to call the Series as a function (hence your error message). The correct syntax uses square brackets [] for indexing: if Ttrain.Age[x] == 0:. With the default 0, 1, 2, ... index, the label x and the x-th position coincide, which is what this relies on.
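A minimal sketch of the difference, using a small made-up Series (ages is a hypothetical stand-in for Ttrain.Age, not your data):

```python
import pandas as pd

# hypothetical stand-in for Ttrain.Age, with the default 0, 1, 2 index
ages = pd.Series([0, 10, 20])

# Parentheses try to call the Series like a function, which fails.
try:
    ages(1)
except TypeError as err:
    call_error = str(err)  # the same message as in the question

# Square brackets look up the element with index label 1.
second = ages[1]
```

Here call_error holds the "not callable" message from the question, while second holds the value stored under label 1.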

Assignment vs. Equality

A double equals sign (==) tests for equality. A single equals sign (=) is for assignment.

Ttrain.Age_Group[x] == 'Missing' tests whether the x-th element of the Series Ttrain.Age_Group equals the string 'Missing' and then throws the result away. What you meant to write was: Ttrain.Age_Group[x] = 'Missing'
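To make this concrete, here is a small sketch on a standalone Series (groups is a hypothetical name used only for illustration):

```python
import pandas as pd

# hypothetical Series of not-yet-assigned group labels
groups = pd.Series([None, None, None], dtype=object)

# == only compares; the Series is left untouched.
comparison = groups[0] == 'Missing'
still_empty = groups[0] is None

# = actually stores the value under index label 0.
groups[0] = 'Missing'
stored = groups[0]
```

The comparison produces a boolean and changes nothing; only the single = writes into the Series.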

Access before initialization

Putting the above two points together:

for x in range(len(Ttrain)):
    if Ttrain.Age[x] == 0:
        Ttrain.Age_Group[x] = 'Missing'

...
AttributeError: 'DataFrame' object has no attribute 'Age_Group'

This is because the Age_Group series does not yet exist, so the interpreter doesn't know what to do. You need to first define this column.

Ttrain['Age_Group'] = None
for x in range(len(Ttrain)):
    if Ttrain.Age[x] == 0:
        Ttrain.Age_Group[x] = 'Missing'

But doing this will cause the following warning:

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

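This is a bad thing: the assignment goes through two indexing steps (Ttrain.Age_Group, then [x]), so pandas cannot guarantee that it reaches the original DataFrame rather than a temporary copy. I'll leave the details as an exercise for you to research and understand.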

Putting it all together

Here is a cleaned-up version of your code. Notice that I have also changed the elif conditions; I'll leave it to you to think about why.

import pandas as pd
# mock up some dummy data
Ttrain = pd.DataFrame({'Age': range(0, 80, 10)})
# Ttrain['Age_Group'] = None  # don't need this since we're using loc
for x in range(len(Ttrain)):
    if Ttrain.Age[x] == 0:
        Ttrain.loc[x, 'Age_Group'] = 'Missing'
    elif Ttrain.Age[x] <= 18:
        Ttrain.loc[x, 'Age_Group'] = 'Young'
    elif Ttrain.Age[x] <= 40:
        Ttrain.loc[x, 'Age_Group'] = 'Adult'
    elif Ttrain.Age[x] <= 60:
        Ttrain.loc[x, 'Age_Group'] = 'Middle_Aged'
    elif Ttrain.Age[x] > 60:
        Ttrain.loc[x, 'Age_Group'] = 'Old'

A better way

In general, iterating over the rows of a DataFrame is a bad idea: it's slow, and in many cases there's a better way to do what you're trying to do. Sometimes it's unavoidable, but not in this example. (Look up boolean masks in pandas.)
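For context, a mask is just a boolean Series with one True/False per row. A minimal sketch, where ages is a hypothetical sample and not your data:

```python
import pandas as pd

ages = pd.Series([0, 10, 20, 70])  # hypothetical sample ages

# Each comparison is evaluated for all rows at once; & combines them element-wise.
# The parentheses are required because & binds more tightly than > and <=.
mask = (ages > 0) & (ages <= 18)

# Indexing with the mask keeps only the rows where it is True.
selected = ages[mask]
```

Passing such a mask as the row selector to .loc lets you assign to all matching rows in one step, with no explicit loop.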

import pandas as pd

# create a dummy dataset for this example
Ttrain = pd.DataFrame({'Age': range(0, 80, 10)})

Ttrain.loc[Ttrain.Age == 0, 'Age_Group'] = 'Missing'
Ttrain.loc[(Ttrain.Age > 0) & (Ttrain.Age <= 18), 'Age_Group'] = 'Young'
Ttrain.loc[(Ttrain.Age > 18) & (Ttrain.Age <= 40), 'Age_Group'] = 'Adult'
Ttrain.loc[(Ttrain.Age > 40) & (Ttrain.Age <= 60), 'Age_Group'] = 'Middle_Aged'
Ttrain.loc[(Ttrain.Age > 60), 'Age_Group'] = 'Old'
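
As an alternative sketch (not part of the approach above), pandas also provides pd.cut for this kind of binning. The bin edges below are my assumption, chosen to reproduce the same groups; they rely on ages being non-negative, with 0 marking a missing value. Note that the resulting column has a categorical dtype rather than plain strings.

```python
import pandas as pd

# same dummy dataset as above
Ttrain = pd.DataFrame({'Age': range(0, 80, 10)})

# Intervals are right-inclusive by default:
# (-1, 0], (0, 18], (18, 40], (40, 60], (60, inf]
bins = [-1, 0, 18, 40, 60, float('inf')]
labels = ['Missing', 'Young', 'Adult', 'Middle_Aged', 'Old']

Ttrain['Age_Group'] = pd.cut(Ttrain['Age'], bins=bins, labels=labels)
```

Any age below -1 would fall outside the bins and come out as NaN, so adjust the first edge if your data can contain such values.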

Final output

This is what the final dataframe looks like:

>>> print(Ttrain)

   Age    Age_Group
0    0      Missing
1   10        Young
2   20        Adult
3   30        Adult
4   40        Adult
5   50  Middle_Aged
6   60  Middle_Aged
7   70          Old

Good luck.