You have many problems in your code which, as @Uvar mentioned in the comments, indicate that you do not have a good grasp of the fundamentals of python. I highly recommend that you read one of the many tutorials available for free online for both python and pandas.
Let me try to explain some of your errors, fix your code, and try to provide a better solution.
Function Calls (__call__) vs. Getting Items (__getitem__)
Parentheses () in python refer to function calls. When you write if Ttrain.Age(x) == 0: what you want to do is access the x-th element of the Series Ttrain.Age. But the interpreter thinks you want to call the Series as a function (hence your error message). The correct syntax is to use square brackets [] for indexing: if Ttrain.Age[x] == 0:
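To see the difference for yourself, here is a minimal sketch. The Series ages is made-up data standing in for your Ttrain.Age, and the variable names are only for illustration:

```python
import pandas as pd

# hypothetical stand-in for Ttrain.Age
ages = pd.Series([0, 10, 20])

# Parentheses try to *call* the Series, which raises a TypeError
try:
    ages(1)
except TypeError as err:
    call_error = str(err)

# Square brackets look up the element labelled 1 in the default index
second_age = ages[1]
```

After running this, call_error holds the "not callable" message you saw, and second_age holds the value 10.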
Assignment vs. Equality
Double equal signs (==) are used to test for equality. A single equal sign (=) is for assignment. Ttrain.Age_Group[x] == 'Missing' tests whether the x-th element in the Series Ttrain.Age_Group equals the string 'Missing' and then throws the result away. What you meant to write was: Ttrain.Age_Group[x] = 'Missing'
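A small sketch of the two operators side by side. The Series groups is made up for this example and is not part of your code:

```python
import pandas as pd

# hypothetical stand-in for Ttrain.Age_Group
groups = pd.Series([None, None], dtype=object)

# == only compares: it produces a boolean and leaves groups untouched
is_missing = groups[0] == 'Missing'
still_empty = groups[0] is None

# = stores the value under index label 0
groups[0] = 'Missing'
```

The comparison evaluates to False and changes nothing, while the assignment is what actually writes 'Missing' into the Series.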
Access before initialization
Putting the above two points together:
for x in range(len(Ttrain)):
    if Ttrain.Age[x] == 0:
        Ttrain.Age_Group[x] = 'Missing'
...
AttributeError: 'DataFrame' object has no attribute 'Age_Group'
This is because the Age_Group column does not yet exist, so the interpreter doesn't know what to do. You need to first define this column.
Ttrain['Age_Group'] = None
for x in range(len(Ttrain)):
    if Ttrain.Age[x] == 0:
        Ttrain.Age_Group[x] = 'Missing'
But doing this will cause the following warning:
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame
This is a bad thing. I'm not going to go into the details. I'll leave it as an exercise for you to research and understand why.
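As a starting point for that research: Ttrain.Age_Group[x] = ... is two indexing steps (first fetch the column, then set an element on whatever was fetched), whereas .loc selects the row and the column and assigns in a single call on the DataFrame itself. A minimal sketch, using a made-up frame df rather than your data:

```python
import pandas as pd

df = pd.DataFrame({'Age': [0, 10, 20]})  # made-up data
df['Age_Group'] = None                   # define the column first

# One call on df itself: row label 0, column 'Age_Group'
df.loc[0, 'Age_Group'] = 'Missing'
```

Only the first row is filled in; the other rows keep their empty placeholder.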
Putting it all together
Here is a cleaned up version of your code. You will notice that I have also changed the elif conditions. I'll leave that up to you to think about why.
import pandas as pd
# mock up some dummy data
Ttrain = pd.DataFrame({'Age': range(0, 80, 10)})
# Ttrain['Age_Group'] = None # don't need this since we're using loc
for x in range(len(Ttrain)):
    if Ttrain.Age[x] == 0:
        Ttrain.loc[x, 'Age_Group'] = 'Missing'
    elif Ttrain.Age[x] <= 18:
        Ttrain.loc[x, 'Age_Group'] = 'Young'
    elif Ttrain.Age[x] <= 40:
        Ttrain.loc[x, 'Age_Group'] = 'Adult'
    elif Ttrain.Age[x] <= 60:
        Ttrain.loc[x, 'Age_Group'] = 'Middle_Aged'
    elif Ttrain.Age[x] > 60:
        Ttrain.loc[x, 'Age_Group'] = 'Old'
A better way
In general, iterating through a dataframe is bad. It's slow and, in many cases, there's a better way to do what you're trying to do. Sometimes it's unavoidable, but that's not the case in this example. (Look up pandas masks.)
import pandas as pd
# create a dummy dataset for this example
Ttrain = pd.DataFrame({'Age': range(0, 80, 10)})
Ttrain.loc[Ttrain.Age == 0, 'Age_Group'] = 'Missing'
Ttrain.loc[(Ttrain.Age > 0) & (Ttrain.Age <= 18), 'Age_Group'] = 'Young'
Ttrain.loc[(Ttrain.Age > 18) & (Ttrain.Age <= 40), 'Age_Group'] = 'Adult'
Ttrain.loc[(Ttrain.Age > 40) & (Ttrain.Age <= 60), 'Age_Group'] = 'Middle_Aged'
Ttrain.loc[(Ttrain.Age > 60), 'Age_Group'] = 'Old'
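If your groups are always contiguous ranges like these, pd.cut is another option that does the binning in one call. This is a sketch of an alternative, not what the masks above do. The bin edges are my own assumption, chosen to reproduce the same groups for non-negative ages. Note that, unlike the masks, the first bin would also label an age between -1 and 0 as Missing, and the resulting column has a categorical dtype rather than plain strings.

```python
import pandas as pd

# same dummy dataset as above
Ttrain = pd.DataFrame({'Age': range(0, 80, 10)})

# Intervals are closed on the right by default:
# (-1, 0], (0, 18], (18, 40], (40, 60], (60, inf]
bins = [-1, 0, 18, 40, 60, float('inf')]
labels = ['Missing', 'Young', 'Adult', 'Middle_Aged', 'Old']

Ttrain['Age_Group'] = pd.cut(Ttrain['Age'], bins=bins, labels=labels)
```

For this dummy dataset it assigns the same group to every row as the mask version does.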
Final output
This is what the final dataframe looks like:
>>> print(Ttrain)
   Age    Age_Group
0    0      Missing
1   10        Young
2   20        Adult
3   30        Adult
4   40        Adult
5   50  Middle_Aged
6   60  Middle_Aged
7   70          Old
Good luck.