When you create a DataFrame the way you did, passing a plain list as the index, the index name is always None (a NoneType object, not a string). The index gets a name only when the object you pass as the index carries one, for example a pd.Series whose name is set.
import pandas as pd

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
print(df.index.name) # -> None
# You need to specify name otherwise it will default to None, <class NoneType>
index = pd.Series(data=[1,1,1,2,3,3], name='INDEX_NAME')
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)
print(df.index.name) # -> 'INDEX_NAME'
Now, back to your question. When you create a DataFrame from a CSV and specify index_col, the header of that column becomes the index name. If that header is an empty string in the CSV, the index has no name: it is None. If you do not specify index_col at all, the index name is None as well. Note that None is not a string; its type is <class 'NoneType'>.
Example:
import io

csv_string = ',A,B,C\n0,1,2,3\n1,4,5,6\n2,7,8,9'
# Without specifying 'index_col' parameter
df = pd.read_csv(io.StringIO(csv_string))
print(df)
'''
Output:
   Unnamed: 0  A  B  C
0           0  1  2  3
1           1  4  5  6
2           2  7  8  9
'''
print(type(df.index.name)) # <class 'NoneType'>
# By specifying index_col
df = pd.read_csv(io.StringIO(csv_string), index_col=0)
print(df)
'''
Output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
'''
print(type(df.index.name)) # <class 'NoneType'>
# This is because in the first column, on the first row, there is an empty string
# Let's change that to a non-empty string
csv_string = 'index,A,B,C\n0,1,2,3\n1,4,5,6\n2,7,8,9'
df = pd.read_csv(io.StringIO(csv_string), index_col=0)
print(df)
'''
Output:
       A  B  C
index
0      1  2  3
1      4  5  6
2      7  8  9
'''
print(df.index.name, type(df.index.name)) # index <class 'str'>
So whether you build the DataFrame the way you did or read it from a CSV as in the example above, you can always tell what the index name will be.
How to do what you want when there is no index name:
- first method (and probably the best)
index = pd.Series(data=[1,1,1,2,3,3])
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)
modified_df = df.reset_index().drop_duplicates(['index', 'code']).set_index('index')
This is similar to your approach. It works because, when the index has no name, .reset_index() names the resulting column 'index'. .reset_index() also has an inplace parameter in case you want to modify the original variable df instead of getting a copy back.
- second method
index = pd.Series(data=[1,1,1,2,3,3])
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)
modified_df = df.reset_index().drop_duplicates(['index', 'code'])
modified_df.index = modified_df['index']
modified_df = modified_df.drop(columns=['index'])
Similarly, the .drop() method has an inplace parameter in case you want to modify the original. With inplace=True the method returns None; otherwise it returns a modified copy. So do not assign the return value to anything when you use inplace=True.
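To make the inplace behaviour concrete, here is a minimal sketch. The small DataFrame is made up for illustration only; it just needs a column called 'index' to drop.

```python
import pandas as pd

# Illustrative data: a frame that still carries an 'index' column
df = pd.DataFrame({'index': [1, 1, 1], 'code': ['A', 'A', 'B']})

# Without inplace: df is left untouched and a new DataFrame is returned
dropped = df.drop(columns=['index'])

# With inplace=True: df itself is modified and the call returns None
result = df.drop(columns=['index'], inplace=True)

print(list(dropped.columns))  # ['code']
print(result)                 # None
print(list(df.columns))       # ['code']
```

This is why something like df = df.drop(..., inplace=True) leaves you with df set to None.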
Note:
After either method, the resulting DataFrame's index name is 'index', even if the original index had no name. If you don't want an index name, you can simply assign None (the object, not the string 'None') to it.
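As a short sketch of that last step, reusing the data from your question, the name picked up from .reset_index() can be cleared again like this:

```python
import pandas as pd

df = pd.DataFrame({'code': ['A', 'A', 'B', 'C', 'D', 'A']},
                  index=[1, 1, 1, 2, 3, 3])

# Deduplicate on the (index, code) pair; the unnamed index becomes the 'index' column
modified_df = (
    df.reset_index()
      .drop_duplicates(subset=['index', 'code'])
      .set_index('index')
)
print(modified_df.index.name)  # index

# Clear the name again: assign the None object, not the string 'None'
modified_df.index.name = None
print(modified_df.index.name)  # None
```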