1

Please note I have already reviewed this link

Pandas and python: deduplication of dataset by several fields*

Update 18-Jul: My perspective is that all of these solutions point to just avoiding indices until after all the de-duplication has performed. Thank you to all who have replied so far.*

I would like to have only one unique value of field code per value of id.

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
df.index.name='id'

df:

id code
1 A
1 A
1 B
2 C
3 D
3 A

My desired output is:

id code
1 A
1 B
2 C
3 D
3 A

I managed to accomplish this as follows, but I don't love it.

i=df.index.name
df.reset_index().drop_duplicates().set_index(i)

Here's why:

  • This will fail if the index has no name
  • I shouldn't need to re-set and set an index
  • This is a fairly common operation, and there is way too much ink here.

What I want to say is:

df.groupby('id').drop_duplicates()

Which is, currently, not supported.

Is there a more Pythonic way to do this?

Michael Tuchman
  • 332
  • 3
  • 12
  • 1
    `df.drop_duplictes(['A','B'])`? – Quang Hoang Jul 14 '23 at 14:37
  • If this is your answer, make it a suggested answer, not a comment. This will help other users. Also, check the spelling of all method names. It's not at all clear to me how the symbols 'A' and 'B' in your answer relate to the symbols in my answer 'ID' and 'code'.. Is 'A' allowed to be an index? That said, your answer didn't work: df.drop_duplicates(['id','code']) gives this error: KeyError: Index(['id'], dtype='object') – Michael Tuchman Jul 14 '23 at 16:25
  • 2
    Is there a reason you need 'id' to be the index? I would normally keep the index unique, and then, as you've discovered, using 'id' as a normal column is easier – AFRobertson Jul 16 '23 at 19:51
  • @AFRobertson : tbh, not a hardcore requirement; I like how the data set displays when `id` is the index. – Michael Tuchman Jul 16 '23 at 21:39
  • I realized, after reading the pandas 2.0.3 documentation for value_counts, that this would also work: df.groupby('id').value_counts().reset_index('code').drop('count',axis=1) The main advantage to me is that it uses groupby. – Michael Tuchman Jul 18 '23 at 20:51

6 Answers6

3

To effectively drop duplicates using .groupby, it's sufficient to specify that only the first row should be kept:

from pandas import DataFrame

df = DataFrame({"code": ["A", "A", "B", "C", "D", "A"]}, index=[1, 1, 1, 2, 3, 3])
deduped = df.groupby(by=["code", df.index]).head(1)
print(deduped)
#   code
# 1    A
# 1    B
# 2    C
# 3    D
# 3    A

This answer is based on this answer, which also suggests a couple of additional alternatives.

SultanOrazbayev
  • 14,900
  • 3
  • 16
  • 46
2

When you create a DataFrame like you did, assigning a list to an index, the name of the index will always be None, a <class NoneType> object. The only time the name of the index will be different is if you assign a pd.Series object to an index, whose name is different from "index".

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
print(df.index.name) # -> 'None'

# You need to specify name otherwise it will default to None, <class NoneType>
index = pd.Series(data=[1,1,1,2,3,3], name='INDEX_NAME')
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)
print(df.index.name) # -> 'INDEX_NAME'

Now to get back to your question, when you create a DataFrame from a csv, you specify an index_col, if that index_col has a name, that will be the index name. There might be no name, just an empty string in a csv, then it will have no name, it will be None. If you do not specify the 'index_col', there will be no name again, it will be None, and None is not a string, it is <class 'NoneType'>'

Example:

csv_string = ',A,B,C\n0,1,2,3\n1,4,5,6\n2,7,8,9'

# Without specifying 'index_col' parameter
df = pd.read_csv(io.StringIO(csv_string))
print(df)
'''
Output:

   Unnamed: 0  A  B  C
0           0  1  2  3
1           1  4  5  6
2           2  7  8  9
'''
print(type(df.index.name)) # <class 'NoneType'>

# By specifying index_col
df = pd.read_csv(io.StringIO(csv_string), index_col=0)
print(df)
'''
Output:

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
'''
print(type(df.index.name)) # <class 'NoneType'>
# This is because in the first column, on the first row, there is an empty string

# Let's change that to a non-empty string
csv_string = 'index,A,B,C\n0,1,2,3\n1,4,5,6\n2,7,8,9'

df = pd.read_csv(io.StringIO(csv_string), index_col=0)
print(df)
'''
Output:

       A  B  C
index         
0      1  2  3
1      4  5  6
2      7  8  9
'''
print(df.index.name, type(df.index.name)) # index <class 'str'>

When you create a DataFrame like you did, or like the example I showed, you will always know the name of the index.


How to do what you want when there is no index name:

  • first method (and probably the best)
index = pd.Series(data=[1,1,1,2,3,3])
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)

modified_df = df.reset_index().drop_duplicates(['index', 'code']).set_index('index')

Similar to yours, works because if there is no name, the .reset_index() method will name the column 'index'. There is also inplace parameter in case you want to modify the original variable df instead of returning a copy.

  • second method
index = pd.Series(data=[1,1,1,2,3,3])
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)

modified_df = df.reset_index().drop_duplicates(['index', 'code'])
modified_df.index = modified_df['index']
modified_df = modified_df.drop(columns=['index'])

Similarly .drop() method has an inplace parameter incase you want to modify the original. If inplace is true, None is returned, otherwise the copy, so you should not assign the return value to anything when you use inplace parameter.


Note: After modifying the DataFrame as you want, the df.index.name will have a name even if the original didn't have, and it will be 'index'. You can assign 'None' value to the index name freely if you don't want an index name.

luzede
  • 165
  • 2
  • 7
  • the nice thing about your solution is that I don't have to name the index if I don't want to. That's what my question was actually about. I felt that having to name the index was a bit artificial. However, I still think the reset/drop/set pattern is a bit roundabout, but that's Python's responsibility, not yours :-) Thanks much! – Michael Tuchman Jul 16 '23 at 21:36
  • I am glad I helped, I was doing Data Science course this past week and from what I learned setting and resetting index is a big part of it because it helps in getting what you need with less code, because of the label based .loc() method. – luzede Jul 17 '23 at 04:52
2

Here is one way you can achieve the outcome:

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
df['id'] = df.reset_index()['index'].values
print(df)
#     code  id
#1    A   1
#1    A   1
#1    B   1
#2    C   2
#3    D   3
#3    A   3

output = df.groupby(by=['code','id']).max().reset_index()
print(output)

#  code  id
#0    A   1
#1    A   3
#2    B   1
#3    C   2
#4    D   3

gaurav
  • 109
  • 3
1

To add to your current approach:

  1. The unnamed index will be converted to column name "index" after reset_index
  2. As a second step, you can set the index as the first column

here's a sample:

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
df.reset_index()
df = df[~df.duplicated(keep="first")]
df = df.set_index(df.iloc[:,0])
#....or
df = df.set_index(df.columns.to_list[0])
mrtig
  • 2,217
  • 16
  • 26
1

This is a shorter version of the first option provided by @luzede:

from pandas import DataFrame

df = DataFrame({"id": [1, 1, 1, 2, 3, 3], "code": ["A", "A", "B", "C", "D", "A"]})
deduped = df.drop_duplicates(subset=["id", "code"])
print(deduped)
#   code  id
# 0    A   1
# 2    B   1
# 3    C   2
# 4    D   3
# 5    A   3

Note that for simplicity the dataframe is constructed using "id" as a separate column (this yields the same result as the index-based method in the code snippet in the question).

SultanOrazbayev
  • 14,900
  • 3
  • 16
  • 46
1
import pandas as pd

df = pd.DataFrame({'code': ['A', 'A', 'B', 'C', 'D', 'A']}, index=[1, 1, 1, 2, 3, 3])
df.index.name = 'id'
df = df.drop_duplicates(keep='first')

print(df)
Hasan Raza
  • 424
  • 3
  • 9