Deduplicate pandas dataset by index value without using `networkx`

Question

Please note I have already reviewed this link

Pandas and python: deduplication of dataset by several fields*

Update 18-Jul: My perspective is that all of these solutions point to just avoiding indices until after all the de-duplication has performed. Thank you to all who have replied so far.*

I would like to have only one unique value of field code per value of id.

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
df.index.name='id'

df:

id	code
1	A
1	A
1	B
2	C
3	D
3	A

My desired output is:

id	code
1	A
1	B
2	C
3	D
3	A

I managed to accomplish this as follows, but I don't love it.

i=df.index.name
df.reset_index().drop_duplicates().set_index(i)

Here's why:

This will fail if the index has no name
I shouldn't need to re-set and set an index
This is a fairly common operation, and there is way too much ink here.

What I want to say is:

df.groupby('id').drop_duplicates()

Which is, currently, not supported.

Is there a more Pythonic way to do this?

If this is your answer, make it a suggested answer, not a comment. This will help other users. Also, check the spelling of all method names. It's not at all clear to me how the symbols 'A' and 'B' in your answer relate to the symbols in my answer 'ID' and 'code'.. Is 'A' allowed to be an index? That said, your answer didn't work: df.drop_duplicates(['id','code']) gives this error: KeyError: Index(['id'], dtype='object') — Michael Tuchman, Jul 14 '23 at 16:25
Is there a reason you need 'id' to be the index? I would normally keep the index unique, and then, as you've discovered, using 'id' as a normal column is easier — AFRobertson, Jul 16 '23 at 19:51
@AFRobertson : tbh, not a hardcore requirement; I like how the data set displays when `id` is the index. — Michael Tuchman, Jul 16 '23 at 21:39
I realized, after reading the pandas 2.0.3 documentation for value_counts, that this would also work: df.groupby('id').value_counts().reset_index('code').drop('count',axis=1) The main advantage to me is that it uses groupby. — Michael Tuchman, Jul 18 '23 at 20:51

score 3 · Answer 1 · answered Jul 18 '23 at 11:25

3

To effectively drop duplicates using .groupby, it's sufficient to specify that only the first row should be kept:

from pandas import DataFrame

df = DataFrame({"code": ["A", "A", "B", "C", "D", "A"]}, index=[1, 1, 1, 2, 3, 3])
deduped = df.groupby(by=["code", df.index]).head(1)
print(deduped)
#   code
# 1    A
# 1    B
# 2    C
# 3    D
# 3    A

This answer is based on this answer, which also suggests a couple of additional alternatives.

answered Jul 18 '23 at 11:25

SultanOrazbayev

14,900
3
16
46

1

I'm not quite sure why my search did not uncover that. I appreciate your answer as well as your bringing that to my attention. Admittedly, this makes my question a duplicate (meta-irony!), but I can't delete it as iI have bountied it – Michael Tuchman Jul 18 '23 at 20:54
1

@sultanrorazbayev : group...head is the solution that I was hoping to find. – Michael Tuchman Jul 18 '23 at 20:56
1

It's an old question, so perhaps it's not prioritized in search result listings... but glad it worked for you! – SultanOrazbayev Jul 18 '23 at 23:52
1

"It's an old code, but it checks out" :-) – Michael Tuchman Jul 19 '23 at 19:59

score 2 · Accepted Answer · answered Jul 16 '23 at 20:49

When you create a DataFrame like you did, assigning a list to an index, the name of the index will always be None, a <class NoneType> object. The only time the name of the index will be different is if you assign a pd.Series object to an index, whose name is different from "index".

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
print(df.index.name) # -> 'None'

# You need to specify name otherwise it will default to None, <class NoneType>
index = pd.Series(data=[1,1,1,2,3,3], name='INDEX_NAME')
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)
print(df.index.name) # -> 'INDEX_NAME'

Now to get back to your question, when you create a DataFrame from a csv, you specify an index_col, if that index_col has a name, that will be the index name. There might be no name, just an empty string in a csv, then it will have no name, it will be None. If you do not specify the 'index_col', there will be no name again, it will be None, and None is not a string, it is <class 'NoneType'>'

Example:

csv_string = ',A,B,C\n0,1,2,3\n1,4,5,6\n2,7,8,9'

# Without specifying 'index_col' parameter
df = pd.read_csv(io.StringIO(csv_string))
print(df)
'''
Output:

   Unnamed: 0  A  B  C
0           0  1  2  3
1           1  4  5  6
2           2  7  8  9
'''
print(type(df.index.name)) # <class 'NoneType'>

# By specifying index_col
df = pd.read_csv(io.StringIO(csv_string), index_col=0)
print(df)
'''
Output:

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
'''
print(type(df.index.name)) # <class 'NoneType'>
# This is because in the first column, on the first row, there is an empty string

# Let's change that to a non-empty string
csv_string = 'index,A,B,C\n0,1,2,3\n1,4,5,6\n2,7,8,9'

df = pd.read_csv(io.StringIO(csv_string), index_col=0)
print(df)
'''
Output:

       A  B  C
index         
0      1  2  3
1      4  5  6
2      7  8  9
'''
print(df.index.name, type(df.index.name)) # index <class 'str'>

When you create a DataFrame like you did, or like the example I showed, you will always know the name of the index.

How to do what you want when there is no index name:

first method (and probably the best)

index = pd.Series(data=[1,1,1,2,3,3])
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)

modified_df = df.reset_index().drop_duplicates(['index', 'code']).set_index('index')

Similar to yours, works because if there is no name, the .reset_index() method will name the column 'index'. There is also inplace parameter in case you want to modify the original variable df instead of returning a copy.

second method

index = pd.Series(data=[1,1,1,2,3,3])
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)

modified_df = df.reset_index().drop_duplicates(['index', 'code'])
modified_df.index = modified_df['index']
modified_df = modified_df.drop(columns=['index'])

Similarly .drop() method has an inplace parameter incase you want to modify the original. If inplace is true, None is returned, otherwise the copy, so you should not assign the return value to anything when you use inplace parameter.

Note: After modifying the DataFrame as you want, the df.index.name will have a name even if the original didn't have, and it will be 'index'. You can assign 'None' value to the index name freely if you don't want an index name.

the nice thing about your solution is that I don't have to name the index if I don't want to. That's what my question was actually about. I felt that having to name the index was a bit artificial. However, I still think the reset/drop/set pattern is a bit roundabout, but that's Python's responsibility, not yours :-) Thanks much! — Michael Tuchman, Jul 16 '23 at 21:36
I am glad I helped, I was doing Data Science course this past week and from what I learned setting and resetting index is a big part of it because it helps in getting what you need with less code, because of the label based .loc() method. — luzede, Jul 17 '23 at 04:52

gaurav · Answer 3 · 2023-07-18T05:12:26.137

Here is one way you can achieve the outcome:

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
df['id'] = df.reset_index()['index'].values
print(df)
#     code  id
#1    A   1
#1    A   1
#1    B   1
#2    C   2
#3    D   3
#3    A   3

output = df.groupby(by=['code','id']).max().reset_index()
print(output)

#  code  id
#0    A   1
#1    A   3
#2    B   1
#3    C   2
#4    D   3

mrtig · Answer 4 · 2023-07-16T20:23:01.510

1

To add to your current approach:

The unnamed index will be converted to column name "index" after reset_index
As a second step, you can set the index as the first column

here's a sample:

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
df.reset_index()
df = df[~df.duplicated(keep="first")]
df = df.set_index(df.iloc[:,0])
#....or
df = df.set_index(df.columns.to_list[0])

edited Jul 16 '23 at 20:23

answered Jul 16 '23 at 20:00

mrtig

2,217
16
26

score 1 · Answer 5 · answered Jul 17 '23 at 05:07

1

This is a shorter version of the first option provided by @luzede:

from pandas import DataFrame

df = DataFrame({"id": [1, 1, 1, 2, 3, 3], "code": ["A", "A", "B", "C", "D", "A"]})
deduped = df.drop_duplicates(subset=["id", "code"])
print(deduped)
#   code  id
# 0    A   1
# 2    B   1
# 3    C   2
# 4    D   3
# 5    A   3

Note that for simplicity the dataframe is constructed using "id" as a separate column (this yields the same result as the index-based method in the code snippet in the question).

answered Jul 17 '23 at 05:07

SultanOrazbayev

14,900
3
16
46

This works. All of these solutions point to just avoiding indices until after the de-duplication has performed – Michael Tuchman Jul 18 '23 at 11:08
I see, take a look at another answer I submitted which might be closer to what you are after. – SultanOrazbayev Jul 18 '23 at 11:26

Hasan Raza · Answer 6 · 2023-07-18T11:07:56.600

1

import pandas as pd

df = pd.DataFrame({'code': ['A', 'A', 'B', 'C', 'D', 'A']}, index=[1, 1, 1, 2, 3, 3])
df.index.name = 'id'
df = df.drop_duplicates(keep='first')

print(df)

edited Jul 18 '23 at 11:07

answered Jul 18 '23 at 06:24

Hasan Raza

424
3
9

Did you run this code? When I ran it, it dropped the ID=3, code=A line, which wouldn't solve my problem – Michael Tuchman Jul 18 '23 at 11:03
1

you can use the keep parameter while dropping only duplicate id – Hasan Raza Jul 18 '23 at 11:08

Deduplicate pandas dataset by index value without using `networkx`

6 Answers6

Linked