
I have a fairly simple need that has come up in a couple of other posts, but I'm not sure whether the better approach is groupby or the duplicated method.

I have what I need below using duplicated, except that the first duplicate is flagged as FALSE instead of TRUE. I need all duplicates flagged as TRUE.

My goal is to concatenate data from two columns when a row is a duplicate, and otherwise leave the data as-is.

Sample Input:

ID  File Name
1   Text.csv
2   TEXT.csv
3   unique.csv
4   unique2.csv
5   text.csv

Desired Output:

ID  File Name   LowerFileName   Duplicate   UniqueFileName
1   Text.csv    text.csv    TRUE    1Text.csv
2   TEXT.csv    text.csv    TRUE    2TEXT.csv
3   unique.csv  unique.csv  FALSE   unique.csv
4   unique2.csv unique2.csv FALSE   unique2.csv
5   text.csv    text.csv    TRUE    5text.csv


df_attachment = pd.read_csv("Attachment.csv")
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()
df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName')
#This syntax is incorrect 
df_attachment['UniqueFileName'] = np.where(df_attachment['Duplicate']=='TRUE', pd.concat(df_attachment['ID'],df_attachment['File Name']), df_attachment['File Name'))
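For what it's worth, that last line fails for two separate reasons: `Duplicate` holds booleans, so comparing it to the string `'TRUE'` never matches, and `pd.concat` joins DataFrames/Series rather than concatenating strings row-wise. A minimal sketch of what the line is trying to express (assuming pandas >= 0.17 for `keep=False`):

```python
import numpy as np
import pandas as pd

df_attachment = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'File Name': ['Text.csv', 'TEXT.csv', 'unique.csv', 'unique2.csv', 'text.csv'],
})
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()
# keep=False flags every member of a duplicate group, including the first
df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName', keep=False)
df_attachment['UniqueFileName'] = np.where(
    df_attachment['Duplicate'],                                    # boolean mask, not 'TRUE'
    df_attachment['ID'].astype(str) + df_attachment['File Name'],  # string concat, not pd.concat
    df_attachment['File Name'],
)
```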

4 Answers


The easiest way to "get around" this odd Pandas functionality is to generate a mask using df.duplicated(col_name) | df.duplicated(col_name, take_last=True). The bitwise or means that the series you generate is True for all duplicates.

Follow this by using the mask to index into the DataFrame and set each value to either the original name or a new name with the ID number in front.

In your case below:

# Generating your DataFrame
df_attachment = pd.DataFrame(index=range(5))
df_attachment['ID'] = [1, 2, 3, 4, 5]
df_attachment['File Name'] = ['Text.csv', 'TEXT.csv', 'unique.csv',
                             'unique2.csv', 'text.csv']
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()


# Answer from here, mask generation over two lines for readability
mask = df_attachment.duplicated('LowerFileName')
mask = mask | df_attachment.duplicated('LowerFileName', take_last=True)
df_attachment['Duplicate'] = mask

# New column names if possible
df_attachment['number_name'] = df_attachment['ID'].astype(str) + df_attachment['File Name']

# Set the final unique name column using the mask already generated
df_attachment.loc[mask, 'UniqueFileName'] = df_attachment.loc[mask, 'number_name']
df_attachment.loc[~mask, 'UniqueFileName'] = df_attachment.loc[~mask, 'File Name']

# Drop the intermediate column used
del df_attachment['number_name']

And the final df_attachment:

    ID  File Name   LowerFileName   Duplicate   UniqueFileName
0   1   Text.csv    text.csv    True    1Text.csv
1   2   TEXT.csv    text.csv    True    2TEXT.csv
2   3   unique.csv  unique.csv  False   unique.csv
3   4   unique2.csv unique2.csv False   unique2.csv
4   5   text.csv    text.csv    True    5text.csv

This method uses vectorised pandas operations and indexing so should be quick for any size DataFrame.

EDIT: 2017-03-28

Someone gave this a vote yesterday so I thought I would edit this to say that this has been supported natively by pandas since 0.17.0, see the changes here: http://pandas.pydata.org/pandas-docs/version/0.19.2/whatsnew.html#v0-17-0-october-9-2015

Now you can use the keep argument of drop_duplicates and duplicated and set it to False to mark all duplicates: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html

So the line above generating the Duplicate column becomes:

df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName', keep=False)
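The difference between the default and keep=False is easiest to see on a small Series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a'])
default = s.duplicated()              # keep='first': the first 'a' is not flagged
all_dupes = s.duplicated(keep=False)  # every 'a' is flagged
print(default.tolist())               # [False, False, True, True]
print(all_dupes.tolist())             # [True, False, True, True]
```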

bastewart

For your use case you'll need to use groupby:

dupes = df_attachment.groupby('LowerFileName').ID.count() > 1
dupes.name = 'Duplicate'
# merge the duplicate flag into the original dataframe on the common column 'LowerFileName'
df_attachment = pd.merge(df_attachment, dupes.reset_index())
maxymoo

Perhaps the use of groupby together with a lambda expression can achieve your objective:

gb = df.groupby('Lower File Name')['Lower File Name'].count()
duplicates = gb[gb > 1].index.tolist()
df['UniqueFileName'] = \
    df.apply(lambda x: '{0}{1}'.format(x.ID if x['Lower File Name'] in duplicates
                                       else "", x['File Name']), axis=1)

>>> df
   ID    File Name Lower File Name Duplicate   UniqueFileName
0   1     Text.csv        text.csv     False        1Text.csv
1   2     TEXT.csv        text.csv      True        2TEXT.csv
2   3   unique.csv      unique.csv     False      3unique.csv
3   4  unique2.csv     unique2.csv     False      unique2.csv
4   5     text.csv        text.csv      True        5text.csv
5   6   uniquE.csv      unique.csv      True      6uniquE.csv

The lambda expression generates a unique filename per the OP's requirements by prepending File Name with the relevant ID only in the event that the Lower File Name is duplicated (i.e. there is more than one file with the same lower-case file name). Otherwise, it just uses the original File Name without an ID.

Note that this solution does not use the Duplicate column in the above DataFrame.

Also, wouldn't it be simpler to just prepend the ID to the File Name in every row to generate a unique name? You wouldn't need the solution above and wouldn't even need to check for duplicates, assuming the ID is unique.
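That simpler alternative is a one-liner, sketched here on a small frame (assuming ID is unique):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'File Name': ['Text.csv', 'TEXT.csv', 'unique.csv']})
# unconditionally prepend the unique ID, so no duplicate check is needed
df['UniqueFileName'] = df['ID'].astype(str) + df['File Name']
```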

Alexander
  • Thanks Alexander! I had to make one modification as it was appending "None" to the non-duplicate file names: `df.apply(lambda x: '{0}{1}'.format(x.ID if x['Lower File Name'] in duplicates else '', x['File Name']), axis=1)` And yes, it would have been much easier to just append the ID in all cases but unfortunately had to match another file in that format. – EMC Jun 24 '15 at 18:59
  • This is a bit tricky from a readability perspective. It's a bit hard to tell what's going on. – LondonRob Jun 24 '15 at 19:44

Inspired by this answer you could do (assuming your File Name column is renamed to file_name):

df['unique_name'] = df.file_name
# compare on the lower-cased names, and use keep=False so the first
# occurrence of each duplicate is flagged too
lower = df.file_name.str.lower()
dupes = lower[lower.duplicated(keep=False)]
unique_names = df.ID.astype(str) + df.file_name
df.loc[lower.isin(dupes), 'unique_name'] = unique_names

Which gives you:

   ID    File Name  unique_name
0   1     Text.csv    1Text.csv
1   2     TEXT.csv    2TEXT.csv
2   3   unique.csv   unique.csv
3   4  unique2.csv  unique2.csv
4   5     text.csv    5text.csv
LondonRob