
I would like to count duplicated files across different folders. The files have different names, so I use each file's checksum to track duplicates. Here is the sample data:

import pandas as pd

df = pd.DataFrame({
    'folder': ['d1', 'd1', 'd1', 'd2', 'd2', 'd2', 'd2', 'd2'],
    'checksum': ['x0', 'x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x3']})

df.head(10)

folder  checksum
d1      x0
d1      x1
d1      x1
d2      x1
d2      x2
d2      x2
d2      x2
d2      x3

Here is my solution, but I don't like it because it's too long. Does anyone know a better way to achieve the same result, perhaps using something like pandas.pivot_table or pandas.crosstab? Thanks

def count_duplicated_checksum_among_folders(df):
    folders = df.folder.unique()
    result = []
    for fol1 in folders:
        row = []
        # rows in fol1 whose checksum appears more than once anywhere
        selected1 = df[(df.folder == fol1) & df.checksum.duplicated(keep=False)]
        dup_checksum = selected1.checksum.unique()
        for fol2 in folders:
            if fol1 == fol2:
                # diagonal: rows whose checksum repeats within the folder itself
                total_dup = selected1.checksum.duplicated(keep=False).sum()
            else:
                # off-diagonal: distinct checksums shared with the other folder
                selected2 = df[(df.folder == fol2) & df.checksum.isin(dup_checksum)]
                total_dup = selected2.checksum.unique().size
            row.append(total_dup)
        result.append(row)
    return pd.DataFrame(result, columns=folders, index=folders)

Expected output

output = count_duplicated_checksum_among_folders(df)

    d1  d2
d1  2   1
d2  1   3

Since there are 2 folders in this example, the expected output is a 2x2 matrix:

[[A, B]
 [C, D]]

A = total duplicates within folder d1
D = total duplicates within folder d2 
B = C = total duplicates that exist in both folders d1 and d2

An ideal solution must be able to handle any number of folders. Here is another dataset with 5 folders; the expected answer is given below as well.

folder = 'd1 d1 d1 d2 d2 d2 d2 d2 d3 d3 d1 d4 d1 d5'.split()
checksum = 'x0 x1 x1 x1 x2 x2 x2 x3 x4 x4 x4 x5 x5 x6'.split()
df2 = pd.DataFrame(dict(folder=folder, checksum=checksum))

    d1  d2  d3  d4  d5
d1  2   1   1   1   0
d2  1   3   0   0   0
d3  1   0   2   0   0
d4  1   0   0   0   0
d5  0   0   0   0   0
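The definitions above can also be restated directly, which gives a compact reference implementation to check candidate answers against: the diagonal counts rows whose checksum repeats inside the folder itself, and each off-diagonal cell is just the number of distinct checksums two folders share (any checksum present in two folders is automatically duplicated). A sketch; the name `dup_matrix` is mine, but under these definitions it is equivalent to the function above:

```python
import pandas as pd

folder = 'd1 d1 d1 d2 d2 d2 d2 d2 d3 d3 d1 d4 d1 d5'.split()
checksum = 'x0 x1 x1 x1 x2 x2 x2 x3 x4 x4 x4 x5 x5 x6'.split()
df2 = pd.DataFrame(dict(folder=folder, checksum=checksum))

def dup_matrix(df):
    folders = df.folder.unique()
    # checksum set per folder; off-diagonal cells are intersection sizes
    sets = {f: set(g) for f, g in df.groupby('folder')['checksum']}
    out = pd.DataFrame(0, index=folders, columns=folders)
    for f1 in folders:
        for f2 in folders:
            if f1 == f2:
                # rows whose checksum repeats inside the folder itself
                grp = df.loc[df.folder == f1, 'checksum']
                out.loc[f1, f1] = grp.duplicated(keep=False).sum()
            else:
                out.loc[f1, f2] = len(sets[f1] & sets[f2])
    return out
```

This reproduces both expected matrices, so it doubles as a ground truth when comparing the answers.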

Scoodood
  • What's the relationship between x and d? Piece of advice: no one is likely to read your code. A concise description of the problem or a tabulation of the expected output attracts quick answers. – wwnde May 14 '20 at 23:06
  • @wwnde I think a working version of the code describes the desired results less ambiguously than outputs. Even if I don't "read" it, I might want to use it to check my solution against. – chthonicdaemon May 16 '20 at 11:46
  • @wwnde, x is the checksum of the file. If the content of the file is different or altered, its checksum will change as well. Files are copied to different folders, so duplication exists as a result. The goal of this function is to track the count of duplicated files and their folder locations. – Scoodood May 16 '20 at 19:56

3 Answers


If you want to check for duplicates and see which files are duplicated in the given dataframe, you can use groupby.

In [2]:
df.groupby(df.columns.tolist(),as_index=False).size()
Out[2]:
folder  checksum
d1      x0          1
        x1          2
d2      x1          1
        x2          3
        x3          1
dtype: int64

The code above groups by all of the columns (in your case folder and checksum) and then counts the rows in each group. It shows that checksum x1 appears twice in d1 and x2 three times in d2.
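To go from the full listing to just the duplicates, the same size Series can be filtered, which keeps the output short even with many files. A small sketch on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'folder': ['d1', 'd1', 'd1', 'd2', 'd2', 'd2', 'd2', 'd2'],
    'checksum': ['x0', 'x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x3']})

# per-(folder, checksum) row counts, keeping only groups seen twice or more
sizes = df.groupby(['folder', 'checksum']).size()
dups = sizes[sizes.ge(2)]
print(dups)
```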

Update 2:

If you need the pairwise counts, then:

counter = df.groupby(df.columns.tolist()).size().unstack(fill_value=0)
counter_left = counter.ge(2, axis=0).dot(counter.T)
counter_right = counter.dot(counter.ge(2, axis=1).T)

# Now we merge left and right
final = counter_left.add(counter_right)
# and halve the diagonal elements, which were counted twice
for diag in final:
    final.loc[diag, diag] //= 2
print(final)

Output:

folder  d1  d2
folder        
d1       2   1
d2       1   3

In counter_left, I first found all within-folder duplicates using counter.ge(2, axis=0), which outputs:

checksum     x0     x1     x2     x3
folder                              
d1        False   True  False  False
d2        False  False   True  False

and then taking the dot product with counter.T gives output :

folder  d1  d2
folder        
d1       2   1
d2       0   3

Doing the same for counter_right and merging the two gives the final output.
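Condensed into one runnable snippet (with the `counter_right` typo from the posted code fixed, and `as_index=False` dropped since it has no effect on `size()` here), the approach looks like this. One caveat, as the later comments note: it reproduces the 2x2 expected output, but on the five-folder dataset it differs from the expected answer, because a checksum such as x5 that is shared across folders yet never repeated inside a single folder is not caught by `ge(2)`:

```python
import pandas as pd

df = pd.DataFrame({
    'folder': ['d1', 'd1', 'd1', 'd2', 'd2', 'd2', 'd2', 'd2'],
    'checksum': ['x0', 'x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x3']})

# folder x checksum occurrence counts
counter = df.groupby(['folder', 'checksum']).size().unstack(fill_value=0)

# within-folder duplicates (count >= 2) matched against every folder's files
counter_left = counter.ge(2).dot(counter.T)
counter_right = counter.dot(counter.ge(2).T)

final = counter_left.add(counter_right)
# the diagonal was counted by both sides, so halve it
for diag in final:
    final.loc[diag, diag] //= 2
```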

  • Interesting suggestion. I have more than 10k files; I believe your output listing is going to be extremely long due to the `checksum` column, and it is very hard to keep track of the count of duplicates in each folder. But thanks for your suggestion. – Scoodood May 15 '20 at 06:17
  • Hi @ScoodoodC, I tried an approach on it and it tells you the count, I have edited the answer to include that update where you get the duplicate count. Is the new update helpful? –  May 15 '20 at 22:53
  • If this answer solved your problem, could you please mark it as accepted by clicking the check mark on its side? Thank you –  May 15 '20 at 23:07
  • hi @overflow95, thanks for your update. However, it still doesn't achieve my goal. My expected answer is a 2x2 matrix [[A, B][C, D]]. A = total duplicates only found in d1; D = total duplicates only found in d2; B = C = duplicates that exist in both d1 and d2. Perhaps my question was not clear; I have clarified it. Thanks again. – Scoodood May 16 '20 at 10:31
  • Made some final changes according to your new edit. Hope it helps. –  May 16 '20 at 14:37
  • hi @overflow95, I have updated my final test case to use 5 folders, but your answer is different. – Scoodood May 16 '20 at 22:05

Use DataFrame.merge with DataFrame.pivot_table and Series.nunique:

df1 = (df.merge(df, on='checksum')
         .pivot_table(index='folder_x',
                     columns='folder_y', 
                     values='checksum', 
                     aggfunc='nunique',
                     fill_value=0)
         .rename_axis(index=None, columns=None))
print (df1)
    d1  d2  d3  d4  d5
d1   4   1   1   1   0
d2   1   3   0   0   0
d3   1   0   1   0   0
d4   1   0   0   1   0
d5   0   0   0   0   1

And then set the diagonal to the number of total duplicated values per folder:

s = (df[df.duplicated(['checksum','folder'], keep=False)]
      .groupby('folder')
      .size()
      .reindex(df['folder'].unique(), fill_value=0))
print (s)
folder
d1    2
d2    3
d3    2
d4    0
d5    0
dtype: int64

import numpy as np

# https://stackoverflow.com/a/24475214/2901002
np.fill_diagonal(df1.values, s)
print (df1)
    d1  d2  d3  d4  d5
d1   2   1   1   1   0
d2   1   3   0   0   0
d3   1   0   2   0   0
d4   1   0   0   0   0
d5   0   0   0   0   0
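The steps above fold naturally into a single reusable function. A sketch (the helper name `count_duplicates_merge` is mine, not from the answer):

```python
import numpy as np
import pandas as pd

folder = 'd1 d1 d1 d2 d2 d2 d2 d2 d3 d3 d1 d4 d1 d5'.split()
checksum = 'x0 x1 x1 x1 x2 x2 x2 x3 x4 x4 x4 x5 x5 x6'.split()
df2 = pd.DataFrame(dict(folder=folder, checksum=checksum))

def count_duplicates_merge(df):
    # shared distinct checksums for every folder pair (self-merge on checksum)
    out = (df.merge(df, on='checksum')
             .pivot_table(index='folder_x', columns='folder_y',
                          values='checksum', aggfunc='nunique', fill_value=0)
             .rename_axis(index=None, columns=None))
    # diagonal: rows whose checksum repeats within the same folder
    diag = (df[df.duplicated(['checksum', 'folder'], keep=False)]
              .groupby('folder').size()
              .reindex(out.index, fill_value=0))
    np.fill_diagonal(out.values, diag)
    return out
```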

Verify:

output = count_duplicated_checksum_among_folders(df)
print (output)
    d1  d2  d3  d4  d5
d1   2   1   1   1   0
d2   1   3   0   0   0
d3   1   0   2   0   0
d4   1   0   0   0   0
d5   0   0   0   0   0
jezrael
  • Your approach was very close to what I was looking for. It's very readable. But what if there are n folders; can your code handle an arbitrary number of folders? Thanks! – Scoodood May 16 '20 at 20:06
  • @Scoodood I tested it and compared with your solution; the output is different. I think the difference comes from how identical duplicated values are processed; I have no idea what the expected output is with more folders. So I suggest creating a data sample with more folders, and if my solution doesn't work, I can try to change it. Thanks. – jezrael May 16 '20 at 20:50
  • dear @jezrael, I have updated my original question to include a dataset with more folders. I really like your answer; however, your code produced a different answer than mine. Something is missing in your code. – Scoodood May 16 '20 at 20:52
  • Very interesting solution, very pythonic! Just love it!! Thanks @jezrael – Scoodood May 17 '20 at 06:16

Try:

df = (df.assign(i=1)
        .pivot_table(columns="folder", index="checksum",
                     aggfunc="sum", values="i")
        .fillna(0))

l = df.gt(1).T @ df

r = df.T @ df.gt(1)

res = pd.DataFrame([], columns=df.columns, index=df.columns)

res[:] = l + r

for col in res:
    res.loc[col, col] //= 2

Outputs:

folder  d1  d2
folder
d1       2   1
d2       1   3
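A condensed, runnable version of the same snippet (renaming the pivoted table to `piv` so it doesn't shadow the input frame). Like the other `ge`-based answer above, it matches the 2x2 expected output but not the five-folder one, since checksums duplicated only across folders, such as x5, never pass the `gt(1)` test:

```python
import pandas as pd

df = pd.DataFrame({
    'folder': ['d1', 'd1', 'd1', 'd2', 'd2', 'd2', 'd2', 'd2'],
    'checksum': ['x0', 'x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x3']})

# checksum x folder occurrence counts
piv = (df.assign(i=1)
         .pivot_table(columns='folder', index='checksum',
                      aggfunc='sum', values='i')
         .fillna(0))

l = piv.gt(1).T @ piv   # within-folder duplicates vs. every folder
r = piv.T @ piv.gt(1)   # the mirror image

res = l + r
for col in res:
    res.loc[col, col] //= 2   # the diagonal was counted by both sides
```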
Grzegorz Skibinski
  • Thanks for your suggestion. Your solution looks very clean as well. I have updated my final test case to use 5 folders, but your answer is different than the expected output. – Scoodood May 16 '20 at 22:07