I would like to count duplicated files across different folders. The files have different names, so I use each file's checksum to track duplicates. Here is the sample data:
import pandas as pd
df = pd.DataFrame({
'folder': ['d1', 'd1', 'd1', 'd2', 'd2', 'd2', 'd2', 'd2'],
'checksum': ['x0', 'x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x3']})
df.head(10)
  folder checksum
0     d1       x0
1     d1       x1
2     d1       x1
3     d2       x1
4     d2       x2
5     d2       x2
6     d2       x2
7     d2       x3
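As background for the approach, "duplicated" here means a checksum that occurs more than once anywhere in the table, which `Series.duplicated(keep=False)` captures directly. A minimal sketch on the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    'folder': ['d1', 'd1', 'd1', 'd2', 'd2', 'd2', 'd2', 'd2'],
    'checksum': ['x0', 'x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x3']})

# keep=False marks every occurrence of a repeated checksum,
# not just the second and later ones
dup_mask = df.checksum.duplicated(keep=False)
dup_checksums = df.checksum[dup_mask].unique()
print(dup_checksums)  # x0 and x3 occur once; x1 and x2 are the duplicates
```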
Here is my solution, but I don't like it because it's too long. Does anyone know a better way to achieve the same result, using something like pandas.pivot_table or pandas.crosstab? Thanks.
def count_duplicated_checksum_among_folders(df):
    folders = df.folder.unique()
    result = []
    for fol1 in folders:
        row = []
        # rows of fol1 whose checksum occurs more than once anywhere in df
        selected1 = df[(df.folder == fol1) & df.checksum.duplicated(keep=False)]
        dup_checksum = selected1.checksum.unique()
        for fol2 in folders:
            if fol1 == fol2:
                # diagonal: rows duplicated within the same folder
                total_dup = selected1.checksum.duplicated(keep=False).sum()
                row.append(total_dup)
            else:
                # off-diagonal: unique checksums shared between fol1 and fol2
                selected2 = df[(df.folder == fol2) & df.checksum.isin(dup_checksum)]
                total_dup = selected2.checksum.unique().size
                row.append(total_dup)
        result.append(row)
    return pd.DataFrame(result, columns=folders, index=folders)
Expected output
output = count_duplicated_checksum_among_folders(df)
output
    d1  d2
d1   2   1
d2   1   3
Since there are 2 folders in this example, the expected output is a 2x2 matrix:
[[A, B]
[C, D]]
A = total duplicates within folder d1
D = total duplicates within folder d2
B = C = total duplicates that exist in both folders d1 and d2
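For what it's worth, the A/B/C/D definition above maps fairly directly onto pd.crosstab. This is only a sketch of that direction, not a vetted replacement: the off-diagonal entries count checksums present in both folders (an inner product of presence indicators, since any checksum shared by two folders is by definition duplicated), and the diagonal counts rows whose checksum repeats within the same folder:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'folder': ['d1', 'd1', 'd1', 'd2', 'd2', 'd2', 'd2', 'd2'],
    'checksum': ['x0', 'x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x3']})

ct = pd.crosstab(df.folder, df.checksum)   # folder x checksum occurrence counts
present = (ct > 0).to_numpy().astype(int)  # 1 if a checksum appears in a folder

# Off-diagonal (B = C): number of checksums present in both folders
mat = present @ present.T
# Diagonal (A, D): rows whose checksum occurs more than once inside the folder
np.fill_diagonal(mat, ct.where(ct > 1).sum(axis=1).astype(int).to_numpy())

out = pd.DataFrame(mat, index=ct.index, columns=ct.index)
print(out)
```

On the sample data this reproduces the 2x2 matrix above, and because crosstab indexes every folder present in the data, the same code runs unchanged for any number of folders.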
An ideal solution must be able to handle any number of folders. Here is another dataset with 5 folders; the expected answer is given below as well.
folder = 'd1 d1 d1 d2 d2 d2 d2 d2 d3 d3 d1 d4 d1 d5'.split()
checksum = 'x0 x1 x1 x1 x2 x2 x2 x3 x4 x4 x4 x5 x5 x6'.split()
df2 = pd.DataFrame(dict(folder=folder, checksum=checksum))
    d1  d2  d3  d4  d5
d1   2   1   1   1   0
d2   1   3   0   0   0
d3   1   0   2   0   0
d4   1   0   0   0   0
d5   0   0   0   0   0