1

I have a dataframe analysis_df with the following structure:

...     FileName      UserSID   ImageSize   ImageChecksum
0       2197173372750839    0   17068032    11781483
1       5966634109289989    0   24576       42058
... ... ... ... ...
7500    6817023204572264    0   22000       123456
7501    6817023204572264    0   22000       123456

and need to create a new row that tells how many times each ImageChecksum repeats in the table. So I count them:

count_db = {}
for checksum in analysis_df['ImageChecksum']:
    checksum = str(checksum)
    if checksum in count_db:
        count_db[checksum] += 1
    else:
        count_db[checksum] = 1

print(f"count_db: {count_db}")

output:

count_db: {'11781483': 100, '42058': 100, '56817': 100, '491537': 100, '195631': 100, '146603': 100, '104915': 100, ... [snip] ..., '123456': 2}

So according to an answer to a question related, but not quite identical, I can do something similar like:

import pandas as pd
import numpy as np

df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])

df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal

But when I try to apply this solution to my own case, I get an error:

analysis_df['ImageChecksum_Count'] = count_db[str(analysis_df['ImageChecksum'])]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [22], in <cell line: 21>()
     17         count_db[checksum] = 1
     19 print(f"count_db: {count_db}")
---> 21 analysis_df['ImageChecksum_Count'] = count_db[str(analysis_df['ImageChecksum'])]
     23 analysis_df.head()

KeyError: '0       11781483\n1          42058\n2          56817\n3         491537\n4         195631\n          ...   \n7497      125321\n7498       57364\n7499           0\n7500      123456\n7501      123456\nName: ImageChecksum, Length: 7502, dtype: int64'

Looking at this error, I get basically what I've done; I'm trying to apply normal programming to this sort of pythonic, vectorized functionality and it doesn't work.

I always find vectorized syntax and programming confusing in Python, what with overloaded operators and whatever magic is happening behind that kind of syntax. It's very foreign to me coming from a JavaScript background.

Can someone explain the correct way to do this?

Edit:

I found that this works:

for i, row in analysis_df.iterrows():
    analysis_df.iat[i, checksum_count_col_index] = count_db[str(analysis_df.iat[i, checksum_col_index])]

But doesn't this approach sort of go against the vectorized approach you're supposed to use with DataFrames, especially with large datasets? I'd still be glad to learn the right way to do it.

J.Todd
  • 707
  • 1
  • 12
  • 34

1 Answers1

1

To create a new row that tells how many times each ImageChecksum repeats create a group with groupby('ImageChecksum') and use Pandas transform with the count function to produce a DataFrame containing the total group count for each row in that same group.

import pandas as pd
import numpy as np

#setup
np.random.seed(42)
d = {
    'FileName': np.random.randint(1e10,1e11,(7500,)),
    'UserSID':  0,
    'ImageSize': np.random.randint(10e3,10e6,(7500,)),
    'ImageChecksum': np.random.randint(1e3,2e3,(7500,))
}
df = pd.DataFrame(d)

#code
df['Checksum_Count'] = df.groupby('ImageChecksum')['ImageChecksum'].transform('count')
print(df)

Output from df

         FileName  UserSID  ImageSize  ImageChecksum  Checksum_Count
0     39190929843        0    2507537           1308               5
1     56298420295        0    6210513           1435               8
2     37684640889        0    8926726           1181              10
3     87738800342        0    7546371           1587              13
4     54922131914        0    9606013           1615               3
...           ...      ...        ...            ...             ...
7495  26663101742        0    2712348           1930               8
7496  69910223413        0    4020112           1284              10
7497  12097837305        0    9868594           1549               8
7498  46519177978        0    8563465           1196               9
7499  38437838102        0    5486946           1279               5

[7500 rows x 5 columns]
# Check how many times (and where) checksum value "1615" repeats
df[df['ImageChecksum'] == 1615]

         FileName  UserSID  ImageSize  ImageChecksum  Checksum_Count
4     54922131914        0    9606013           1615               3
689   65788124385        0    5926645           1615               3
7401  70505133093        0    5343748           1615               3
n1colas.m
  • 3,863
  • 4
  • 15
  • 28