I have a dataframe analysis_df
with the following structure:
 ...          FileName  UserSID  ImageSize  ImageChecksum
   0  2197173372750839        0   17068032       11781483
   1  5966634109289989        0      24576          42058
 ...               ...      ...        ...            ...
7500  6817023204572264        0      22000         123456
7501  6817023204572264        0      22000         123456
and need to create a new column that tells how many times each ImageChecksum
value repeats in the table. So I count them:
count_db = {}
for checksum in analysis_df['ImageChecksum']:
    checksum = str(checksum)
    if checksum in count_db:
        count_db[checksum] += 1
    else:
        count_db[checksum] = 1
print(f"count_db: {count_db}")
output:
count_db: {'11781483': 100, '42058': 100, '56817': 100, '491537': 100, '195631': 100, '146603': 100, '104915': 100, ... [snip] ..., '123456': 2}
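For comparison, the counting loop above can probably be replaced by pandas' built-in `Series.value_counts()`. This is only a minimal sketch: the small `analysis_df` below is a made-up stand-in for the real table, and the resulting dictionary has integer keys rather than the string keys the loop produces.

```python
import pandas as pd

# Toy stand-in for analysis_df; the values are invented for illustration.
analysis_df = pd.DataFrame({"ImageChecksum": [11781483, 42058, 123456, 123456]})

# value_counts() returns a Series indexed by checksum,
# with the number of occurrences as values.
counts = analysis_df["ImageChecksum"].value_counts()

# Converting to a dict gives the same mapping as the loop,
# except that the keys stay integers instead of strings.
count_db = counts.to_dict()
print(count_db)
```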
According to an answer to a related, but not quite identical, question, I can build a new column from existing ones like this:
import pandas as pd
import numpy as np
df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])
df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal
But when I try to apply this solution to my own case, I get an error:
analysis_df['ImageChecksum_Count'] = count_db[str(analysis_df['ImageChecksum'])]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [22], in <cell line: 21>()
17 count_db[checksum] = 1
19 print(f"count_db: {count_db}")
---> 21 analysis_df['ImageChecksum_Count'] = count_db[str(analysis_df['ImageChecksum'])]
23 analysis_df.head()
KeyError: '0 11781483\n1 42058\n2 56817\n3 491537\n4 195631\n ... \n7497 125321\n7498 57364\n7499 0\n7500 123456\n7501 123456\nName: ImageChecksum, Length: 7502, dtype: int64'
Looking at this error, I basically understand what I did wrong: I'm trying to apply ordinary element-by-element programming to this sort of Pythonic, vectorized functionality, and it doesn't work.
I always find vectorized syntax and programming confusing in Python, what with overloaded operators and whatever magic is happening behind that kind of syntax. It's very foreign to me coming from a JavaScript background.
Can someone explain the correct way to do this?
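One way to read the KeyError: `str()` applied to a whole column returns the printed form of the entire Series as a single string, and that string is then used as one dictionary key. A per-element lookup can be sketched with `Series.map()`, which accepts a dict. The data and the `count_db` contents below are made up for illustration; the string keys mirror the counting loop in the question.

```python
import pandas as pd

# Toy stand-ins for analysis_df and count_db; values are invented.
analysis_df = pd.DataFrame({"ImageChecksum": [11781483, 42058, 123456, 123456]})
count_db = {"11781483": 1, "42058": 1, "123456": 2}

# str() on a Series yields one long string (index, values, dtype),
# so it can never match a single checksum key.
whole_repr = str(analysis_df["ImageChecksum"])
print(whole_repr in count_db)  # False

# map() looks up each element separately; astype(str) is needed here
# only because count_db was built with string keys.
analysis_df["ImageChecksum_Count"] = (
    analysis_df["ImageChecksum"].astype(str).map(count_db)
)
print(analysis_df)
```

Any checksum missing from the dictionary would come back as NaN rather than raising an error.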
Edit:
I found that this works:
for i, row in analysis_df.iterrows():
    analysis_df.iat[i, checksum_count_col_index] = count_db[str(analysis_df.iat[i, checksum_col_index])]
But doesn't this approach sort of go against the vectorized approach you're supposed to use with DataFrames, especially with large datasets? I'd still be glad to learn the right way to do it.
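As a possible vectorized alternative to the `iterrows()` loop, the count column can be produced in one step with `groupby(...).transform("size")`, which skips the intermediate dictionary entirely. This is a sketch on a made-up four-row frame, not a benchmark against the real 7502-row table.

```python
import pandas as pd

# Toy stand-in for analysis_df; the values are invented for illustration.
analysis_df = pd.DataFrame({"ImageChecksum": [11781483, 42058, 123456, 123456]})

# Group rows by checksum, then broadcast each group's size back
# onto every row of that group, keeping the original row order.
analysis_df["ImageChecksum_Count"] = (
    analysis_df.groupby("ImageChecksum")["ImageChecksum"].transform("size")
)
print(analysis_df)
```

The result has one count per row, aligned with the original index, so no explicit row loop or positional `iat` indexing is required.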