Acquire raw data --> transform it and join it with other files --> email to end-users for review
What is the best approach?
If 'employee_id'+'customer_id'+'timestamp'
is long, and you are interested in something that is unlikely to have collisions, you can replace it with a hash. The range and quality of the hash will determine the probability of collisions. Perhaps the simplest is to use the builtin hash
. Assuming your DataFrame is df
, and the columns are strings, this is
(df.employee_id + df.customer_id + df.timestamp).apply(hash)
If you want greater control of the size and collision probability, see this piece on non-crypotgraphic hash functions in Python.
Edit
Building on an answer to this question, you could build 10-character hashes like this:
import hashlib
df['survey_id'] = (df.employee_id + df.customer_id + df.timestamp).apply(
lambda s: hashlib.md5(s).digest().encode('base64')[: 10])
If anyone is looking for a modularized function, save this into a file for use where needed. (for Pandas DataFrames)
df
is your dataframe, columns
is a list of columns to hash over, and name
is the name of your new column with hash values.
Returns a copy of the original dataframe with a new column containing the hash of each row.
def hash_cols(df, columns, name="hash"):
new_df = df.copy()
def func(row, cols):
col_data = []
for col in cols:
col_data.append(str(row.at[col]))
col_combined = ''.join(col_data).encode()
hashed_col = sha256(col_combined).hexdigest()
return hashed_col
new_df[name] = new_df.apply(lambda row: func(row,columns), axis=1)
return new_df
I had a similar problem, that I solved thusly:
import hashlib
import pandas as pd
df = pd.DataFrame.from_dict({'mine': ['yours', 'amazing', 'pajamas'], 'have': ['something', 'nothing', 'between'], 'num': [1, 2, 3]})
hashes = []
for index, row in df.iterrows():
hashes.append(hashlib.md5(str(row).encode('utf-8')).hexdigest())
# if you want the hashes in the df,
# in my case, I needed them to form a JSON entry per row
df['hash'] = hashes
The results will form an md5 hash, but you can really use any hash function you need to.