4

Acquire raw data --> transform it and join it with other files --> email to end-users for review

What is the best approach?

ldacey
  • 518
  • 8
  • 16

3 Answers3

3

If 'employee_id'+'customer_id'+'timestamp' is long, and you are interested in something that is unlikely to have collisions, you can replace it with a hash. The range and quality of the hash will determine the probability of collisions. Perhaps the simplest is to use the builtin hash. Assuming your DataFrame is df, and the columns are strings, this is

(df.employee_id + df.customer_id + df.timestamp).apply(hash)

If you want greater control of the size and collision probability, see this piece on non-crypotgraphic hash functions in Python.


Edit

Building on an answer to this question, you could build 10-character hashes like this:

import hashlib
df['survey_id'] = (df.employee_id + df.customer_id + df.timestamp).apply(
    lambda s: hashlib.md5(s).digest().encode('base64')[: 10])
Community
  • 1
  • 1
Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • That was pretty simple. Is there a way to .apply some of the hashlib options if I wanted to try that library? – ldacey Mar 09 '16 at 19:09
  • 1
    @LanceDacey Sure, you could do it in many ways, including a lambda function. If you have some example question in mind, drop a comment here. – Ami Tavory Mar 09 '16 at 19:27
  • Thanks, I am new to encryption but I know I have seen some hashing which uses A-Z, a-z, 0-9 for example. My goal would be to combine those three columns into a new temporary column, and then hash it into a survey_id column. If we can make the id column a shorter string then that would be better (the combined column is pretty long, like mine would be ldacey201603081114239999999999). So df['survey_id'] = some hashlib function on df['combined'] – ldacey Mar 10 '16 at 12:56
  • 1
    @LanceDacey 1. I think that *non-cryptographic* hash functions are what you're looking for here. Cryptographic hash functions are extremely slow functions that attempt to make reverse engineering difficult - which is irrelevant to you. 2. Updated the answer - see if it answers your question. – Ami Tavory Mar 10 '16 at 13:34
  • Thanks for the extra insight and the example! I successfully implemented a solution now which seems to be working. – ldacey Mar 11 '16 at 13:04
  • One quick question too - these hash values are random right? Is there a similar method to create a hash which is just a different format of a long string? For example, what if I have two files and wanted to add an ID to represent those three columns and I wanted it to be the same on both files (which are read into memory at different times)? Would hash allow that? In pandas I do all of my joins based on 3 column merges, it would be pretty nice to just create a row ID which would be the same on all of the files with that same combination (timestamp, customer_id, employee_id) – ldacey Mar 11 '16 at 13:20
  • @LanceDacey I'd be happy to address your question, but it's a bit hard to do so in the comments section - I just don't understand what you're asking, exactly, and you can't format or show examples in the comments. Perhaps you could open a new question? If you drop a note here, I'll be happy to have a look at it (as I'm sure will others too). – Ami Tavory Mar 11 '16 at 20:46
  • Cool thanks, I posted a question here: http://stackoverflow.com/questions/35990352/creating-unique-integer-ids-which-are-equal-on-different-files – ldacey Mar 14 '16 at 14:33
0

If anyone is looking for a modularized function, save this into a file for use where needed. (for Pandas DataFrames)

df is your dataframe, columns is a list of columns to hash over, and name is the name of your new column with hash values.

Returns a copy of the original dataframe with a new column containing the hash of each row.

def hash_cols(df, columns, name="hash"):
    new_df = df.copy()
    def func(row, cols):
        col_data = []
        for col in cols:
            col_data.append(str(row.at[col]))

        col_combined = ''.join(col_data).encode()
        hashed_col = sha256(col_combined).hexdigest()
        return hashed_col

    new_df[name] = new_df.apply(lambda row: func(row,columns), axis=1)

    return new_df
0

I had a similar problem, that I solved thusly:

import hashlib
import pandas as pd
df = pd.DataFrame.from_dict({'mine': ['yours', 'amazing', 'pajamas'], 'have': ['something', 'nothing', 'between'], 'num': [1, 2, 3]})
hashes = []
for index, row in df.iterrows():
    hashes.append(hashlib.md5(str(row).encode('utf-8')).hexdigest())
# if you want the hashes in the df, 
# in my case, I needed them to form a JSON entry per row
df['hash'] = hashes

The results will form an md5 hash, but you can really use any hash function you need to.

sempervent
  • 833
  • 2
  • 11
  • 23