
I have a DataFrame with 29 columns and need to replace part of a string in some columns with a hash of that part.

An example of such a column:

ABSX, PLAN=PLAN_A ;SFFBJD
ADSFJ, PLAN=PLAN_B ;AHJDG
... 
... 

Code that captures the part of the string:

Test[14] = Test[14].replace({'(?<=PLAN=)([^"]+ ;)' :'hello'}, regex=True)

I want to replace 'hello' with a hash of the text matched by '(?<=PLAN=)([^"]+ ;)', but it doesn't work this way. Has anyone done this before without looping over the DataFrame line by line?

S3DEV
user3782604

1 Answer


Here is what I suggest:

import hashlib
import re
import pandas as pd
# First I reproduce a similar dataset
df = pd.DataFrame({"v1":["ABSX", "ADSFJ"],
                   "v2": ["PLAN=PLAN_A", "PLAN=PLAN_B"],
                   "v3": ["SFFBJD", "AHJDG"]})

# Extract the plan name that follows "=" and store its MD5 hash in a temporary column
r = re.compile(r'=[a-zA-Z_]+')
df["matched_el"] = ["".join(r.findall(w)) for w in df.v2]
df["matched_el"] = df["matched_el"].str.replace("=", "", regex=False)
df["matched_el"] = [hashlib.md5(w.encode()).hexdigest() for w in df.matched_el]
# Replace the plan name in v2 with its hash. regex=True is required in pandas >= 2.0,
# where str.replace treats the pattern as a literal string by default.
df["v2"] = df["v2"].str.replace("(=[a-zA-Z_]+)", "=", regex=True) + df["matched_el"]
df = df.drop(columns="matched_el")

Here is the result:

      v1                                     v2      v3
0   ABSX  PLAN=8d846f78aa0b0debd89fc1faafc4c40f  SFFBJD
1  ADSFJ  PLAN=3b9a3c8184829ca5571cb08c0cf73c8d   AHJDG
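
As an alternative, the extract-then-concatenate steps can probably be collapsed into a single pass, because `Series.str.replace` accepts a callable as the replacement when `regex=True`. The sketch below is an assumption-laden illustration, not the method used above: the helper name `hash_plan`, the sample Series `col`, and the pattern `(?<=PLAN=)[^ ;]+` (everything after `PLAN=` up to the first space or semicolon) are all made up for this example and would need adjusting to the real column.

```python
import hashlib
import pandas as pd

def hash_plan(match):
    # match.group(0) is the text matched after "PLAN=" (hypothetical helper)
    return hashlib.md5(match.group(0).encode()).hexdigest()

# Sample data shaped like the strings in the question (assumed layout)
col = pd.Series(["ABSX, PLAN=PLAN_A ;SFFBJD", "ADSFJ, PLAN=PLAN_B ;AHJDG"])

# The lookbehind keeps "PLAN=" in place; only the plan name is replaced by its hash
hashed = col.str.replace(r"(?<=PLAN=)[^ ;]+", hash_plan, regex=True)
print(hashed.tolist())
```

The callable is still invoked once per match in Python, so this mainly shortens the code rather than making it vectorized.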
Raphaele Adjerad
  • Nice. May I ask, why not use `df.apply()` for the regex and `hashlib` functions, rather than looping over the column via list comp.? – S3DEV Apr 07 '20 at 07:36
  • Hello, thanks a lot. I did not use it because, as I understand it, using apply on rows is not necessarily faster than a for loop (https://stackoverflow.com/questions/38938318/why-apply-sometimes-isnt-faster-than-for-loop-in-pandas-dataframe). If speed is not a concern here, I could definitely have used it. – Raphaele Adjerad Apr 07 '20 at 16:02
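
To make the comparison in the comments concrete, here is a minimal sketch of the two styles side by side. The helper `md5_hex` and the sample Series `names` are hypothetical. Both variants call the same Python function once per element, so neither is vectorized, and any speed difference between them is something to measure rather than assume.

```python
import hashlib
import pandas as pd

def md5_hex(text):
    # Hash a single string and return the hex digest (hypothetical helper)
    return hashlib.md5(text.encode()).hexdigest()

names = pd.Series(["PLAN_A", "PLAN_B"])

# List comprehension, as in the answer
via_list_comp = pd.Series([md5_hex(w) for w in names], index=names.index)

# Series.apply, as suggested in the comment
via_apply = names.apply(md5_hex)
```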