2

I have a pandas dataframe df with ids as strings: I am trying to create the new_claim and new_description columns

example df

Closest SO I found was Efficiently replace part of value from one column with value from another column in pandas using regex? but this uses split part, and since the description changes I was unable to generalize.

I can run a one off

date_reg = re.compile(r'\b'+df['old_id'][1]+r'\b')

df['new_claim'] = df['claim'].replace(to_replace=date_reg, value=df['external_id'], inplace=False)

But if I have

date_reg = re.compile(r'\b'+df['claim']+r'\b')

Then I get "TypeError: 'Series' objects are mutable, thus they cannot be hashed"

Another approach I took

df['new_claim'] = df['claim']

for i in range(5):
    old_id = df['old_id'][i]
    new_id = df['external_id'][i]

    df['new_claim'][i] = df['claim'][i].replace(to_replace=old_id,value=new_id)

which givesa TypeError: replace() takes no keyword arguments

Matt S
  • 55
  • 1
  • 6

1 Answers1

2

Using just the method pandas.replace():

df.old_id = df.old_id.fillna(0).astype('int')

list_old = list(map(str, df.old_id.tolist()))
list_new = list(map(str, df.external_id.tolist()))

df['new_claim'] = df.claim.replace(to_replace=['Claim ID: ' + e for e in list_old], value=['Claim ID: ' + e for e in list_new], regex=True)
df['new_description'] = df.description.replace(to_replace=['\* ' + e + '\\n' for e in list_old], value=['* ' + e + '\\n' for e in list_new], regex=True)

Produces the following output:

enter image description here

Daniel Labbe
  • 1,979
  • 3
  • 15
  • 20
  • Thank you, I got an unexpected output when old_id = 2838976 and external_id = 102372949 the new claim becomes Claim ID: 102371023948109, in most cases its working as expected – Matt S Feb 22 '19 at 16:15
  • @MattS, the problem is that a smaller pattern is match in a bigger ID. For instance, a value 11 is found in the 1111 id. – Daniel Labbe Feb 22 '19 at 16:35
  • If you could post a real case test, we can try to find another pattern to filter, like using the semicolon as part of the value. – Daniel Labbe Feb 22 '19 at 16:41
  • https://github.com/sedaghatfar/Python_projects/blob/master/test_data.csv – Matt S Feb 22 '19 at 18:49
  • Hi @MattS, I have improved the pattern to match. The problem was the empty old_id. Please accept and upvote my answer if it is everything correct! :) – Daniel Labbe Feb 22 '19 at 19:55
  • I'm still getting unexpected output, https://github.com/sedaghatfar/Python_projects/blob/master/claim_update.ipynb – Matt S Feb 22 '19 at 21:19
  • Few things to change, @MattS : – Daniel Labbe Feb 22 '19 at 22:23
  • you should convert the old_id column to int, with the following code df.old_id = df.old_id.fillna(0).astype('int') – Daniel Labbe Feb 22 '19 at 22:24
  • The pattern you are using is different to_replace=['Claim ID: : ' + e . There is an extra semicolon – Daniel Labbe Feb 22 '19 at 22:25
  • I changed also the pattern for the description column that has a \n in your dataset. Updated the code, @MattS – Daniel Labbe Feb 22 '19 at 22:26