
I am trying to replace certain strings within a DataFrame column using a lookup table stored in a separate file.

I have a DataFrame that looks like the following (this is a very small version of a massive DataFrame that I have).

coffee_directions_df

Utterance                         Frequency   

Directions to Starbucks           1045
Directions to Tullys              1034
Give me directions to Tullys      986
Directions to Seattles Best       875
Show me directions to Dunkin      812
Directions to Daily Dozen         789
Show me directions to Starbucks   754
Give me directions to Dunkin      612
Navigate me to Seattles Best      498
Display navigation to Starbucks   376
Direct me to Starbucks            201

The DF shows utterances made by people and the frequency of utterances.

I.e., "Directions to Starbucks" was uttered 1045 times.

I have a second table stored in an Excel file, coffee_donut.xlsx, that I want to import and use to replace certain strings (similar to what the question "Replace words by checking from pandas dataframe" asked).

coffee_donut

Token              Synonyms

Starbucks          Coffee
Tullys             Coffee
Seattles Best      Coffee
Dunkin             Donut
Daily Dozen        Donut

And ultimately, I want the dataframe to look like this:

coffee_donut_df

Utterance                        Frequency   

Directions to Coffee             1045
Directions to Coffee             1034
Give me directions to Coffee     986
Directions to Coffee             875
Show me directions to Donut      812
Directions to Donut              789
.
.
.

I followed the previous question's steps, but I got stuck at the last part:

import re
import pandas as pd
sdf = pd.read_excel('C:\coffee_donut.xlsx')
rep = dict(zip(sdf.Token, sdf.Synonyms)) #convert into dictionary

rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
rep = pattern.sub(lambda m: rep[re.escape(m.group(0))], **coffee_directions_df**)

print rep

How do I apply rep to the DataFrame? I'm sorry if this is a basic question. I'd really appreciate your help.

Thanks!!

user_seaweed

1 Answer

You almost had it! Here's a solution that reuses the regex object and lambda function in your current code.

Instead of your last line (rep = pattern.sub(...)), run this:

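# rep's keys were passed through re.escape, so escape the matched text
# before the lookup (otherwise 'Seattles Best' raises a KeyError).
# regex=True is needed on newer pandas versions, where regex defaults to False.
coffee_directions_df['Utterance'] = \
coffee_directions_df['Utterance'].str.replace(
    pattern, lambda m: rep[re.escape(m.group(0))], regex=True)

# Confirm replacement
coffee_directions_df
                      Utterance  Frequency
0          Directions to Coffee       1045
1          Directions to Coffee       1034
2  Give me directions to Coffee        986
3          Directions to Coffee        875
...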

This works because pd.Series.str.replace can accept a compiled regex object and a function; see the docs for more.
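For reference, here is a minimal end-to-end sketch in Python 3. It is only an illustration under a few assumptions: the two DataFrames are built in memory instead of being read from coffee_donut.xlsx, the result is written to a copy named coffee_donut_df, and the lookup dict keeps the raw tokens as keys (only the pattern is escaped), which differs slightly from the code in the question.

```python
import re
import pandas as pd

# Small in-memory stand-in for the real utterance data (assumption).
coffee_directions_df = pd.DataFrame({
    "Utterance": [
        "Directions to Starbucks",
        "Give me directions to Tullys",
        "Directions to Seattles Best",
        "Show me directions to Dunkin",
        "Directions to Daily Dozen",
    ],
    "Frequency": [1045, 986, 875, 812, 789],
})

# Stand-in for sdf = pd.read_excel("coffee_donut.xlsx") (assumption).
sdf = pd.DataFrame({
    "Token": ["Starbucks", "Tullys", "Seattles Best", "Dunkin", "Daily Dozen"],
    "Synonyms": ["Coffee", "Coffee", "Coffee", "Donut", "Donut"],
})

# Keep the raw tokens as dict keys; escape them only when building the pattern.
rep = dict(zip(sdf["Token"], sdf["Synonyms"]))
pattern = re.compile("|".join(re.escape(token) for token in rep))

# Work on a copy so the original frame stays untouched.
coffee_donut_df = coffee_directions_df.copy()
coffee_donut_df["Utterance"] = coffee_donut_df["Utterance"].str.replace(
    pattern,
    lambda m: rep[m.group(0)],  # the matched text is a raw token, so it is a valid key
    regex=True,                 # explicit, since newer pandas defaults to regex=False
)

print(coffee_donut_df)
```

Because the dict keys are never escaped here, the matched text can be looked up directly, and no second call to re.escape is needed inside the lambda.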

Peter Leimbigler
  • Thanks! I'm getting an error message ['dict' object has no attribute 'iteritems'] for [rep = dict((re.escape(k)...]. Is this supposed to be something different when I'm trying to run it through a dataframe? Thanks again for any suggestions! – user_seaweed Apr 05 '18 at 12:56
  • ah, you must be using Python 3, whereas that code from the other question is Python 2. Change `iteritems` to `items` and you should be good to go! – Peter Leimbigler Apr 05 '18 at 14:58
  • Thanks @Peter, I changed it to `items`, but then I got the error message `KeyError: '一'`. Do you have any idea how I can fix this too? – user_seaweed Apr 05 '18 at 16:42
  • updated the question here https://stackoverflow.com/questions/49677554/dict-items-getting-error-message-keyerror-%e4%b8%80-in-pandas-python – user_seaweed Apr 05 '18 at 16:46
  • That means the replacement dict `rep` does not have the character `一` among its keys. Try re-running the entire code again. If that doesn't work, try searching the file `coffee_donut.xlsx` for that character, and investigating from there. – Peter Leimbigler Apr 05 '18 at 16:47
  • If I have an Excel file that includes Japanese text (now I understand where 一 (1) is coming from), do I need to encode the Excel file as UTF-8? Or is it better to use a CSV file instead for UTF-8? – user_seaweed Apr 05 '18 at 16:54
  • I don't have much experience with character encodings, but UTF-8 is definitely a good idea to try :) – Peter Leimbigler Apr 05 '18 at 17:33
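
One possible cause of a KeyError like the one discussed in these comments, offered as a guess and not a confirmed diagnosis: when the dict keys have been passed through re.escape, a lookup that uses the unescaped match text can miss. The sketch below reproduces that mismatch with a token containing a space. The sample dict and the helper names lookup_raw and lookup_escaped are made up for illustration. As far as I know, Python versions before 3.7 also escaped non-ASCII characters such as 一, which would trigger the same mismatch for Japanese tokens.

```python
import re

# Hypothetical token -> synonym mapping, including a Japanese token.
raw = {"Seattles Best": "Coffee", "一": "1"}

# Python 3: use .items() (Python 2's .iteritems() no longer exists).
escaped = {re.escape(k): v for k, v in raw.items()}
pattern = re.compile("|".join(escaped))

def lookup_raw(match):
    # Looks up the matched text as-is; misses whenever escaping changed the key.
    return escaped[match.group(0)]

def lookup_escaped(match):
    # Escapes the matched text first, so it lines up with the escaped keys.
    return escaped[re.escape(match.group(0))]

text = "Directions to Seattles Best"

try:
    pattern.sub(lookup_raw, text)
    raw_lookup_failed = False
except KeyError:
    # The stored key is the escaped form of the token, not the raw text.
    raw_lookup_failed = True

fixed = pattern.sub(lookup_escaped, text)
print(raw_lookup_failed, fixed)
```

If this is indeed the cause, either escaping the match inside the lambda or keeping unescaped keys in the dict should make the lookup succeed, independent of how the source file is encoded.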