1

I have a large CSV file that has many short words and I need to change them into a full word. I found few posts here such as 1, 2 but most of these are either change the entire row or needs to do manually one by one.

My CSV file looks like:

infoID               messages
 111     we need to fix the car mag but we can't
 113         we need a shf to perform eng change
 115                      gr is needed to change
 116                            bat needs change
 117                    car towed for ext change 
 118                              car ml is high
  .
  .

My another file that has all the full word of short-form words and I want to use that to apply in my document and it is in the form of:

shf:shaft
gr:gear
ml:mileage

It would be great if you can provide your help with code that I can run in my side as well. Thanks

Bilgin
  • 499
  • 1
  • 10
  • 25

2 Answers2

4

Read your text file in as a Series that looks like

s

0    mag:magnitude
1        shf:shaft
2          gr:gear
3      bat:battery
4      ext:exhaust
5       ml:mileage
Name: 0, dtype: object

Split on colon and convert the series into a dictionary mapping key to its replacement:

dict(s.str.split(':').tolist())

# {'bat': 'battery',
#  'ext': 'exhaust',
#  'gr': 'gear',
#  'mag': 'magnitude',
#  'ml': 'mileage',
#  'shf': 'shaft'}

Use this to perform a replace operation with regex=True:

df['messages'].replace(dict(s.str.split(':').tolist()), regex=True)

0    we need to fix the car magnitude but we can't
1            we need a shaft to perform eng change
2                         gear is needed to change
3                             battery needs change
4                     car towed for exhaust change
5                              car mileage is high
Name: messages, dtype: object

Note that if these are strictly whole word replacements, you can extend this solution by converting the key strings into regular expressions that use word boundaries. For good measure, escape the string as well:

import re

mapping = {fr'\b{re.escape(k)}\b': v for k, v in s.str.split(':').tolist()}
df['messages'].replace(mapping, regex=True)

0    we need to fix the car magnitude but we can't
1            we need a shaft to perform eng change
2                         gear is needed to change
3                             battery needs change
4                     car towed for exhaust change
5                              car mileage is high
Name: messages, dtype: object
cs95
  • 379,657
  • 97
  • 704
  • 746
  • Just curious, why is the `regex=True` required? – razdi Jun 07 '19 at 00:28
  • Isn't this a bit error prone if there is any other word that contains the keys of dictionary? E.g. something like _great_ will be changed to _geareat_ as well. – Chris Jun 07 '19 at 00:31
  • 3
    @razdi Without it, pandas looks for exact match, so the whole line must match the searched text. – remeus Jun 07 '19 at 00:31
  • @Chris That's true, but without any context this is the simplest solution. If whole word replacements are required, then the solution can be extended with word boundaries. – cs95 Jun 07 '19 at 00:32
  • @cs95 I see. I too love the simplicity of your post. Thanks for the response :) – Chris Jun 07 '19 at 00:36
  • @razdi like Remeus said, `replace` matches the entire element by default. Enabling `regex=False` allow us to perform multiple substring replacements. – cs95 Jun 07 '19 at 00:36
  • @cs95 thanks for your comment. reading to series such as ```s``` you mentioned but i tried using ```s=pd.read_table('domainWords.txt',header=None)``` and it did not work. can you please let me know how your read the .txt file as series? thanks – Bilgin Jun 07 '19 at 01:25
  • @Bilgin No worries: `s = pd.read_csv(header=None, squeeze=True)` – cs95 Jun 07 '19 at 01:38
  • @cs95 sure, I will be happy. thank you so much. I just accept it. sorry if I am slow. I am new in this process and still learning. One last question I have is that, since you mentioned about the ```re```, how I can remove the special characters from entire documents such ```$ # & * ```. thanks – Bilgin Jun 07 '19 at 02:05
  • @Bilgin I once wrote a post about [how to remove punctuation](https://stackoverflow.com/a/50444347/4909087). The simplest solution would be to use `str.replace`, and you can then progressively try better solutions if you need more performance. Good luck! – cs95 Jun 07 '19 at 02:07
3

Another way using pd.Series.apply:

d = dict(i.split(':') for i in d.split('\n'))
#{'bat': 'battery',
# 'ext': 'exhaust',
# 'gr': 'gear',
# 'mag': 'magnitude',
# 'ml': 'mileage',
# 'shf': 'shaft'}

df['messages'].apply(lambda x : ' '.join(d.get(i, i) for i in x.split()), 1)

Output:

0    we need to fix the car magnitude but we can't
1            we need a shaft to perform eng change
2                         gear is needed to change
3                             battery needs change
4                     car towed for exhaust change
5                              car mileage is high
Name: messages, dtype: object
Chris
  • 29,127
  • 3
  • 28
  • 51