How to replace all words in a series with a few specified words in Pandas,Python?

Question

I want to essentially find and replace using python.

However, I want to say if a cell contains something, then replace with what I want.

I know

str.replace('safsd','something else')

However, I am not sure how to specify how to get rid of EVERYTHING in that cell. Do I use *? I am not too familiar with that in python but I know in the bash shell * references to everything...

I have

df['Description']

that can contain 'optiplex 9010 for classes and research' which I just want to replace with 'optiplex 9010'. Or 'macbook air 11 with configurations...etc.' and I want simply 'macbook air 11'

I am aiming for...

if  Df['Description'].str.contains('macbook air 11')
  then Df['Description'].str.replace(' (not sure what I put in here) , 'mabook air 11')

Any help/ideas?

Thanks!

**Additional info that may be helfpul...

I am working with thousands of different user inputs. So the 'Descriptions' of what someone has purchased is not going to be the same at all in context, wording, structure, etc. etc. I can either manually go into excel and filter by what contains 'optiplex 9010' and then replace everything with a simple description , doing the same for macbooks, etc.

I figured there may be some simpler way using pandas/python .str.contains and .str.replace.

Hope that extra info helps! Let me know

Is there anything consistent about these strings? You could always take the first 3 words, or you could split on some delimiter character, etc... But if they are always different you may have to create special rules for each, why not just fix them at that point? — bnjmn, Feb 20 '14 at 21:35
I think regex is probably not the correct way to determine which topic it's in, it's possible you can get away with it though... — Andy Hayden, Feb 20 '14 at 21:43
You would probably be better off building a dict to hold your reduced/normalised key values with your desired replacement strings, then do a parse of your current string values by reducing/normalising them and then perform a lookup on the dict and replace the current value with your dict value. This is no different to what partial string matches perform in search engines. It is not something that is basic. You may need to use a library like nltk or similar to perform the initial match — EdChum, Feb 20 '14 at 21:46

score 5 · Accepted Answer · edited May 23 '17 at 12:14

5

str.replace takes a regular expression, for example 'macbook air 11' followed zero (or more) (*) of any characters (.) (you could also flag to be case insensitive):

Df['Description'].str.replace('macbook air 11.*' , 'macbook air 11')

A little primer on regex can be found here.

However, you might be better off, especially if you have already have a complete list of topics, to normalize the names (e.g. using fuzzywuzzy like in this question / answer):

from fuzzywuzzy.fuzz import partial_ratio
Df['Description'].apply(lambda x: max(topics, key=lambda t: partial_ratio(x, t)))

edited May 23 '17 at 12:14

Community

1
1

answered Feb 20 '14 at 21:36

Andy Hayden

359,921
101
625
535

How would I represent everything before the 'macbook air 11' for example if the description was 'one computer-macbook air 11' and I wanted to replace this with 'macbook air 11' – Alexis Feb 20 '14 at 22:21
@Alexis `.*macbook air 11.*` should work (which is equivalent to contains), I recommend giving the fuzzywuzzy solution a try! – Andy Hayden Feb 20 '14 at 22:28
It's an interesting route, I will definitely take a look at it thanks – Alexis Feb 20 '14 at 22:42

dawg · Answer 2 · 2014-02-20T22:05:25.960

0

You can use a regex on a Pandas series like so.

First create a dumb series of strings:

>>> import re
>>> import pandas as pd
>>> s=pd.Series(['Value {} of 3'.format(e) for e in range(1,4)])
>>> s
0     Value 1 of 3
1     Value 2 of 3
2     Value 3 of 3

Then use a regex sub to replace the string value of all digits with 5 and lower case the string:

>>> s.apply(lambda s: re.sub(r'\d+', '5', s).lower())
0    value 5 of 5
1    value 5 of 5
2    value 5 of 5
dtype: object

Of course if you want to just replace all, you can use a regex or string replace:

>>> s.apply(lambda s: re.sub(r'^.*$', 'GONE!!!', s))
0    GONE!!!
1    GONE!!!
2    GONE!!!
dtype: object
>>> s.apply(lambda s: s.replace(s, 'GONE!!!'))
0    GONE!!!
1    GONE!!!
2    GONE!!!
dtype: object

edited Feb 20 '14 at 22:05

answered Feb 20 '14 at 21:40

dawg

98,345
23
131
206

Interesting, How would I get it to replace everything in the cell...essentially deleting the content and then putting back what I want. I see that all of the 1's are replaced by 5 but what I would want would be to have everything that has a 1 replace it with only a five so 10 would become 5 and 11 would be come 5 only not 55 , etc – Alexis Feb 20 '14 at 21:45
Just do your regex accordingly. – dawg Feb 20 '14 at 21:50
Awesome! I was missing the regex symbols to represent 'everything' in the cell – Alexis Feb 20 '14 at 22:11

score -1 · Answer 3 · answered Feb 20 '14 at 21:55

This is a perfect example of a problem that can be solved using regexes. And I also find that a situation like this is a great excuse to learn about them! Here is an incredibly detailed tutorial on how to use regexes http://www.regular-expressions.info/tutorial.html

How to replace all words in a series with a few specified words in Pandas,Python?

3 Answers3