225

I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.

Data looks like:

    time    result
1    09:00   +52A
2    10:00   +62B
3    11:00   +44a
4    12:00   +30b
5    13:00   -110a

I need to trim these data to:

    time    result
1    09:00   52
2    10:00   62
3    11:00   44
4    12:00   30
5    13:00   110

I tried .str.lstrip('+-') and .str.rstrip('aAbBcC'), but got an error:

TypeError: wrapper() takes exactly 1 argument (2 given)

Any pointers would be greatly appreciated!

cs95
Yannan Wang

10 Answers

262
data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
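If you also need the result as a number afterwards, one possible follow-up (a sketch, not part of the original answer) is to chain an astype conversion:

# Strip the sign and trailing letters, then convert to integers.
# astype(int) assumes every cleaned value is a valid integer string.
data['result'] = (
    data['result']
    .map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
    .astype(int)
)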
eumiro
  • thx! that works. I'm still wrapping my mind around map(), not sure when to use or not use it... – Yannan Wang Dec 03 '12 at 12:21
  • I was pleased to see that this method also works with the replace function. – BKay Jan 16 '13 at 21:27
  • @eumiro how do you apply this result if iterating each column? – medev21 Sep 01 '16 at 19:50
  • Can I use this function to replace a number such as the number 12? If I do x.lstrip('12') it takes out all 1 and 2s. – Dave Oct 26 '16 at 21:19
  • Hi @eumiro, I want to remove punctuation (only dot .) only after the letter `c` and `p` from a string. How can I do that? – Roy Mar 11 '22 at 00:00
  • Now it warns: "A value is trying to be set on a copy of a slice from a DataFrame". I tried to solve it but couldn't; is there any workaround? – Javier Huerta Jun 08 '22 at 15:37
  • @JavierHuerta it works fine for me. Do you have like data['result'] = data['result'].map() with the equal sign? – Gobrel Jun 15 '22 at 12:31
247

How do I remove unwanted parts from strings in a column?

6 years after the original question was posted, pandas now has a good number of "vectorised" string functions that can succinctly perform these string manipulation operations.

This answer will explore some of these string functions, suggest faster alternatives, and go into a timings comparison at the end.


.str.replace

Specify the substring/pattern to match, and the substring to replace it with.

pd.__version__
# '0.24.1'

df    
    time result
1  09:00   +52A
2  10:00   +62B
3  11:00   +44a
4  12:00   +30b
5  13:00  -110a
df['result'] = df['result'].str.replace(r'\D', '', regex=True)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If you need the result converted to an integer, you can use Series.astype,

df['result'] = df['result'].str.replace(r'\D', '', regex=True).astype(int)

df.dtypes
time      object
result     int64
dtype: object

If you don't want to modify df in-place, use DataFrame.assign:

df2 = df.assign(result=df['result'].str.replace(r'\D', '', regex=True))
df
# Unchanged

.str.extract

Useful for extracting the substring(s) you want to keep.

df['result'] = df['result'].str.extract(r'(\d+)', expand=False)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

With extract, it is necessary to specify at least one capture group. expand=False will return a Series with the captured items from the first capture group.
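For illustration, a minimal sketch of the difference between the two settings:

df['result'].str.extract(r'(\d+)', expand=False)  # returns a Series
df['result'].str.extract(r'(\d+)', expand=True)   # returns a one-column DataFrame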


.str.split and .str.get

Splitting works, assuming all your strings follow this consistent structure (a sign, then digits, then letters).

# df['result'] = df['result'].str.split(r'\D').str[1]
df['result'] = df['result'].str.split(r'\D').str.get(1)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

Not recommended if you are looking for a general solution.
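To see why, consider a hypothetical string without the leading sign (not in OP's data): the digits then land at index 0 rather than index 1, and .str.get(1) returns the wrong piece.

import pandas as pd

s = pd.Series(['+52A', '52A'])
s.str.split(r'\D').str.get(1)
# 0    52
# 1          <- empty string: '52A' splits into ['52', '']
# dtype: object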


If you are satisfied with the succinct and readable str accessor-based solutions above, you can stop here. However, if you are interested in faster, more performant alternatives, keep reading.


Optimizing: List Comprehensions

In some circumstances, list comprehensions should be favoured over pandas string functions. This is because string functions are inherently hard to vectorise (in the true sense of the word), so most string and regex functions are only wrappers around loops, with more overhead.

My write-up, Are for-loops in pandas really bad? When should I care?, goes into greater detail.

The str.replace option can be re-written using re.sub

import re

# Pre-compile your regex pattern for more performance.
p = re.compile(r'\D')
df['result'] = [p.sub('', x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

The str.extract example can be re-written using a list comprehension with re.search,

p = re.compile(r'\d+')
df['result'] = [p.search(x)[0] for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If NaNs or no-matches are a possibility, you will need to re-write the above to include some error checking. I do this using a function.

import numpy as np

def try_extract(pattern, string):
    try:
        m = pattern.search(string)
        return m.group(0)
    except (TypeError, ValueError, AttributeError):
        return np.nan
 
p = re.compile(r'\d+')
df['result'] = [try_extract(p, x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

We can also re-write @eumiro's and @MonkeyButter's answers using list comprehensions:

df['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in df['result']]

And,

df['result'] = [x[1:-1] for x in df['result']]

Same rules for handling NaNs, etc, apply.
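For example, a NaN-safe version of the lstrip/rstrip comprehension might look like this (a sketch that simply passes non-string values through as NaN):

import numpy as np

df['result'] = [
    x.lstrip('+-').rstrip('aAbBcC') if isinstance(x, str) else np.nan
    for x in df['result']
]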


Performance Comparison

[Performance comparison graph: runtime of each function vs. number of rows]

Graphs generated using perfplot. Full code listing, for your reference. The relevant functions are listed below.

Some of these comparisons are unfair because they take advantage of the structure of OP's data, but take from it what you will. One thing to note is that every list comprehension function is either faster than, or comparable to, its equivalent pandas variant.

Functions

def eumiro(df):
    return df.assign(
        result=df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')))
    
def coder375(df):
    return df.assign(
        result=df['result'].replace(r'\D', r'', regex=True))

def monkeybutter(df):
    return df.assign(result=df['result'].map(lambda x: x[1:-1]))

def wes(df):
    return df.assign(result=df['result'].str.lstrip('+-').str.rstrip('aAbBcC'))

def cs1(df):
    return df.assign(result=df['result'].str.replace(r'\D', '', regex=True))

def cs2_ted(df):
    # `str.extract`-based solution, similar to @Ted Petrou's, so timed together.
    return df.assign(result=df['result'].str.extract(r'(\d+)', expand=False))

# Pre-compiled patterns used by the list comprehension variants.
p1 = re.compile(r'\D')
p2 = re.compile(r'\d+')

def cs1_listcomp(df):
    return df.assign(result=[p1.sub('', x) for x in df['result']])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df['result']])

def cs_eumiro_listcomp(df):
    return df.assign(
        result=[x.lstrip('+-').rstrip('aAbBcC') for x in df['result']])

def cs_mb_listcomp(df):
    return df.assign(result=[x[1:-1] for x in df['result']])
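
For reference, a sketch of how such a perfplot benchmark could be wired up (the setup data and n_range here are assumptions, not the author's exact listing):

import perfplot

perfplot.show(
    # Build an n-row frame by repeating an OP-style value (assumed setup).
    setup=lambda n: pd.DataFrame({'result': ['+52A'] * n}),
    kernels=[eumiro, coder375, monkeybutter, wes, cs1, cs2_ted,
             cs1_listcomp, cs2_listcomp, cs_eumiro_listcomp, cs_mb_listcomp],
    n_range=[2**k for k in range(14)],
    xlabel='Number of rows',
    equality_check=None,  # the kernels return DataFrames, so skip the equality check
)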
cs95
  • any workaround to avoid the settingwithcopywarning: `Try using .loc[row_indexer,col_indexer] = value instead` – PV8 Oct 02 '19 at 13:50
  • @PV8 not sure about your code, but check this out: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas/53954986#53954986 – cs95 Oct 02 '19 at 14:44
  • For anyone that's new to REGEX like me, \D is the same as [^\d] (anything that's not a digit) [from here](http://www.regular-expressions.info/shorthand.html). So we're basically replacing all the non-digits in the string with nothing. – Rishi Latchmepersad May 02 '20 at 11:24
  • Such a complete answer should deserve to be THE answer. – Charlie Sep 02 '20 at 15:23
  • If you are using a newer version of pandas and using regex, you should pass in the argument `regex=True`, otherwise your regular expression might not work correctly. For example, `df['result'] = df['result'].str.replace(r'\D', '')` would be `df['result'] = df['result'].str.replace(r'\D', '', regex=True)`. (See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html for more information). – Alexandre Gaubil May 16 '23 at 13:39
  • @AlexandreGaubil thanks! updated, feel free to suggest any other edits as necessary and i can take a look. – cs95 May 17 '23 at 07:30
63

I'd use the pandas replace function; it's very simple and powerful, as you can use regex. Below I'm using the regex \D to remove any non-digit characters, but obviously you could get quite creative with regex.

data['result'].replace(regex=True, inplace=True, to_replace=r'\D', value=r'')
Coder375
  • I tried this, and it doesn't work. I'm wondering if it only works when you want to replace an entire string instead of just replacing a substring part. – bgenchel May 04 '17 at 22:57
  • @bgenchel - I used this method to replace part of a string in a pd.Series: `df.loc[:, 'column_a'].replace(regex=True, to_replace="my_prefix", value="new_prefix")`. This will convert a string like "my_prefixaaa" to "new_prefixaaa". – jkr Jul 20 '17 at 20:12
  • what does the r do in to_replace=r'\D'? – Luca Guarro Jan 10 '20 at 00:10
  • @LucaGuarro from the python docs: "The r prefix, making the literal a raw string literal, is needed in this example because escape sequences in a normal “cooked” string literal that are not recognized by Python, as opposed to regular expressions, now result in a DeprecationWarning and will eventually become a SyntaxError." – Coder375 Jan 17 '20 at 23:18
41

In the particular case where you know the number of positions that you want to remove from the dataframe column, you can use string indexing inside a lambda function to get rid of those parts:

Last character:

data['result'] = data['result'].map(lambda x: str(x)[:-1])

First two characters:

data['result'] = data['result'].map(lambda x: str(x)[2:])
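
The same slicing is also available directly on the .str accessor, which avoids the Python-level lambda (assuming the column already holds strings):

data['result'] = data['result'].str[:-1]   # drop the last character
data['result'] = data['result'].str[2:]    # drop the first two characters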
prl900
  • I need to trim the geo coordinates to 8 characters (including (.) and (-)), and if they are shorter than 8, I need to append '0's so that all the coordinates are 8 characters. What is the simplest way to do so? – Sitz Blogz Feb 28 '17 at 18:36
  • I don't fully understand your problem but you might need to change the lambda function to something like "{0:.8f}".format(x) – prl900 Mar 01 '17 at 20:56
  • Thank you so much for the reply. In simple words: I have a dataframe with geo coordinates (latitude & longitude) as two columns. The values are longer than 8 characters and I want to keep only the first 8 characters, which should include the (-) and (.) as well. – Sitz Blogz Mar 02 '17 at 03:21
18

There's a bug here: currently you cannot pass arguments to str.lstrip and str.rstrip:

http://github.com/pydata/pandas/issues/2411

EDIT: 2012-12-07 this works now on the dev branch:

In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
Out[8]: 
1     52
2     62
3     44
4     30
5    110
Name: result
Wes McKinney
11

A very simple method would be to use the extract method to select all the digits. Simply supply it the regular expression '\d+', which extracts any number of digits.

df['result'] = df.result.str.extract(r'(\d+)', expand=True).astype(int)
df

    time  result
1  09:00      52
2  10:00      62
3  11:00      44
4  12:00      30
5  13:00     110
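
If you are assigning back to a single column, expand=False (which returns a Series rather than a one-column DataFrame) is arguably the more direct fit; a minor variant of the above:

df['result'] = df['result'].str.extract(r'(\d+)', expand=False).astype(int)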
cs95
Ted Petrou
7

Suppose your DataFrame has those extra characters in between the numbers as well, as in the last entry:

  result   time
0   +52A  09:00
1   +62B  10:00
2   +44a  11:00
3   +30b  12:00
4  -110a  13:00
5   3+b0  14:00

You can try str.replace to remove characters not only from the start and end but also from in between.

DF['result'] = DF['result'].str.replace(r'\+|a|b|\-|A|B', '', regex=True)

Output:

  result   time
0     52  09:00
1     62  10:00
2     44  11:00
3     30  12:00
4    110  13:00
5     30  14:00
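
As a side note, the alternation could equivalently be written as a character class, which is a little more compact (same behaviour, just a stylistic variant):

DF['result'] = DF['result'].str.replace(r'[+\-aAbB]', '', regex=True)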
LOrD_ARaGOrN
6

I often use list comprehensions for these types of tasks because they tend to be faster.

There can be big differences in performance between the various methods for doing things like this (i.e. modifying every element of a Series within a DataFrame). Often a list comprehension is fastest; see the code race below for this task:

import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop
tim654321
0

Try this using a regular expression:

import re
data['result'] = data['result'].map(lambda x: re.sub('[-+A-Za-z]', '', x))
WarBoy
0

using "str.replace" is more fast than lambda and map when your data dimention is large:

your_data["result"]=your_data["result"].str.replace("+","")
your_data["result"]=your_data["result"].str.replace("-","")
Ali karimi