Strip / trim all strings of a dataframe

Question

Cleaning the values of a multitype data frame in python/pandas, I want to trim the strings. I am currently doing it in two instructions :

import pandas as pd

df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])

df.replace('^\s+', '', regex=True, inplace=True) #front
df.replace('\s+$', '', regex=True, inplace=True) #end

df.values

This is quite slow, what could I improve ?

This is the best answer, just logged in to up-vote the answer by @MaxU — Linkon, Sep 28 '20 at 15:54
I added benchmarks for answers below. Please be aware that the one in the comments here is very slow. Use appropriately. — Jamie Marshall, Jul 08 '22 at 16:47

score 271 · Accepted Answer · edited Aug 09 '18 at 12:36

271

You can use DataFrame.select_dtypes to select string columns and then apply function str.strip.

Notice: Values cannot be types like dicts or lists, because their dtypes is object.

df_obj = df.select_dtypes(['object'])
print (df_obj)
0    a  
1    c  

df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
print (df)

   0   1
0  a  10
1  c   5

But if there are only a few columns use str.strip:

df[0] = df[0].str.strip()

edited Aug 09 '18 at 12:36

iacob

20,084
6
92
119

answered Dec 03 '16 at 17:25

jezrael

822,522
95
1,334
1,252

2

And SettingWithCopyWarning should be ignored in this case as explained https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas?answertab=oldest#tab-top – Hrvoje Dec 21 '18 at 09:09
2

If you have strings such as N/A you will want to add the parameter na_action="ignore") when doing df_obj.apply, or else pandas will convert those values to empty strings – Justin Furuness Jun 21 '22 at 22:56

Jonathan B. · Answer 2 · 2019-06-13T15:54:43.123

136

Money Shot

Here's a compact version of using applymap with a straightforward lambda expression to call strip only when the value is of a string type:

df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

Full Example

A more complete example:

import pandas as pd


def trim_all_columns(df):
    """
    Trim whitespace from ends of each value across all series in dataframe
    """
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    return df.applymap(trim_strings)


# simple example of trimming whitespace from data elements
df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
df = trim_all_columns(df)
print(df)


>>>
   0   1
0  a  10
1  c   5

Working Example

Here's a working example hosted by trinket: https://trinket.io/python3/e6ab7fb4ab

edited Jun 13 '19 at 15:54

answered Jul 23 '17 at 22:34

Jonathan B.

2,742
1
21
18

1

Hi @DaleKube ... I just tried this fresh on a new machine just as a sanity check and I get the same results as posted in the answer. Can you confirm whether you are using Python2 or Python3? I'm only using Python3 these days, but perhaps that might be a factor. If so, I'll note that in my posted answer if you are able to confirm. Thanks! – Jonathan B. Nov 21 '17 at 21:40
1

I deleted my comment. I found a bug in my code, and I can confirm that it now works like a charm. FYI, I am using Python 3. Sorry for the trouble. – Dale Kube Nov 22 '17 at 17:14
1

you should use `type(x) == str`, not `type(x) is str` – fjsj Jun 12 '19 at 20:09
1

@fjsj Thanks for the nudge. I've updated the example using PEP8 guidance favoring `isinstance(x, str)`. – Jonathan B. Jun 13 '19 at 15:55
nice solution!, this does not trim column names if i load df from a csv – csf Jul 08 '22 at 13:53
For future readers: the final `else` part is `else` from the whole part not its adjacent `if`. – zionpi Sep 16 '22 at 03:23

score 15 · Answer 3 · answered Dec 03 '16 at 17:27

15

You can try:

df[0] = df[0].str.strip()

or more specifically for all string columns

non_numeric_columns = list(set(df.columns)-set(df._get_numeric_data().columns))
df[non_numeric_columns] = df[non_numeric_columns].apply(lambda x : str(x).strip())

answered Dec 03 '16 at 17:27

Aakash Makwana

734
5
9

this will fail if you have nans – Tushar Seth May 05 '22 at 13:01

score 13 · Answer 4 · answered Dec 03 '16 at 17:26

13

If you really want to use regex, then

>>> df.replace('(^\s+|\s+$)', '', regex=True, inplace=True)
>>> df
   0   1
0  a  10
1  c   5

But it should be faster to do it like this:

>>> df[0] = df[0].str.strip()

answered Dec 03 '16 at 17:26

Roman Pekar

107,110
28
195
197

Dekel · Answer 5 · 2016-12-03T17:24:27.617

10

You can use the apply function of the Series object:

>>> df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
>>> df[0][0]
'  a  '
>>> df[0] = df[0].apply(lambda x: x.strip())
>>> df[0][0]
'a'

Note the usage of strip and not the regex which is much faster

Another option - use the apply function of the DataFrame object:

>>> df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
>>> df.apply(lambda x: x.apply(lambda y: y.strip() if type(y) == type('') else y), axis=0)

   0   1
0  a  10
1  c   5

edited Dec 03 '16 at 17:24

answered Dec 03 '16 at 17:18

Dekel

60,707
10
101
129

2

`df[0] = df[0].str.strip()` - will, most probably, be faster on bigger DFs – MaxU - stand with Ukraine Dec 03 '16 at 17:24

Jane Kathambi · Answer 6 · 2022-05-20T16:25:35.477

Strip alone does not remove the inner extra spaces in a string. The workaround to this is to first replace one or more spaces with a single space. This ensures that we remove extra inner spaces and outer spaces.

# Import packages
import re 

# First inspect the dtypes of the dataframe
df.dtypes

# First replace one or more spaces with a single space. This ensures that we remove extra inner spaces and outer spaces.
df = df.applymap(lambda x: re.sub('\s+', ' ', x) if isinstance(x, str) else x)


# Then strip leading and trailing white spaces
df = df.apply(lambda x: x.str.strip() if isinstance(x, object) else x)

score 2 · Answer 7 · answered Apr 08 '22 at 07:48

@jezrael answer is looking good. But if you want to get back the other (numeric/integer etc) columns as well in the final result set then you suppose need to merge back with original DataFrame.

If it is the case then you may use this approach,

df = df.apply(lambda x: x.str.strip() if x.dtype.name == 'object' else x, axis=0)

Thanks!

score 1 · Answer 8 · answered Jul 08 '22 at 16:45

Benchmarks for best answers:

bm = Benchmark()
df = pd.read_excel(
    path, 
    sheet_name=advantage_sheet_name, 
    parse_dates=True
)
bm.mark('Loaded')

# @jezrael 's answer (accepted answer)
dfClean_1 = df\
    .select_dtypes(['object'])\
    .apply(lambda x: x.str.strip())
bm.mark('Clean method 1')

# @Jonathan B. answer 
dfClean_2 = df\
    .applymap(lambda x: x.strip() if isinstance(x, str) else x)
bm.mark('Clean method 2')

#@MaxU - stop genocide of UA / @Roman Pekar answer 
dfClean_3 = df\
    .replace(r'\s*(.*?)\s*', r'\1', regex=True)
bm.mark('Clean method 3')

Results

145.734375 - 145.734375 : Loaded
147.765625 - 2.03125 : Clean method 1
155.109375 - 7.34375 : Clean method 2
288.953125 - 133.84375 : Clean method 3

score 0 · Answer 9 · answered Jul 26 '21 at 13:10

0

how about (for string columns)

df[col] = df[col].str.replace(" ","")

never fails

answered Jul 26 '21 at 13:10

Talis

283
3
13

2

This would not only strip the ends of the string but also all the spaces within the string itself – skjerns Nov 23 '22 at 17:56

score -3 · Answer 10 · answered May 09 '18 at 00:16

-3

def trim(x):
    if x.dtype == object:
        x = x.str.split(' ').str[0]
    return(x)

df = df.apply(trim)

answered May 09 '18 at 00:16

hyunwoo jeong

1,534
1
15
14

1

Could you explain what the function is doing please? – CJ Dennis May 09 '18 at 03:46
for example, I encounter data such like this in my daily job: `가나다 봻` left part of blank is what I want, right part is garbage. trim function extract what I want from raw data. – hyunwoo jeong May 09 '18 at 08:17
1

Downvoted because this does not trim the string, it removes everything following the first space. This is not the behaviour asked for in the question, and introduces side-effects that a reader may not be expecting. Moreover, the side-effects may not be immediately apparent. If you are trying to trim a column of Last Names, you might think this is working as intended because most people don't have multiple last names and trailing spaces are yes removed. Then a Portuguese person with two Last Names joins your site and the code trims away their last Last Name, leaving only their first Last Name. – scottclowe Nov 24 '19 at 20:52

Strip / trim all strings of a dataframe

10 Answers10

Money Shot

Full Example

Working Example

Linked

Related