156

Cleaning the values of a multitype data frame in python/pandas, I want to trim the strings. I am currently doing it in two instructions :

import pandas as pd

df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])

df.replace('^\s+', '', regex=True, inplace=True) #front
df.replace('\s+$', '', regex=True, inplace=True) #end

df.values

This is quite slow, what could I improve ?

mxdbld
  • 16,747
  • 5
  • 34
  • 37

10 Answers10

271

You can use DataFrame.select_dtypes to select string columns and then apply function str.strip.

Notice: Values cannot be types like dicts or lists, because their dtypes is object.

df_obj = df.select_dtypes(['object'])
print (df_obj)
0    a  
1    c  

df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
print (df)

   0   1
0  a  10
1  c   5

But if there are only a few columns use str.strip:

df[0] = df[0].str.strip()
iacob
  • 20,084
  • 6
  • 92
  • 119
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 2
    And SettingWithCopyWarning should be ignored in this case as explained https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas?answertab=oldest#tab-top – Hrvoje Dec 21 '18 at 09:09
  • 2
    If you have strings such as N/A you will want to add the parameter na_action="ignore") when doing df_obj.apply, or else pandas will convert those values to empty strings – Justin Furuness Jun 21 '22 at 22:56
136

Money Shot

Here's a compact version of using applymap with a straightforward lambda expression to call strip only when the value is of a string type:

df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

Full Example

A more complete example:

import pandas as pd


def trim_all_columns(df):
    """
    Trim whitespace from ends of each value across all series in dataframe
    """
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    return df.applymap(trim_strings)


# simple example of trimming whitespace from data elements
df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
df = trim_all_columns(df)
print(df)


>>>
   0   1
0  a  10
1  c   5

Working Example

Here's a working example hosted by trinket: https://trinket.io/python3/e6ab7fb4ab

Jonathan B.
  • 2,742
  • 1
  • 21
  • 18
  • 1
    Hi @DaleKube ... I just tried this fresh on a new machine just as a sanity check and I get the same results as posted in the answer. Can you confirm whether you are using Python2 or Python3? I'm only using Python3 these days, but perhaps that might be a factor. If so, I'll note that in my posted answer if you are able to confirm. Thanks! – Jonathan B. Nov 21 '17 at 21:40
  • 1
    I deleted my comment. I found a bug in my code, and I can confirm that it now works like a charm. FYI, I am using Python 3. Sorry for the trouble. – Dale Kube Nov 22 '17 at 17:14
  • 1
    you should use `type(x) == str`, not `type(x) is str` – fjsj Jun 12 '19 at 20:09
  • 1
    @fjsj Thanks for the nudge. I've updated the example using PEP8 guidance favoring `isinstance(x, str)`. – Jonathan B. Jun 13 '19 at 15:55
  • nice solution!, this does not trim column names if i load df from a csv – csf Jul 08 '22 at 13:53
  • For future readers: the final `else` part is `else` from the whole part not its adjacent `if`. – zionpi Sep 16 '22 at 03:23
15

You can try:

df[0] = df[0].str.strip()

or more specifically for all string columns

non_numeric_columns = list(set(df.columns)-set(df._get_numeric_data().columns))
df[non_numeric_columns] = df[non_numeric_columns].apply(lambda x : str(x).strip())
Aakash Makwana
  • 734
  • 5
  • 9
13

If you really want to use regex, then

>>> df.replace('(^\s+|\s+$)', '', regex=True, inplace=True)
>>> df
   0   1
0  a  10
1  c   5

But it should be faster to do it like this:

>>> df[0] = df[0].str.strip()
Roman Pekar
  • 107,110
  • 28
  • 195
  • 197
10

You can use the apply function of the Series object:

>>> df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
>>> df[0][0]
'  a  '
>>> df[0] = df[0].apply(lambda x: x.strip())
>>> df[0][0]
'a'

Note the usage of strip and not the regex which is much faster

Another option - use the apply function of the DataFrame object:

>>> df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
>>> df.apply(lambda x: x.apply(lambda y: y.strip() if type(y) == type('') else y), axis=0)

   0   1
0  a  10
1  c   5
Dekel
  • 60,707
  • 10
  • 101
  • 129
3

Strip alone does not remove the inner extra spaces in a string. The workaround to this is to first replace one or more spaces with a single space. This ensures that we remove extra inner spaces and outer spaces.

# Import packages
import re 

# First inspect the dtypes of the dataframe
df.dtypes

# First replace one or more spaces with a single space. This ensures that we remove extra inner spaces and outer spaces.
df = df.applymap(lambda x: re.sub('\s+', ' ', x) if isinstance(x, str) else x)


# Then strip leading and trailing white spaces
df = df.apply(lambda x: x.str.strip() if isinstance(x, object) else x)
Jane Kathambi
  • 695
  • 6
  • 8
2

@jezrael answer is looking good. But if you want to get back the other (numeric/integer etc) columns as well in the final result set then you suppose need to merge back with original DataFrame.

If it is the case then you may use this approach,

df = df.apply(lambda x: x.str.strip() if x.dtype.name == 'object' else x, axis=0)

Thanks!

Jai K
  • 375
  • 1
  • 4
  • 12
1

Benchmarks for best answers:

bm = Benchmark()
df = pd.read_excel(
    path, 
    sheet_name=advantage_sheet_name, 
    parse_dates=True
)
bm.mark('Loaded')

# @jezrael 's answer (accepted answer)
dfClean_1 = df\
    .select_dtypes(['object'])\
    .apply(lambda x: x.str.strip())
bm.mark('Clean method 1')

# @Jonathan B. answer 
dfClean_2 = df\
    .applymap(lambda x: x.strip() if isinstance(x, str) else x)
bm.mark('Clean method 2')

#@MaxU - stop genocide of UA / @Roman Pekar answer 
dfClean_3 = df\
    .replace(r'\s*(.*?)\s*', r'\1', regex=True)
bm.mark('Clean method 3')

Results

145.734375 - 145.734375 : Loaded
147.765625 - 2.03125 : Clean method 1
155.109375 - 7.34375 : Clean method 2
288.953125 - 133.84375 : Clean method 3
Jamie Marshall
  • 1,885
  • 3
  • 27
  • 50
0

how about (for string columns)

df[col] = df[col].str.replace(" ","")

never fails

Talis
  • 283
  • 3
  • 13
  • 2
    This would not only strip the ends of the string but also all the spaces within the string itself – skjerns Nov 23 '22 at 17:56
-3
def trim(x):
    if x.dtype == object:
        x = x.str.split(' ').str[0]
    return(x)

df = df.apply(trim)
hyunwoo jeong
  • 1,534
  • 1
  • 15
  • 14
  • 1
    Could you explain what the function is doing please? – CJ Dennis May 09 '18 at 03:46
  • for example, I encounter data such like this in my daily job: `가나다 봻` left part of blank is what I want, right part is garbage. trim function extract what I want from raw data. – hyunwoo jeong May 09 '18 at 08:17
  • 1
    Downvoted because this does not trim the string, it removes everything following the first space. This is not the behaviour asked for in the question, and introduces side-effects that a reader may not be expecting. Moreover, the side-effects may not be immediately apparent. If you are trying to trim a column of Last Names, you might think this is working as intended because most people don't have multiple last names and trailing spaces are yes removed. Then a Portuguese person with two Last Names joins your site and the code trims away their last Last Name, leaving only their first Last Name. – scottclowe Nov 24 '19 at 20:52