68

I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the stringlike cells, leaving the other cells unchanged in Python 2.7.

Here is what I'm doing:

def remove_whitespace( x ):
    if isinstance( x, basestring ):
        return x.strip()
    else:
        return x

my_data = my_data.applymap( remove_whitespace )

Is there a better or more idiomatic to Pandas way to do this?

Is there a more efficient way (perhaps by doing things column wise)?

I've tried searching for a definitive answer, but most questions on this topic seem to be how to strip whitespace from the column names themselves, or presume the cells are all strings.

Anton Protopopov
  • 30,354
  • 12
  • 88
  • 93
deadcode
  • 2,226
  • 1
  • 20
  • 29
  • 1
    What would happen if you were to do x.strip() on an element that is not an instance of a basestring? If there aren't any downsides maybe you could remove the check and replace it with a try and except block. That might speed things up. – Jeff Mandell Nov 18 '15 at 19:45
  • 3
    Are you using `pandas.read_csv` (http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.read_csv.html)? Have you tried `skipinitialspace=True`? – Warren Weckesser Nov 18 '15 at 19:48
  • @WarrenWeckesser I might have whitespace both before and after the data I care about in the cell, so I think skipinitialspace only fixes half the problem. – deadcode Nov 18 '15 at 21:10
  • 1
    @JeffMandell - attempting to call .strip() on a non-stringlike object (or thing that doesn't have a strip method, such as a numeric data type) will raise an exception. You're right that handling the exception could be faster than doing the check - depending on the data and frequency of exceptions. – deadcode Nov 18 '15 at 21:11
  • @deadcode I was afraid that might be the case. – Warren Weckesser Nov 18 '15 at 21:23
  • @deadcode The one concern with the solution that I offered is that an object other than a base string might have the strip() function. This could lead to a hard to notice bug. – Jeff Mandell Nov 18 '15 at 21:36
  • This question looks very similar to, and was imho answered in a great way in https://stackoverflow.com/questions/40950310/strip-trim-all-values-of-a-dataframe – elke Nov 12 '17 at 17:49
  • NameError: name 'basestring' is not defined. "The builtin basestring abstract type was removed [in python3]. Use str instead." – user2514157 Apr 09 '20 at 00:10

8 Answers8

106

Stumbled onto this question while looking for a quick and minimalistic snippet I could use. Had to assemble one myself from posts above. Maybe someone will find it useful:

data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
Adam Owczarczyk
  • 2,802
  • 1
  • 16
  • 21
  • 6
    This did the job. In case even your headers have trailing spaces, add a `.rename(columns=lambda x: x.strip())` from https://stackoverflow.com/a/21607530/4355695 – Nikhil VJ Apr 14 '18 at 07:26
  • 2
    This method failed for me if Pandas labeled a column as `object`. Pandas will label columns with mixed datatypes as `object` which doesn't necessarily mean that every element in that column is a `str`. – Michael Silverstein Oct 31 '18 at 18:27
  • Michael, for truly mixed types, you can use OP method. – Adam Owczarczyk Nov 08 '18 at 12:34
  • 1
    maybe isinstance(x, str) would be a better solution to check if it a string or not... – hunsnowboarder Nov 16 '18 at 10:36
  • Another method is to use try except in a named function. def strip(x): result = x try: result = x.astype(unicode).str.strip() except AttributeError: pass return result data_frame_trimmed = data_frame.apply(strip) – Dayong May 08 '19 at 21:27
  • 2
    @MichaelSilverstein In many cases, you can first convert all columns to numeric with the `errors='ignore'` argument (this ignores any column that has values that would cause an error when casting it to numeric): `df.apply(pd.to_numeric, errors='ignore')`. Then you can more safely use this snippet to strip whitespace. – cdabel Oct 22 '19 at 20:07
46

You could use pandas' Series.str.strip() method to do this quickly for each string-like column:

>>> data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
>>> data
      values
0     ABC   
1        DEF
2      GHI  

>>> data['values'].str.strip()
0    ABC
1    DEF
2    GHI
Name: values, dtype: object
jakevdp
  • 77,104
  • 11
  • 125
  • 160
17

We want to:

  1. Apply our function to each element in our dataframe - use applymap.

  2. Use type(x)==str (versus x.dtype == 'object') because Pandas will label columns as object for columns of mixed datatypes (an object column may contain int and/or str).

  3. Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).

Therefore, I've found the following to be the easiest:

df.applymap(lambda x: x.strip() if type(x)==str else x)

Michael Silverstein
  • 1,653
  • 15
  • 17
13

When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.

For example, here's "data.csv":

In [19]: !cat data.csv
1.5, aaa,  bbb ,  ddd     , 10 ,  XXX   
2.5, eee, fff  ,       ggg, 20 ,     YYY

(The first line ends with three spaces after XXX, while the second line ends at the last Y.)

The following uses pandas.read_csv() to read the files, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)

In [20]: import pandas as pd

In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')

In [22]: df
Out[22]: 
     0    1    2    3   4    5
0  1.5  aaa  bbb  ddd  10  XXX
1  2.5  eee  fff  ggg  20  YYY
Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214
  • 2
    This is very close, but has a problem with quoted input values, such as this: "a, b, c ",d It can't pull the spaces off the first field which I would want stored in the dataframe's cell as 'a, b, c' – deadcode Nov 19 '15 at 21:37
4

The "data['values'].str.strip()" answer above did not work for me, but I found a simple work around. I am sure there is a better way to do this. The str.strip() function works on Series. Thus, I converted the dataframe column into a Series, stripped the whitespace, replaced the converted column back into the dataframe. Below is the example code.

import pandas as pd
data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
print ('-----')
print (data)

data['values'].str.strip()
print ('-----')
print (data)

new = pd.Series([])
new = data['values'].str.strip()
data['values'] = new
print ('-----')
print (new)
S. Herron
  • 69
  • 2
1

Here is a column-wise solution with pandas apply:

import numpy as np

def strip_obj(col):
    if col.dtypes == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

df = df.apply(strip_obj, axis=0)

This will convert values in object type columns to string. Should take caution with mixed-type columns. For example if your column is zip codes with 20001 and ' 21110 ' you will end up with '20001' and '21110'.

Blake
  • 46
  • 3
1

This worked for me - applies it to the whole dataframe:

def panda_strip(x):
    r =[]
    for y in x:
        if isinstance(y, str):
            y = y.strip()

        r.append(y)
    return pd.Series(r)

df = df.apply(lambda x: panda_strip(x))
Saul Frank
  • 395
  • 5
  • 15
0

I found the following code useful and something that would likely help others. This snippet will allow you to delete spaces in a column as well as in the entire DataFrame, depending on your use case.

import pandas as pd

def remove_whitespace(x):
    try:
        # remove spaces inside and outside of string
        x = "".join(x.split())

    except:
        pass
    return x

# Apply remove_whitespace to column only
df.orderId = df.orderId.apply(remove_whitespace)
print(df)


# Apply to remove_whitespace to entire Dataframe
df = df.applymap(remove_whitespace)
print(df)
FunnyChef
  • 1,880
  • 3
  • 18
  • 30