Pythonic/efficient way to strip whitespace from every Pandas Data frame cell that has a stringlike object in it

Question

I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the stringlike cells, leaving the other cells unchanged in Python 2.7.

Here is what I'm doing:

def remove_whitespace( x ):
    if isinstance( x, basestring ):
        return x.strip()
    else:
        return x

my_data = my_data.applymap( remove_whitespace )

Is there a better or more idiomatic to Pandas way to do this?

Is there a more efficient way (perhaps by doing things column wise)?

I've tried searching for a definitive answer, but most questions on this topic seem to be how to strip whitespace from the column names themselves, or presume the cells are all strings.

What would happen if you were to do x.strip() on an element that is not an instance of a basestring? If there aren't any downsides maybe you could remove the check and replace it with a try and except block. That might speed things up. — Jeff Mandell, Nov 18 '15 at 19:45
Are you using `pandas.read_csv` (http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.read_csv.html)? Have you tried `skipinitialspace=True`? — Warren Weckesser, Nov 18 '15 at 19:48
@WarrenWeckesser I might have whitespace both before and after the data I care about in the cell, so I think skipinitialspace only fixes half the problem. — deadcode, Nov 18 '15 at 21:10
@JeffMandell - attempting to call .strip() on a non-stringlike object (or thing that doesn't have a strip method, such as a numeric data type) will raise an exception. You're right that handling the exception could be faster than doing the check - depending on the data and frequency of exceptions. — deadcode, Nov 18 '15 at 21:11
@deadcode The one concern with the solution that I offered is that an object other than a base string might have the strip() function. This could lead to a hard to notice bug. — Jeff Mandell, Nov 18 '15 at 21:36
This question looks very similar to, and was imho answered in a great way in https://stackoverflow.com/questions/40950310/strip-trim-all-values-of-a-dataframe — elke, Nov 12 '17 at 17:49
NameError: name 'basestring' is not defined. "The builtin basestring abstract type was removed [in python3]. Use str instead." — user2514157, Apr 09 '20 at 00:10

score 106 · Answer 1 · answered Jul 27 '17 at 15:50

106

Stumbled onto this question while looking for a quick and minimalistic snippet I could use. Had to assemble one myself from posts above. Maybe someone will find it useful:

data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

answered Jul 27 '17 at 15:50

Adam Owczarczyk

2,802
1
16
21

6

This did the job. In case even your headers have trailing spaces, add a `.rename(columns=lambda x: x.strip())` from https://stackoverflow.com/a/21607530/4355695 – Nikhil VJ Apr 14 '18 at 07:26
2

This method failed for me if Pandas labeled a column as `object`. Pandas will label columns with mixed datatypes as `object` which doesn't necessarily mean that every element in that column is a `str`. – Michael Silverstein Oct 31 '18 at 18:27
Michael, for truly mixed types, you can use OP method. – Adam Owczarczyk Nov 08 '18 at 12:34
1

maybe isinstance(x, str) would be a better solution to check if it a string or not... – hunsnowboarder Nov 16 '18 at 10:36
Another method is to use try except in a named function. def strip(x): result = x try: result = x.astype(unicode).str.strip() except AttributeError: pass return result data_frame_trimmed = data_frame.apply(strip) – Dayong May 08 '19 at 21:27
2

@MichaelSilverstein In many cases, you can first convert all columns to numeric with the `errors='ignore'` argument (this ignores any column that has values that would cause an error when casting it to numeric): `df.apply(pd.to_numeric, errors='ignore')`. Then you can more safely use this snippet to strip whitespace. – cdabel Oct 22 '19 at 20:07

score 46 · Answer 2 · answered Nov 18 '15 at 20:02

You could use pandas' Series.str.strip() method to do this quickly for each string-like column:

>>> data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
>>> data
      values
0     ABC   
1        DEF
2      GHI  

>>> data['values'].str.strip()
0    ABC
1    DEF
2    GHI
Name: values, dtype: object

score 17 · Answer 3 · answered Oct 31 '18 at 18:29

17

We want to:

Apply our function to each element in our dataframe - use applymap.
Use type(x)==str (versus x.dtype == 'object') because Pandas will label columns as object for columns of mixed datatypes (an object column may contain int and/or str).
Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).

Therefore, I've found the following to be the easiest:

df.applymap(lambda x: x.strip() if type(x)==str else x)

answered Oct 31 '18 at 18:29

Michael Silverstein

1,653
15
17

1

An alternative syntax for checking datatypes: `isinstance(x, str)` instead of `type(x)==str` – Michael Silverstein Jul 28 '20 at 13:51
This seems to provide adequate level of precision without being overly aggressive and potentially destructive. – Sn3akyP3t3 Aug 06 '20 at 14:39

Warren Weckesser · Answer 4 · 2015-11-18T21:45:56.117

When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.

For example, here's "data.csv":

In [19]: !cat data.csv
1.5, aaa,  bbb ,  ddd     , 10 ,  XXX   
2.5, eee, fff  ,       ggg, 20 ,     YYY

(The first line ends with three spaces after XXX, while the second line ends at the last Y.)

The following uses pandas.read_csv() to read the files, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)

In [20]: import pandas as pd

In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')

In [22]: df
Out[22]: 
     0    1    2    3   4    5
0  1.5  aaa  bbb  ddd  10  XXX
1  2.5  eee  fff  ggg  20  YYY

This is very close, but has a problem with quoted input values, such as this: "a, b, c ",d It can't pull the spaces off the first field which I would want stored in the dataframe's cell as 'a, b, c' — deadcode, Nov 19 '15 at 21:37

score 4 · Answer 5 · answered May 02 '17 at 18:37

The "data['values'].str.strip()" answer above did not work for me, but I found a simple work around. I am sure there is a better way to do this. The str.strip() function works on Series. Thus, I converted the dataframe column into a Series, stripped the whitespace, replaced the converted column back into the dataframe. Below is the example code.

import pandas as pd
data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
print ('-----')
print (data)

data['values'].str.strip()
print ('-----')
print (data)

new = pd.Series([])
new = data['values'].str.strip()
data['values'] = new
print ('-----')
print (new)

Blake · Answer 6 · 2017-06-24T20:40:54.470

Here is a column-wise solution with pandas apply:

import numpy as np

def strip_obj(col):
    if col.dtypes == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

df = df.apply(strip_obj, axis=0)

This will convert values in object type columns to string. Should take caution with mixed-type columns. For example if your column is zip codes with 20001 and ' 21110 ' you will end up with '20001' and '21110'.

score 1 · Answer 7 · answered Jul 12 '19 at 18:46

1

This worked for me - applies it to the whole dataframe:

def panda_strip(x):
    r =[]
    for y in x:
        if isinstance(y, str):
            y = y.strip()

        r.append(y)
    return pd.Series(r)

df = df.apply(lambda x: panda_strip(x))

answered Jul 12 '19 at 18:46

Saul Frank

395
5
15

1

`df.apply(panda_strip)` is more succinct. – Adam Wheeler Jul 01 '20 at 18:18

score 0 · Answer 8 · answered May 02 '18 at 20:07

I found the following code useful and something that would likely help others. This snippet will allow you to delete spaces in a column as well as in the entire DataFrame, depending on your use case.

import pandas as pd

def remove_whitespace(x):
    try:
        # remove spaces inside and outside of string
        x = "".join(x.split())

    except:
        pass
    return x

# Apply remove_whitespace to column only
df.orderId = df.orderId.apply(remove_whitespace)
print(df)


# Apply to remove_whitespace to entire Dataframe
df = df.applymap(remove_whitespace)
print(df)

Pythonic/efficient way to strip whitespace from every Pandas Data frame cell that has a stringlike object in it

8 Answers8

Linked

Related