0

I am trying to accomplish a simple task of trimming all whitespace across every column in my dataframe. I have some values that have trailing spaces after words, before words, and some columns that only contain a " " value. I want all of that stripped.

I read this post which gave a great way to accomplish this: data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

However, I frequently get the following:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-31d35db1d48c> in <module>
      1 df = (pd.read_csv('C:\\Users\\wundermahn\Desktop\\aggregated_po_data.csv',
----> 2                     encoding = "ISO-8859-1", low_memory=False).apply(lambda x: x.str.strip() if (x.dtype == "object") else x))
      3 print(df.shape)
      4 
      5 label = df['class']

c:\python367-64\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   6876             kwds=kwds,
   6877         )
-> 6878         return op.get_result()
   6879 
   6880     def applymap(self, func) -> "DataFrame":

c:\python367-64\lib\site-packages\pandas\core\apply.py in get_result(self)
    184             return self.apply_raw()
    185 
--> 186         return self.apply_standard()
    187 
    188     def apply_empty_result(self):

c:\python367-64\lib\site-packages\pandas\core\apply.py in apply_standard(self)
    294             try:
    295                 result = libreduction.compute_reduction(
--> 296                     values, self.f, axis=self.axis, dummy=dummy, labels=labels
    297                 )
    298             except ValueError as err:

pandas\_libs\reduction.pyx in pandas._libs.reduction.compute_reduction()

pandas\_libs\reduction.pyx in pandas._libs.reduction.Reducer.get_result()

<ipython-input-9-31d35db1d48c> in <lambda>(x)
      1 df = (pd.read_csv('C:\\Users\\wundermahn\Desktop\\aggregated_data.csv',
----> 2                     encoding = "ISO-8859-1", low_memory=False).apply(lambda x: x.str.strip() if (x.dtype == "object") else x))
      3 print(df.shape)
      4 
      5 label = df['ON_TIME']

c:\python367-64\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5268             or name in self._accessors
   5269         ):
-> 5270             return object.__getattribute__(self, name)
   5271         else:
   5272             if self._info_axis._can_hold_identifiers_and_holds_name(name):

c:\python367-64\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
    185             # we're accessing the attribute of the class, i.e., Dataset.geo
    186             return self._accessor
--> 187         accessor_obj = self._accessor(obj)
    188         # Replace the property with the accessor object. Inspired by:
    189         # http://www.pydanny.com/cached-property.html

c:\python367-64\lib\site-packages\pandas\core\strings.py in __init__(self, data)
   2039 
   2040     def __init__(self, data):
-> 2041         self._inferred_dtype = self._validate(data)
   2042         self._is_categorical = is_categorical_dtype(data)
   2043         self._is_string = data.dtype.name == "string"

c:\python367-64\lib\site-packages\pandas\core\strings.py in _validate(data)
   2096 
   2097         if inferred_dtype not in allowed_types:
-> 2098             raise AttributeError("Can only use .str accessor with string values!")
   2099         return inferred_dtype
   2100 

**AttributeError: Can only use .str accessor with string values!**

So, in trying to find a workaround, I stumbled upon this post, which suggests using:

data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "str" else x)

But, that doesn't strip away empty cells that just contain spaces or tabs.

How can I efficiently strip away all variants of white space? I ultimately am going to drop columns with more than 50% null values.

artemis
  • 6,857
  • 11
  • 46
  • 99
  • Do you know in advance what's the type of your columns? – Riccardo Bucco Apr 02 '20 at 14:02
  • I do not. That's the issue -- the dataframe is sort of given to me in the form of a `sql` query. I don't know much about the data at all. And the shape is about `(401801, 267)`, so it is quite cumbersome to attempt to go through column by column. I am sorry @RiccardoBucco – artemis Apr 02 '20 at 14:29

3 Answers3

1

You can try try instead:

def trim(x):
    try:
        return x.str.strip()
    except:
        return x

df = df.apply(trim)
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
1

You have to check not column type, but the type of each individual value, so the code can be e.g.:

df.applymap(lambda x: x.strip() if type(x) == str else x)

The reason is that:

  • there can be a column of object type,
  • containing a string in almost all cells,
  • but some of them can be NaN which is a special case of float, hence you can not call strip on it.

But this way you execute unnecessarily the code for columns of types other than object, where nothing will be changed. If this bothers you, run this code only for columns in which it is likely to change anything:

cols = df.select_dtypes(include='object').columns
df[cols] = df[cols].applymap(lambda x: x.strip() if type(x) == str else x)
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
0

Using select_dtypes first to select correct columns:

# example dataframe
df = pd.DataFrame({'col1':[1,2,3],
                   'col2':list('abc'),
                   'col3':[4.0, 5.0, 6.0],
                   'col4':[' foo', '   bar', 'foobar. ']})

   col1 col2  col3      col4
0     1    a   4.0       foo
1     2    b   5.0       bar
2     3    c   6.0  foobar. 
str_cols = df.select_dtypes('object').columns
df[str_cols] = df[str_cols].apply(lambda x: x.str.strip())

print(df)
   col1 col2  col3     col4
0     1    a   4.0      foo
1     2    b   5.0      bar
2     3    c   6.0  foobar.
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • Your approach is similar to OP's. It will fail if there is a column with, e.g., list values. – Quang Hoang Apr 02 '20 at 14:12
  • This is a nice attempt! I didn't approach it this way, but what I found is that `object` columns contain different values. For example, sometimes `A` and sometimes 1 and sometimes `" "` – artemis Apr 02 '20 at 14:30