This is very closely related to "Removing space from columns in pandas", so I wasn't sure whether to add it as a comment there. The difference is that my question specifically concerns using a loc locator to slice out a subset of rows.

df['py'] = df['py'].str.replace(' ','') 

This works fine, but when I only want to apply it to the subset of rows where column is 'foo':

df.loc[df['column'] == 'foo']['py'] = df.loc[df['column'] == 'foo']['py'].str.replace(' ','')

...doesn't work.

What am I doing wrong? I can always slice out the group and re-append it, but I'm curious where I'm going wrong here.

A dataset for trials:

df = pd.DataFrame({'column':['foo','foo','bar','bar'], 'py':['a b','a b','a b','a b']})

Thanks

BAC83
    You should be getting a huge red warning explaining that the issue is chained **assignment** `][`. You need to assign properly with `df.loc[df['column'] == 'foo', 'py'] = ...`. (On the RHS you are only _selecting_, so the chaining is _okay_ there and doesn't cause problems, but as a best practice select within a single loc call there too.) – ALollz Oct 06 '21 at 14:41
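
The single-indexer assignment described in this comment can be sketched on the dataset from the question. This is a minimal illustration, not the commenter's own code; the `mask` variable name is introduced here for readability, and `regex=False` is passed explicitly so the space is treated as a literal string.

```python
import pandas as pd

df = pd.DataFrame({'column': ['foo', 'foo', 'bar', 'bar'],
                   'py': ['a b', 'a b', 'a b', 'a b']})

# Build the boolean row mask once and reuse it on both sides.
mask = df['column'] == 'foo'

# One .loc call with (rows, column): the assignment targets df itself,
# not a temporary object produced by a first indexing step.
df.loc[mask, 'py'] = df.loc[mask, 'py'].str.replace(' ', '', regex=False)

print(df)
```

Only the 'foo' rows should lose their spaces; the 'bar' rows keep 'a b'.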

2 Answers

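You want:

df.loc[df['column'] == 'foo', 'py'] = df.loc[df['column'] == 'foo', 'py'].apply(lambda x: x.replace(' ',''))

Note the notation of loc: the row mask and the column label go inside a single indexer, loc[rows, column], rather than two chained brackets.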

ALollz
vtasca
  • I don't like `apply()` for performance reasons. – Freek Wiekmeijer Oct 06 '21 at 14:46
  • @FreekWiekmeijer the `.str` accessor operations themselves are essentially loops, so there's little difference between an apply and the Series.str operations (in contrast to most of the vectorized math operations, where `.apply` is to be avoided at all costs). For reference: https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care – ALollz Oct 06 '21 at 14:48
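
To make the comparison in these comments concrete, here is a small sketch showing that the two spellings produce the same values on clean string data. The sample Series and the names `via_str` and `via_apply` are made up for illustration, and the snippet makes no claim about timing.

```python
import pandas as pd

s = pd.Series(['a b', ' c d ', 'ef'])

# Element-wise replacement through the .str accessor (literal match).
via_str = s.str.replace(' ', '', regex=False)

# The same replacement through apply with Python's built-in str.replace.
via_apply = s.apply(lambda x: x.replace(' ', ''))

print(via_str.tolist() == via_apply.tolist())
```

One practical difference worth keeping in mind: the `.str` accessor passes missing values through, whereas the lambda above would raise an AttributeError on a float NaN.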

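The pandas .str accessor also supports regex replacement. Pass regex=True explicitly, because since pandas 2.0 str.replace treats the pattern as a literal string by default:

>>> pd.DataFrame({"column_1": ["hello ", " world", "space in the middle", "two  spaces", "one\ttab"]}).column_1.str.replace(r"\s+", "", regex=True)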

0               hello
1               world
2    spaceinthemiddle
3           twospaces
4              onetab

Combine that with numpy.where() and I think you have what you need.

np.where(
   <condition>,  # boolean mask selecting the rows to edit
   df[column_name].str.replace(r"\s+", "", regex=True),  # the substitution applied to those rows
   df[column_name]  # the unchanged value kept for all other rows
)
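
A filled-in version of the template above might look like the following. It is a sketch under a few assumptions: the condition is the `column == 'foo'` test from the question, the sample values are altered slightly to include a double space and a tab, and the result is assigned back to the column because `np.where` returns a new array rather than modifying the DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column': ['foo', 'foo', 'bar', 'bar'],
                   'py': ['a  b', 'a\tb', 'a b', 'a b']})

df['py'] = np.where(
    df['column'] == 'foo',                         # rows to edit
    df['py'].str.replace(r'\s+', '', regex=True),  # whitespace stripped
    df['py'],                                      # other rows unchanged
)

print(df)
```

Note that the replacement is computed for every row and then discarded for the rows where the condition is False, so this trades a little extra work for a single expression.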
Freek Wiekmeijer