0

I am trying to loop through a column in a pandas data frame to remove unnecessary white space in the beginning and end of the strings within the column. My data frame looks like this:

df={'c1': [' ab', 'fg', 'ac ', 'hj-jk ', ' ac', 'df, gh', 'gh', 'ab', 'ad', 'jk-pl', 'ae', 'kl-kl '], 'b2': ['ba', 'bc', 'bd', 'be', 'be', 'be', 'ba'] }


    c1  b2
0   ab, fg
1   ac, hj-jk   
2   ac, df,gh   
3   gh, be
4   ab, be
5   ad, jk-pl
6   ae, kl-kl   

I tried the this answer here, but did not work either. The reason I need to remove the white space from the strings in this column is that I want to one hot encode this column using get.dummies() function. My idea was to use the strip() function to remove the white space from each value and then I use .str.get_dummies(','):

#function to remove white space from strings
def strip_string(dataframe, column_name):
  for id, item in dataframe[column_name].items():
    a=item.strip()

#removing the white space from the values of the column
strip_string(df, 'c1')

#creating one hot-encoded columns from the values using split(",")

df1=df['c1'].str.get_dummies(',')

but my code returns duplicate columns and I don't want this...I suppose the function to remove the white space is not working well? Can anyone help? My current output is:

   ab   ac  df  fg  gh  hj-jk   jk-pl   kl-kl   ab  ac  ad  ae  gh
0   1   0   0   1   0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   1   0   0   0   1   0   0   0
2   0   1   1   0   1   0   0   0   0   0   0   0   0
3   0   0   0   0   0   0   0   0   0   0   0   0   1
4   0   0   0   0   0   0   0   0   1   0   0   0   0
5   0   0   0   0   0   0   1   0   0   0   1   0   0
6   0   0   0   0   0   0   0   1   0   0   0   1   0

columns 'ac' and 'ab' are duplicated. I want to remove the duplicated columns

Natália Resende
  • 309
  • 1
  • 17
  • Does this answer your question? [Is there a way to trim/strip whitespace in multiple columns of a pandas dataframe?](https://stackoverflow.com/questions/58287398/is-there-a-way-to-trim-strip-whitespace-in-multiple-columns-of-a-pandas-datafram) – Rodalm Jun 18 '22 at 18:37
  • `df['c1'].apply(str.strip)?` – SomeDude Jun 18 '22 at 18:42
  • 1
    I think this is a better answer: https://stackoverflow.com/questions/40950310/strip-trim-all-strings-of-a-dataframe – nonDucor Jun 18 '22 at 18:48
  • @nonDucor I agree, please mark it as duplicate also. Questions like this shouldn't be answered but closed. it's clearly a common problem and a duplicate question easy to find... – Rodalm Jun 18 '22 at 18:52
  • 1
    Does this answer your question? [Strip / trim all strings of a dataframe](https://stackoverflow.com/questions/40950310/strip-trim-all-strings-of-a-dataframe) – nonDucor Jun 18 '22 at 18:56
  • Natalia, please see the answer I posted above. It explains the warning you are seeing as well. – nonDucor Jun 18 '22 at 18:57
  • @Rodalm The use case for this question involves `Series.str.get_dummies()`. To make it work properly actually may require more that simply a call to `strip()` in one form or another. Perhaps the question heading/phrasing could be modified to better highlight this, but fundamentally what it's trying to solve is not the same as the earlier questions (unless I've missed a reference to `get_dummies()` therein). – constantstranger Jun 18 '22 at 19:19
  • @constantstranger I understand your point of view and agree with your suggestions of better rephrasing. However, the fundamental problem that OP wants to solve is regarding the `strip` part, as the title clearly states, which is the source of the error, as the OP recognizes. The problem description itself is more focused on the stripping part, and only adds the one-hot encoding part to give context (which doesn't give the OP any trouble). So the issue is fundamentally identical to those posted. – Rodalm Jun 18 '22 at 19:42
  • But I won't argue, and I agree that there are questions that are more clearly identified as duplicates than this one. – Rodalm Jun 18 '22 at 19:43

3 Answers3

1

UPDATED:

I think you need to handle spaces around commas as well as at the start/end of a string in order for Series.str.get_dummies() to work correctly for your example:

df = df.apply(lambda x: x.str.strip().str.replace(' *, *', ','))

Input:

        c1   b2
0       ab  foo
1       fg  foo
2      ac   foo
3   hj-jk   foo
4       ac  foo
5   df, gh  foo
6       gh  foo
7       ab  foo
8       ad  foo
9    jk-pl  foo
10      ae  foo
11  kl-kl   foo

Intermediate dataframe (after removing spaces at start and end and adjacent to commas):

       c1   b2
0      ab  foo
1      fg  foo
2      ac  foo
3   hj-jk  foo
4      ac  foo
5   df,gh  foo
6      gh  foo
7      ab  foo
8      ad  foo
9   jk-pl  foo
10     ae  foo
11  kl-kl  foo

Output:

    ab  ac  ad  ae  df  fg  gh  hj-jk  jk-pl  kl-kl
0    1   0   0   0   0   0   0      0      0      0
1    0   0   0   0   0   1   0      0      0      0
2    0   1   0   0   0   0   0      0      0      0
3    0   0   0   0   0   0   0      1      0      0
4    0   1   0   0   0   0   0      0      0      0
5    0   0   0   0   1   0   1      0      0      0
6    0   0   0   0   0   0   1      0      0      0
7    1   0   0   0   0   0   0      0      0      0
8    0   0   1   0   0   0   0      0      0      0
9    0   0   0   0   0   0   0      0      1      0
10   0   0   0   1   0   0   0      0      0      0
11   0   0   0   0   0   0   0      0      0      1

If you just use strip() (as in my earlier answer below), you will get something like this with a duplicate for gh:

     gh  ab  ac  ad  ae  df  fg  gh  hj-jk  jk-pl  kl-kl
0     0   1   0   0   0   0   0   0      0      0      0
1     0   0   0   0   0   0   1   0      0      0      0
2     0   0   1   0   0   0   0   0      0      0      0
3     0   0   0   0   0   0   0   0      1      0      0
4     0   0   1   0   0   0   0   0      0      0      0
5     1   0   0   0   0   1   0   0      0      0      0
6     0   0   0   0   0   0   0   1      0      0      0
7     0   1   0   0   0   0   0   0      0      0      0
8     0   0   0   1   0   0   0   0      0      0      0
9     0   0   0   0   0   0   0   0      0      1      0
10    0   0   0   0   1   0   0   0      0      0      0
11    0   0   0   0   0   0   0   0      0      0      1

EARLIER ANSWER:

Either of the following should work:

df = df.applymap(lambda x: x.strip())

... or:

df = df.apply(lambda x: x.str.strip())
constantstranger
  • 9,176
  • 2
  • 5
  • 19
  • this does not work in my case because I have strings with white space between words – Natália Resende Jun 18 '22 at 18:57
  • @Natália Resende I have made another update to my answer, as simply using `strip()` potentially leaves duplicate columns (such as `gh`). I now also remove spaces on either side of commas. – constantstranger Jun 18 '22 at 19:22
1

I would stack, strip, get_dummies, and groupby.max:

If the separator is ', ':

df.stack().str.strip().str.get_dummies(sep=', ').groupby(level=0).max()

else:

df.stack().str.replace(r'\s', '', regex=True).str.get_dummies(sep=',').groupby(level=0).max()

output:

   ab  ac  ba  bc  bd  be  df  fg  gh  hj-jk
0   1   0   1   0   0   0   0   0   0      0
1   0   0   0   1   0   0   0   1   0      0
2   0   1   0   0   1   0   0   0   0      0
3   0   0   0   0   0   1   0   0   0      1
4   0   1   0   0   0   1   0   0   0      0
5   0   0   0   0   0   1   1   0   1      0
6   0   0   1   0   0   0   0   0   1      0
mozway
  • 194,879
  • 13
  • 39
  • 75
  • In your solution, `gh` is duplicated. – constantstranger Jun 18 '22 at 19:14
  • @constantstranger good point, then I guess either use `', '` as separator or remove all spaces – mozway Jun 18 '22 at 19:17
  • It's potentially tricky because, according to OP in a comment on my answer, there are "strings with white space between words". I ended up updating mine to do strip() as well as regex removal of spaces on either side of commas. But perhaps there are other ways ... – constantstranger Jun 18 '22 at 19:20
  • @constantstranger then a regex to remove spaces only around comma or string start/end. In any case a better example should be provided – mozway Jun 18 '22 at 19:25
0

See if this helps:

import numpy as np
import pandas as pd
data={'c1': [' ab ', 'fg', 'ac ', 'hj-jk '], 'b2': ['ba', 'bc', 'bd', 'be'] }
df=pd.DataFrame(data)
print(df.head())
df=df.apply(lambda x: x.map(str.strip))
print(df.head())
teedak8s
  • 732
  • 5
  • 12