
I would like to remove the first 3 characters from strings in a DataFrame column where the length of the string is > 4.

Otherwise they should remain the same.

E.g.:

bloomberg_ticker_y

AIM9
DJEM9 # (should be M9)
FAM9
IXPM9 # (should be M9)

I can filter the strings by length:

merged['bloomberg_ticker_y'].str.len() > 4

and slice the strings:

merged['bloomberg_ticker_y'].str[-2:]

But I'm not sure how to put these together and apply them to my DataFrame.

Any help would be appreciated.
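For reference, the two pieces above can be combined directly with boolean indexing and `.loc` (a sketch on a small frame mirroring the sample tickers; the real `merged` DataFrame is assumed to hold plain strings):

```python
import pandas as pd

# Sample data mirroring the question's examples
merged = pd.DataFrame({'bloomberg_ticker_y': ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})

# Boolean mask: True where the ticker is longer than 4 characters
mask = merged['bloomberg_ticker_y'].str.len() > 4

# Overwrite only the masked rows, dropping the first 3 characters
merged.loc[mask, 'bloomberg_ticker_y'] = merged.loc[mask, 'bloomberg_ticker_y'].str[3:]

print(merged['bloomberg_ticker_y'].tolist())  # ['AIM9', 'M9', 'FAM9', 'M9']
```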

cs95
User63164

6 Answers


You can use a list comprehension:

df = pd.DataFrame({'bloomberg_ticker_y' : ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})

df['new'] = [x[-2:] if len(x)>4 else x for x in df['bloomberg_ticker_y']]

Output:

  bloomberg_ticker_y   new
0               AIM9  AIM9
1              DJEM9    M9
2               FAM9  FAM9
3              IXPM9    M9
vlemaistre

You can use numpy.where to apply a condition to pick slices based on string length.

np.where(df['bloomberg_ticker_y'].str.len() > 4, 
         df['bloomberg_ticker_y'].str[3:], 
         df['bloomberg_ticker_y'])
# array(['AIM9', 'M9', 'FAM9', 'M9'], dtype=object)

df['bloomberg_ticker_sliced'] = (
   np.where(df['bloomberg_ticker_y'].str.len() > 4, 
            df['bloomberg_ticker_y'].str[3:], 
            df['bloomberg_ticker_y']))
df
  bloomberg_ticker_y bloomberg_ticker_sliced
0               AIM9                    AIM9
1              DJEM9                      M9
2               FAM9                    FAM9
3              IXPM9                      M9

If you fancy a map-based solution, it is

df['bloomberg_ticker_y'].map(lambda x: x[3:] if len(x) > 4 else x)

0    AIM9
1      M9
2    FAM9
3      M9
Name: bloomberg_ticker_y, dtype: object
cs95
  • Is there a difference between `str[3:]` and `str.slice(start=3)` in terms of speed, or are these the same? – Erfan Jul 01 '19 at 14:45
  • @Erfan They are the same! `.str[n:m]` is a shortcut for `.str.slice(start=n, end=m)`. – cs95 Jul 01 '19 at 14:45
  • Thanks, I was expecting that. Then there's no use case for `str.slice` imo; the only point is that it's more readable? – Erfan Jul 01 '19 at 14:47
  • @Erfan I suppose. It's possible one came before the other, and both were preserved for backward compatibility. There could be various reasons to have many methods to do the same thing... :) – cs95 Jul 01 '19 at 14:49
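The equivalence discussed in these comments is easy to verify on the sample data: `.str[3:]` and `.str.slice(start=3)` produce identical results (a quick check):

```python
import pandas as pd

s = pd.Series(['AIM9', 'DJEM9', 'FAM9', 'IXPM9'])

# Subscript syntax is sugar for the explicit slice method
sugar = s.str[3:]
explicit = s.str.slice(start=3)

print(sugar.equals(explicit))  # True
print(sugar.tolist())          # ['9', 'M9', '9', 'M9']
```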

Since there's quite a variety of answers here, I decided to compare them in terms of speed:

# Create big size test dataframe
df = pd.DataFrame({'bloomberg_ticker_y' : ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})
df = pd.concat([df]*100000)
df.shape

#Out
(400000, 1)

CS95 #1 np.where

%%timeit 
np.where(df['bloomberg_ticker_y'].str.len() > 4, 
         df['bloomberg_ticker_y'].str[3:], 
         df['bloomberg_ticker_y'])

Result:

163 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CS95 #2 map-based solution

%%timeit 
df['bloomberg_ticker_y'].map(lambda x: x[3:] if len(x) > 4 else x)

Result:

86 ms ± 7.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Yatu DataFrame.mask

%%timeit
df.bloomberg_ticker_y.mask(df.bloomberg_ticker_y.str.len().gt(4), 
                           other=df.bloomberg_ticker_y.str[-2:])

Result:

187 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Vlemaistre list comprehension

%%timeit
[x[-2:] if len(x)>4 else x for x in df['bloomberg_ticker_y']]

Result:

84.8 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

pault str.replace with regex

%%timeit
df["bloomberg_ticker_y"].str.replace(r".{3,}(?=.{2}$)", "", regex=True)

Result:

324 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Cobra DataFrame.apply

%%timeit
df.apply(lambda x: (x['bloomberg_ticker_y'][3:] if len(x['bloomberg_ticker_y']) > 4 else x['bloomberg_ticker_y']) , axis=1)

Result:

6.83 s ± 387 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion

  • The fastest method is the list comprehension, closely followed by the map-based solution.

  • The slowest method is DataFrame.apply by far (as expected), followed by str.replace with regex.

Erfan
  • Interesting comparisons here! Did not think map with lambda would beat numpy, nor that list comprehensions would hold their performance for data this large. – cs95 Jul 01 '19 at 15:00
  • Yeah, definitely. I expected numpy to win as well. I might start to use list comprehensions more for conditional comparisons :) @cs95 – Erfan Jul 01 '19 at 15:02

You can use DataFrame.mask:

df['bloomberg_ticker_y'] = (df.bloomberg_ticker_y.mask(
                                      df.bloomberg_ticker_y.str.len().gt(4), 
                                      other=df.bloomberg_ticker_y.str[-2:]))

       bloomberg_ticker_y
0               AIM9
1                 M9
2               FAM9
3                 M9
yatu

You can also use DataFrame.apply:

import pandas as pd

df = pd.DataFrame({'bloomberg_ticker_y' : ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})

df['bloomberg_ticker_y'] = df.apply(lambda x: (x['bloomberg_ticker_y'][3:] if len(x['bloomberg_ticker_y']) > 4 else x['bloomberg_ticker_y']) , axis=1)

Output:

  bloomberg_ticker_y
0               AIM9
1                 M9
2               FAM9
3                 M9
Cobra
  • This works! I upvoted for the attempt. But for your learning curve, try to look at some of the answers above, which are considered best practice and will be much faster because of vectorization :). And btw, welcome to Stack Overflow :) – Erfan Jul 01 '19 at 15:07
  • If you are interested, I included your answer in the speed comparison [here](https://stackoverflow.com/a/56837930/9081267). – Erfan Jul 01 '19 at 15:13
  • @Erfan Did you try to compare execution time with more than 1000 executions? To be honest, I'm surprised that my solution is so slow; I think it's because the dataframe is small :) – Cobra Jul 01 '19 at 15:32
  • See the lines at the top with `df = pd.concat([df]*100000)` – I create a test dataframe with 400,000 rows and run the tests on that. It's general knowledge that `.apply` is very slow compared to the vectorized solutions `pandas`/`numpy` provide. Read more [here](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code/54432584#54432584) about it. You only need apply in very edge-case scenarios, not for this simple problem. – Erfan Jul 01 '19 at 15:36

Another approach is to use regular expressions (note: in pandas 2.0+ you must pass `regex=True` explicitly, since the default changed to literal replacement):

df["bloomberg_ticker_y"].str.replace(r".{3,}(?=.{2}$)", "", regex=True)
#0    AIM9
#1      M9
#2    FAM9
#3      M9

The pattern means:

  • .{3,}: Match 3 or more characters
  • (?=.{2}$): Positive lookahead for exactly 2 characters followed by the end of the string.
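As a quick sanity check, the same pattern can be exercised outside pandas with Python's standard `re` module:

```python
import re

# Strip everything up to the last 2 characters, but only when
# 3 or more characters precede them (i.e. total length > 4)
pattern = re.compile(r".{3,}(?=.{2}$)")

tickers = ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']
trimmed = [pattern.sub("", t) for t in tickers]

print(trimmed)  # ['AIM9', 'M9', 'FAM9', 'M9']
```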
pault