14

I want to store the last digit of each 'UserId' in a new variable (UserId is of type string).

I came up with this, but the df is long and it takes forever. Any tips on how to optimize it or avoid the for loop?

df['LastDigit'] = np.nan
for i in range(0,len(df['UserId'])):
    df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
Wiktor Stribiżew
prp

2 Answers

24

Use str.strip, then take the last character by indexing with str[-1]:

df['LastDigit'] = df['UserId'].str.strip().str[-1]
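As a minimal sketch of what that line does, here is the same expression on a toy frame. The UserId values are made up for illustration and are not from the question. Note that the .str methods pass missing values through as NaN rather than raising.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: stray whitespace and one missing value
df = pd.DataFrame({'UserId': ['user17', 'user42 ', np.nan, ' user93']})

# Strip whitespace on every row, then take the last character of each string.
# Missing values stay missing instead of raising an error.
df['LastDigit'] = df['UserId'].str.strip().str[-1]

print(df)
```

The result is still a string column; convert it afterwards if a numeric digit is needed.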

If performance is important and there are no missing values, use a list comprehension:

df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
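The plain comprehension assumes every value is a non-empty string; a NaN in the column would raise AttributeError on x.strip(). If missing values are possible, a guarded variant along these lines might work. This is only a sketch on made-up data, and it still assumes no string is empty after stripping.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
df = pd.DataFrame({'UserId': ['user17', 'user42 ', np.nan]})

# Only strip real strings; anything else (e.g. NaN) becomes NaN
df['LastDigit'] = [
    x.strip()[-1] if isinstance(x, str) else np.nan
    for x in df['UserId']
]

print(df)
```

The isinstance check adds a little per-row overhead, so the speed advantage over the .str accessor may shrink.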

Your solution is really slow; it is the last (slowest) option in this list:

6) updating an empty frame (e.g. using loc one-row-at-a-time)

Performance:

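import numpy as np
import pandas as pd

# Reproducible sample of 1000 user names, some with trailing whitespace
np.random.seed(456)
users = ['joe','jan ','ben','rick ','clare','mary','tom']
df = pd.DataFrame({
         'UserId': np.random.choice(users, size=1000),
})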

In [139]: %%timeit
     ...: df['LastDigit'] = np.nan
     ...: for i in range(0,len(df['UserId'])):
     ...:     df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
     ...: 
__main__:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
57.9 s ± 1.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [140]: %timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
1.38 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [141]: %timeit df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
343 µs ± 8.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
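The %timeit magics above need IPython. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The helper function names below are hypothetical, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same sample data as in the benchmark above
np.random.seed(456)
users = ['joe', 'jan ', 'ben', 'rick ', 'clare', 'mary', 'tom']
df = pd.DataFrame({'UserId': np.random.choice(users, size=1000)})

def with_str_accessor():
    # Vectorized string methods
    return df['UserId'].str.strip().str[-1]

def with_list_comprehension():
    # Plain Python loop over the column values
    return [x.strip()[-1] for x in df['UserId']]

runs = 100
t_str = timeit.timeit(with_str_accessor, number=runs) / runs
t_list = timeit.timeit(with_list_comprehension, number=runs) / runs

print(f'.str accessor:      {t_str * 1e6:.0f} µs per call')
print(f'list comprehension: {t_list * 1e6:.0f} µs per call')
```

Both functions return the same characters, so the comparison is only about speed.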
jezrael
  • 1
    I do like Jeff's post there (and good job linking it!), but I'm surprised he never mentions list comprehensions (or `np.vectorize`) somewhere in the mix. Yes, they're both loops, but so are most of the alternatives. – jpp Oct 17 '18 at 10:30
  • @jezrael - why do you need the .str.strip()? Isn't df['LastDigit'] = df['UserId'].str[-1] sufficient? – Krish Srinivasan Dec 21 '22 at 21:40
  • @KrishSrinivasan - I added `strip` because it is used in the solution in the question. – jezrael Dec 22 '22 at 07:35
  • 1
    @KrishSrinivasan - I had the same question. The strip() method removes any leading or trailing whitespace from the string values and returns a new Series, so you need .str again to reach the string methods; then you can pick a character by its index with [i] or get(i). You can skip strip() if you know that the values in your column are clean. – JFMoya Mar 19 '23 at 20:40
3

Another option is to use apply. It is not as fast as the list comprehension, but it is very flexible depending on your goals. Here are some timings on a random dataframe with shape (44289, 31):

%timeit df['LastDigit'] = df['UserId'].apply(lambda x: str(x)[-1]) #if some variables are not strings
12.4 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
31.5 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['LastDigit'] = [str(x).strip()[-1] for x in df['UserId']]
9.7 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
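One caveat with the str(x) cast, shown here as a small sketch on made-up mixed data: it lets non-string IDs through, but it also turns a missing value into the text 'nan', so its "last digit" comes out as 'n'. A pd.notna guard is one possible way around that.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed column: a string, an integer and a missing value
s = pd.Series(['user17 ', 1234, np.nan], name='UserId')

# str(np.nan) is 'nan', so the missing row yields the letter 'n'
naive = s.apply(lambda x: str(x).strip()[-1])

# Guarded version: keep missing values as NaN
guarded = s.apply(lambda x: str(x).strip()[-1] if pd.notna(x) else np.nan)

print(naive.tolist())
print(guarded.tolist())
```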
el_Rinaldo