0

I have to cleanup the dataframe column "id" to ensure that each value has a length of five and any value that is less than five needs to be prepended with zeros.

The following code works great on a small dataframe however, when I run for loop against my larger datagrame of ~500k rows, it still wasn't finished after 30 min.

#sample dataframe
df1 = pd.DataFrame({'id': ['676', '1931'],
                   'fu': ['bar', 'baz']})

# for loop used to update id
for id in df1['id']:
    if len(id) < 5:
        delta = (5 - len(id))
        new_id = ("0" * delta) + id
        df1.loc[df1['id'] == id, 'id'] = new_id

Can I speed this up? Is there an alternative to .loc that I could use?

FunnyChef
  • 1,880
  • 3
  • 18
  • 30

1 Answers1

3

Try to use zfill

df1.id.str.zfill(5)

Out[430]:
0        00676
1        01931
Name: id, dtype: object

Timing:

Sample:

In [450]: df1
Out[450]:
               id   fu
0             676  bar
1            1931  baz
2       561293964  NaN
3             676  bar
4            1931  baz
...           ...  ...
599995       1931  baz
599996  561293964  NaN
599997        676  bar
599998       1931  baz
599999  561293964  NaN

[600000 rows x 2 columns]

In [451]: %timeit df1.id.str.zfill(5)
372 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Ansev solution:

In [452]: %timeit (5 - df1['id'].str.len()).where(lambda x: x.gt(0), 0).map(lambda x: '0'*x) + df1['id']
541 ms ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Andy L.
  • 24,909
  • 4
  • 17
  • 29
  • I think this is more elegant but slower – ansev Aug 26 '20 at 16:52
  • 1
    @ansev my machine did this to 500k rows in a couple of milliseconds – Paul H Aug 26 '20 at 17:01
  • @PaulH: Thanks for the testing. I am really frown upon when seeing explicit loop using in pandas, so I give him an alternative. I didn't expect it is that fast on 500K rows :) – Andy L. Aug 26 '20 at 17:04
  • @ansev you should add your solution to one of those dupe questions. Yours is definitely faster than `...str.zfill9x)` – wwii Aug 26 '20 at 17:12
  • @wwii: your comment piques my curiosity, so I tested both solutions on sample of 600K rows. The testing shows `zfill` still faster than @ansev solution. – Andy L. Aug 26 '20 at 17:30
  • 1
    ... My bad I looked at the order of the results wrong zfill is faster. $#%#@ – wwii Aug 26 '20 at 17:52
  • @wwii: Cheers... :) – Andy L. Aug 26 '20 at 17:55