Given a DataFrame (with pandas imported as pd and NumPy as np):

df = pd.DataFrame(['0', '1', '2', '3'], columns=['a'])

What is the difference between using

df['b'] = df['a'].apply(np.int)

,

df['b'] = df['a'].apply(lambda x : int(x))

and

df['b'] = df['a'].astype(int)

?

I'm aware that all three give the same result, but are there any differences between them?

bryan.blackbee
  • Possible duplicate of [Difference between np.int, np.int\_, int, and np.int\_t in cython?](https://stackoverflow.com/questions/21851985/difference-between-np-int-np-int-int-and-np-int-t-in-cython) – Dominique Paul Oct 21 '18 at 11:23

3 Answers


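np.int was simply an alias for the built-in int. It was deprecated in NumPy 1.20 and removed in NumPy 1.24, so on current releases use int directly.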

You can test this by running:

import numpy as np
print(int == np.int)

which prints True on NumPy versions that still provide np.int.
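
Because the check above fails with an AttributeError where np.int is no longer available, a version-tolerant variant may be handy. This is only a minimal sketch; the name int_type is illustrative and not part of the original answer:

```python
import numpy as np

# Look up np.int if this NumPy still has it; otherwise fall back to the
# built-in int. Some older releases may emit a DeprecationWarning here.
int_type = getattr(np, "int", int)

# Either way the result is the built-in int type.
print(int_type is int)
```

In new code it is simpler to write int and skip the lookup entirely.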

Also, consider checking out this question, which explains the differences very thoroughly.

Dominique Paul

The two lines below use pandas' apply to call the cast on each element in turn. NumPy's int cast is the same as Python's int cast, so the two are equivalent.

df['b'] = df['a'].apply(np.int)
df['b'] = df['a'].apply(lambda x : int(x))

The astype method, however, casts the whole Series to the specified dtype. Here that is int, which pandas typically maps to int64.

df['b'] = df['a'].astype(int)

astype is a vectorized operation, and I would prefer it over apply: apply calls a Python function once per element, which makes it noticeably slower than astype.
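
As a quick check of the equivalence described above, the following sketch runs both approaches on the DataFrame from the question. It uses the built-in int in place of np.int, and the names via_apply and via_astype are illustrative:

```python
import pandas as pd

df = pd.DataFrame(['0', '1', '2', '3'], columns=['a'])

# Element-wise: int() is called once for every value in the column.
via_apply = df['a'].apply(int)

# Whole-column cast: pandas converts the Series in one operation.
via_astype = df['a'].astype(int)

# Both hold the same values, and both end up with an integer dtype.
print(via_apply.tolist() == via_astype.tolist())
print(via_apply.dtype, via_astype.dtype)
```

The exact integer width printed can depend on the platform and library versions, so the sketch prints the dtypes rather than assuming one.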

Vishnudev Krishnadas

apply works by looping over the data and converting each value to an integer one at a time, so it is slower than astype.

df = pd.DataFrame(np.arange(10**7).reshape(10**4, 10**3)).astype(str)

# Performance
%timeit df[0].apply(np.int)
7.15 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df[0].apply(lambda x : int(x))
9.57 ms ± 405 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The two perform about the same.

astype, by contrast, is optimized and runs faster than apply:

%timeit df[0].astype(int)
1.94 ms ± 96.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If you want an even faster approach, cast the underlying NumPy array directly:

%timeit df[0].values.astype(np.int)
1.26 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As the timings show, astype is roughly 3.7 times faster than apply here, and casting the NumPy array directly is faster still.
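
The %timeit lines above only work in IPython or Jupyter. As a rough sketch of how the comparison could be repeated in a plain Python script, the standard timeit module can be used instead. The assumptions here: a smaller Series than in the benchmark above, the built-in int in place of np.int, Series.to_numpy() in place of .values, and the illustrative names s, candidates and timings. Absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# A Series of 10**4 numeric strings, the same length as df[0] above.
s = pd.Series(np.arange(10**4)).astype(str)

candidates = {
    "apply(int)": lambda: s.apply(int),
    "astype(int)": lambda: s.astype(int),
    "to_numpy().astype(int)": lambda: s.to_numpy().astype(int),
}

timings = {}
for label, fn in candidates.items():
    # Best of 3 repeats, 10 calls each, reported as seconds per call.
    timings[label] = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{label:>24}: {timings[label] * 1e3:.3f} ms per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes.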

Sai Kumar