
With this PyCon talk as a source.

import numpy as np

def clean_string(item):
    if type(item) == type(1):
        return item
    else:
        return np.nan

My DataFrame has a column containing both numerical and string data. I want to change the strings to np.nan while leaving the numerical data as it is.

This approach works fine:

df['Energy Supply'].apply(clean_string)

but when I try to use vectorisation, all of the values in the column change to np.nan:

df['Energy Supply'] = clean_string(df['Energy Supply'])  # vectorisation

I believe this is because type(item) in the clean_string function is then pd.Series rather than int.

Is there a way to overcome this problem?

PS: I am a beginner in pandas


1 Answer


Vectorizing an operation in pandas isn't always possible. I'm not aware of a pandas built-in vectorized way to get the type of the elements in a Series, so your .apply() solution may be the best approach.

The reason your code doesn't work in the second case is that you are passing the entire Series to your clean_string() function. The function compares the type of the Series to type(1), which is False, so it returns a single np.nan. Pandas broadcasts that value when you assign it back to the DataFrame, which is why you get a column of NaN. To avoid this, you would have to loop over all of the elements of the Series inside clean_string().
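To see the broadcasting in action, here is a small reproduction with made-up data (the column values below are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

def clean_string(item):
    if type(item) == type(1):
        return item
    else:
        return np.nan

df = pd.DataFrame({"Energy Supply": [100, "missing", 200, "n/a"]})

# The whole Series is passed in; its type is not int, so a single
# np.nan comes back ...
result = clean_string(df["Energy Supply"])

# ... and assigning that scalar back broadcasts it over every row.
df["Energy Supply"] = result
print(df["Energy Supply"].isna().all())  # True
```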

Out of curiosity, I tried a few other approaches to see if any of them would be faster than your version. To test, I created 10,000- and 100,000-element pd.Series objects with alternating integer and string values:

import numpy as np
import pandas as pd

s = pd.Series(i if i%2==0 else str(i) for i in range(10000))
s2 = pd.Series(i if i%2==0 else str(i) for i in range(100000))

These tests are done using pandas 1.0.3 and python 3.8.

Baseline using clean_string()

In []: %timeit s.apply(clean_string)
3.75 ms ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In []: %timeit s2.apply(clean_string)
39.5 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Series.str methods

An alternative way to test for strings vs. non-strings is to use the built-in .str methods on the Series: for example, .str.len() returns NaN for any non-string element. These are even called "Vectorized String Methods" in the pandas documentation, so maybe they will be more efficient.

In []: %timeit s.mask(s.str.len()>0)
6 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In []: %timeit s2.mask(s2.str.len()>0)
56.8 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Unfortunately, this approach is slower than the .apply(). Despite being "vectorized", it doesn't look like a better approach. It is also not quite identical to the logic of clean_string(), because it tests for elements that are strings, not for elements that are integers.
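A small example (with made-up values) also shows that behavioural difference: any non-string survives the mask, not just integers:

```python
import numpy as np
import pandas as pd

s3 = pd.Series([1, "a", 2.5, "b"])

# .str.len() returns NaN for non-strings, and NaN > 0 is False, so
# only the string positions are masked out; the float 2.5 slips through.
out = s3.mask(s3.str.len() > 0)
print(out.tolist())  # [1, nan, 2.5, nan]
```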

Applying type directly to the Series

Based on this answer, I decided to test using .apply() with type to get the type of each element. Once we know the type, we can compare it to int and use the .mask() method to convert any non-integers to NaN.

In []: %timeit s.mask(s.apply(type)!=int)
1.88 ms ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In []: %timeit s2.mask(s2.apply(type)!=int)
15.2 ms ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This turns out to be the fastest approach that I've found.
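As a minimal sketch of that approach on the same kind of mixed data (illustrative values):

```python
import numpy as np
import pandas as pd

s3 = pd.Series([1, "a", 2.5, "b"])

# Mask every element whose type is not exactly int. Unlike the
# .str.len() version, the float 2.5 is also converted to NaN,
# which matches the original clean_string() logic.
out = s3.mask(s3.apply(type) != int)
print(out.tolist())  # [1, nan, nan, nan]
```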

Craig
  • 4,605
  • 1
  • 18
  • 28