1

I have a column in a dataframe that I need to join on. The column contains mixed data types, eg:

s = pd.Series([3985500,'3985500',3985500.0,'3985500.0','3985500A','3985500B'])

I'm trying to convert everything that's numeric to int to ensure the key is found when joining. Whatever is string can remain string and the final column format is allowed to be string, as long as the floats are converted to int.

I have tried astype(), but it ignores floats and for some reason I keep on getting scientific notation (see index 2 and 3):

s.astype(int, errors='ignore')

0       3985500
1       3985500
2    3.9855e+06
3     3985500.0
4      3985500A
5      3985500B
dtype: object

I get pd.to_numeric to work on floats with a try-except:

try: int(pd.to_numeric(s[3]))
except ValueError: s[3]

3985500
dtype: int

However, as soon as I try it in a function it returns nothing:

def convert_to_int(cell):
  try: int(pd.to_numeric(cell))
  except ValueError: cell

convert_to_int(s[3])

Any idea why this is happening? There might be other workarounds, but why is it not working when it's in a function?

I wish to use this function with s.apply(). I have looked at a couple of similar posts:

RianLauw
  • 85
  • 7
  • 1
    `return int(pd.to_numeric(cell))`. Also, you can probably pass `downcast` argument as "integer" to `to_numeric` and avoid having to use `int` to cast. – lightalchemist Jul 19 '20 at 08:02
  • Do you run this in a notebook? Your bare try-except should not produce output either, otherwise. – MisterMiyagi Jul 19 '20 at 08:05
  • Thanks @lightalchemist, the return is exactly what was missing. See the accepted answer. However, using the downcast option still returns the value at index 2 as a float for some reason. – RianLauw Jul 19 '20 at 15:12
  • @MisterMiyagi, yes I'm running this in a notebook which is why it was displaying a result in the bare `try-except`. – RianLauw Jul 19 '20 at 15:12

2 Answers2

1

You are not returning any value from your function i.e,

def convert_to_int(cell):
  try: 
      return int(pd.to_numeric(cell))
  except ValueError: 
      cell

convert_to_int(s[3])

And to use apply() syntax:

s.apply(lambda x:convert_to_int(x))
smile
  • 574
  • 5
  • 11
Xythprynx
  • 39
  • 6
  • Thanks! This was a rookie mistake. I assumed the return is implied as it was displaying a value in Google Colab without the return when it's without the function. But it makes sense as it was not a function yet. I accepted this as the correct answer as it answers the question on why it's not working as a function. The answer by @MayankPorwal seems more efficient though. – RianLauw Jul 19 '20 at 14:59
  • Btw, the `except` also needs a return if I'm not mistaken? – RianLauw Jul 19 '20 at 15:00
1

You need to convert all numeric type values into int. You don't need to have a separate function, you can simply use Series.apply, like this:

In [202]: s = pd.Series([3985500,'3985500',3985500.0,'3985500.0','3985500A','3985500B'])
In [203]: s = s.apply(lambda x:x if isinstance(x, str) else int(x)) 

In [204]: s
Out[204]: 
0      3985500
1      3985500
2      3985500
3    3985500.0
4     3985500A
5     3985500B
dtype: object

The above command converts all numeric types to int, keeping string as it is.

halfer
  • 19,824
  • 17
  • 99
  • 186
Mayank Porwal
  • 33,470
  • 8
  • 37
  • 58
  • Thanks, but this method still returns the number at index 3 as a float. It seems more efficient though. – RianLauw Jul 19 '20 at 15:02
  • It returns a float at index 3 because, it us defined as a string while initializing `s`. – Mayank Porwal Jul 19 '20 at 15:06
  • I see what you mean. That's why I went with the `try-except`, to first try and convert to int. How would you make provision for the string at index 3? – RianLauw Jul 19 '20 at 15:17