
I have a Pandas DataFrame whose columns look something like this:

df:

Column0   Column1     Column2
'MSC'       '1'        'R2'
'MIS'       'Tuesday'  '22'
'13'        'Finance'  'Monday'

So these columns contain actual strings, but also numeric values (integers) stored in string format.

I found this nice post about the pd.to_numeric and astype() methods, but I can't see if or how I could use them in my case.

Using:

pd.to_numeric(df, errors = 'ignore')

just results in skipping the whole columns. Instead of skipping whole columns, I only want to skip the individual strings in those columns which can't be converted, move on to the next entry, and try to convert the next string.

So in the end, my dataframe would look like this:

df:

Column0   Column1     Column2
'MSC'       1          'R2'
'MIS'      'Tuesday'    22
 13        'Finance'  'Monday'

Is there maybe an efficient way to loop over these columns and achieve that?

Best regards, Jan

EDIT: Thanks for all your suggestions! Since I am still a Python beginner, @coldspeed's and @sacul's answers are easier for me to understand, so I will go with one of them!

JanB
    This just doesn't fit into the pandas model. You're looking for mixed types in a single column... unless you just prefer pandas syntax, you're losing any kind of efficiency gains you might get in data processing – roganjosh Dec 04 '18 at 23:07
    Do you still need to retain the strings that won't convert? If so, would moving them to a new column be acceptable? Mixed data types in a single column doesn't work. – Dan Dec 04 '18 at 23:08
    @roganjosh, although sometimes when you are using pandas to clean up data that you received with mixed types, this is relevant. – Nesha25 Feb 16 '22 at 23:40

4 Answers


100% agree with the comments: mixing dtypes in columns is a terrible idea, performance-wise.

For reference, however, I would do this with pd.to_numeric and fillna:

df2 = df.apply(pd.to_numeric, errors='coerce').fillna(df)
print(df2)
  Column0  Column1 Column2
0     MSC        1      R2
1     MIS  Tuesday      22
2      13  Finance  Monday

The resulting columns have object dtype, which is what allows the mixed values. You can see this when you extract the values:

print(df2.values.tolist())
[['MSC', 1.0, 'R2'], ['MIS', 'Tuesday', 22.0], [13.0, 'Finance', 'Monday']]
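If you want to reproduce this end-to-end, here is a self-contained sketch that rebuilds the example frame from the question (column names and values taken from there):

```python
import pandas as pd

# Rebuild the example frame from the question (all cells are strings)
df = pd.DataFrame({
    'Column0': ['MSC', 'MIS', '13'],
    'Column1': ['1', 'Tuesday', 'Finance'],
    'Column2': ['R2', '22', 'Monday'],
})

# errors='coerce' turns every non-convertible cell into NaN;
# fillna(df) then puts the original strings back into exactly those slots
df2 = df.apply(pd.to_numeric, errors='coerce').fillna(df)

print(df2.loc[0, 'Column1'])  # 1.0 (converted to a number)
print(df2.loc[1, 'Column1'])  # Tuesday (left as a string)
```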
cs95

I would apply pd.to_numeric with errors='coerce', and update the original dataframe according to the results (see caveats in comments):

# show original string type:
df.loc[0,'Column1']
# '1'

df.update(df.apply(pd.to_numeric, errors='coerce'))

>>> df
  Column0  Column1 Column2
0     MSC        1      R2
1     MIS  Tuesday      22
2      13  Finance  Monday

# show updated float type:
df.loc[0,'Column1']
# 1.0
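For completeness, the same steps as a runnable sketch (frame rebuilt from the question). The reason this works: DataFrame.update skips NaN cells in the incoming frame, so only the successful conversions overwrite the original strings.

```python
import pandas as pd

df = pd.DataFrame({
    'Column0': ['MSC', 'MIS', '13'],
    'Column1': ['1', 'Tuesday', 'Finance'],
    'Column2': ['R2', '22', 'Monday'],
})

# Failed conversions become NaN, and update() ignores NaN in the
# incoming frame, so the original strings survive where coercion failed
df.update(df.apply(pd.to_numeric, errors='coerce'))

print(df.loc[0, 'Column1'])  # 1.0
print(df.loc[1, 'Column1'])  # Tuesday
```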
sacuL

Or you could simply use the isnumeric() method of str. I like it because the syntax is clear, although, as coldspeed's comment points out, this can become very slow on a large df.

df = df.applymap(lambda x: int(x) if x.isnumeric() else x)

Example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([['a','b','c'],['1','1a','c']],columns=['Col1','Col2','Col3'])

In [3]: df
Out[3]:
  Col1 Col2 Col3
0    a    b    c
1    1   1a    c

In [4]: df.Col1.map(lambda x: int(x) if x.isnumeric() else x)
Out[4]:
0    a
1    1
Name: Col1, dtype: object
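One caveat to be aware of: str.isnumeric only matches plain digit strings, so negative numbers and decimals would slip through unconverted:

```python
# isnumeric() is True only for plain digit strings; a sign or a
# decimal point makes it return False, so those cells stay strings
print('13'.isnumeric())   # True
print('-13'.isnumeric())  # False
print('2.5'.isnumeric())  # False
```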

Using to_numeric with errors='ignore':

df = df.applymap(lambda x: pd.to_numeric(x, errors='ignore'))
df
  Column0  Column1 Column2
0     MSC        1      R2
1     MIS  Tuesday      22
2      13  Finance  Monday
df.applymap(type)
                 Column0                Column1                Column2
0          <class 'str'>  <class 'numpy.int64'>          <class 'str'>
1          <class 'str'>          <class 'str'>  <class 'numpy.int64'>
2  <class 'numpy.int64'>          <class 'str'>          <class 'str'>
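Note that errors='ignore' has since been deprecated in newer pandas versions. If you need the same per-element behaviour going forward, a small try/except wrapper is one way to sketch it (the helper name here is my own, not a pandas API):

```python
import pandas as pd

def to_number_or_keep(x):
    # Per-element stand-in for to_numeric(..., errors='ignore'):
    # return the converted number, or the original value on failure
    try:
        return pd.to_numeric(x)
    except (ValueError, TypeError):
        return x

df = pd.DataFrame({'Column2': ['R2', '22', 'Monday']})
out = df['Column2'].map(to_number_or_keep)
print(out.tolist())
```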
BENY