6

Given a pandas.DataFrame with a column holding mixed datatypes, like e.g.

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})

I was wondering how to obtain the datatypes of the individual objects in the column (Series)? Suppose I want to modify all entries in the Series that are of a certain type, like multiply all integers by some factor.

I could iteratively derive a mask and use it in loc, like

m = np.array([isinstance(v, int) for v in df['mixed']])

df.loc[m, 'mixed'] *= 10

# df
#                  mixed
# 0  2020-10-04 00:00:00
# 1                 9990
# 2             a string

That does the trick but I was wondering if there was a more pandastic way of doing this?

accdias
  • 5,160
  • 3
  • 19
  • 31
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
  • 1
    I would think conversely: If "real" efficient vectorization (built-in C++-level optimized functions) were to take place, are these functions likely to perform the costly type check themselves **for each element**? I guess that is not likely the case because it defies the central idea of efficient vectorization. Therefore I guess python-level type checks could not be avoided. In this case, map/apply/list comprehension/etc. are [not likely to differ much efficiency-wise](https://stackoverflow.com/questions/35215161). (P.s. This argument does not concern syntactic aesthetics.) – Bill Huang Oct 13 '20 at 06:57

5 Answers5

4

Still need call type

m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string
BENY
  • 317,841
  • 20
  • 164
  • 234
4

One idea is test if numeric by to_numeric with errors='coerce' and for non missing values:

m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print (df)
                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string

Unfortunately is is slow, some another ideas:

N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})


In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pandas by default here cannot use vectorization effectivelly, because mixed values - so is necessary elementwise approaches.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

If you want to multiple all 'numbers' then you can use the following.

Let's use pd.to_numeric with parameter errors = 'coerce' and fillna:

df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
df

Output:

                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string

Let's add a float to the column

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', 100.3]})

Using @BenYo:

m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df

Output (note only the integer 999 is multiplied by 10):

                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string
3                100.3

Using @jezrael and similiarly this solution:

m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)

# Or this solution
# df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])

Output (note all numbers are multiplied by 10):

                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string
3                 1003
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • 1
    that's a good point, `to_numeric` won't allow to distinguish between int and float. In practice this could be ok (if it's just about to modify *all numbers*). – FObersteiner Oct 14 '20 at 15:26
1

If you do many calculation and have a littile more memory, I suggest you to add a column to indicate the type of the mixed, for better efficiency. After you construct this column, the calculation is much faster.

here's the code:

N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
df["mixed_type"] = df.mixed.map(lambda x: type(x).__name__).astype('category')
m = df.mixed_type == 'int'
df.loc[m, "mixed"] *= 10
del df["mixed_type"] # after you finish all your calculation

the mixed_type column repr is

0          Timestamp
1                int
2                str
3          Timestamp
4                int
             ...    
2999995          int
2999996          str
2999997    Timestamp
2999998          int
2999999          str
Name: mixed, Length: 3000000, dtype: category
Categories (3, object): [Timestamp, int, str]

and here's the timeit

>>> %timeit df.mixed_type == 'int'
472 µs ± 57.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.12 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Hieast
  • 87
  • 5
  • this is what I had in mind originally, only that I was hoping there would be a way around creating an extra column for the types :) A good approach anyways if operations should be done for multiple different types! – FObersteiner Oct 17 '20 at 09:16
  • @MrFuppes I guess BEN_YO's answer is the only answer, you must call `type` to get the datatypes , pandas only maintain a series level dtype `object` for mixed dtype https://pbpython.com/pandas_dtypes.html. Other answers are only sutable for int type. My answer is to show `category` dtype help make it faster. – Hieast Oct 17 '20 at 09:53
  • @MrFuppes I think maintaining each cell's datatype is far from efficiency, most people won't use them. If anyone has this demand, just do himself. – Hieast Oct 17 '20 at 09:56
  • to me, Bill Huang's comment on the question is a very good summary. everything else is more or less *syntactic aesthetics*. But I hope all the answers here hopefully help others if they encounter the problem. – FObersteiner Oct 17 '20 at 10:05
1

For not very long data frames I can suggest this way as well:

df = df.assign(mixed = lambda x: x.apply(lambda s: s['mixed']*10 if isinstance(s['mixed'], int) else s['mixed'],axis=1))
Mehdi Golzadeh
  • 2,594
  • 1
  • 16
  • 28