
In a pandas DataFrame, a column with dtype object can, in fact, contain items of mixed types, e.g. integers and strings.

In this example, column a has dtype object, but the first item is a string while all the others are ints:

import numpy as np, pandas as pd

df = pd.DataFrame()
df['a'] = np.arange(0, 9)  # column starts out as dtype int64
df.iloc[0, 0] = 'test'     # assigning a string upcasts the column to object
print(df.dtypes)
print(type(df.iloc[0, 0]))
print(type(df.iloc[1, 0]))

My question is: is there a quick way to identify which columns with dtype object in fact contain mixed types like the above? Since pandas does not have a dtype str, this is not immediately apparent.

However, when importing a large CSV file into pandas, I have had situations where I would get a warning like:

sys:1: DtypeWarning: Columns (15,16) have mixed types. Specify dtype option on import or set low_memory=False

Is there an easy way to replicate that and explicitly list the columns with mixed types? Or do I manually have to go through them one by one, see if I can convert them to string, etc?

The background is that I am trying to export a dataframe to a Microsoft SQL Server using DataFrame.to_sql and SQLAlchemy. I get an

OverflowError: int too big to convert

but my dataframe does not contain columns with dtype int - only object and float64. I'm guessing this is because one of the object columns must have both strings and integers.
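The suspicion can be checked without a database: a minimal sketch (column name `a` is just an example) of an object column that mixes a string with a Python int too large for a 64-bit SQL integer column:

```python
import pandas as pd

# An object column holding both a string and an int that exceeds the
# 64-bit range a SQL INT/BIGINT column can store.
df = pd.DataFrame({'a': ['x', 2**70]})
print(df['a'].dtype)               # object
print([type(v) for v in df['a']])  # [str, int]
```

The column dtype is object either way, so the oversized int is invisible until the export fails.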

Pythonista anonymous
  • just use `dtype` parameter when calling `df.to_sql(...)` – MaxU - stand with Ukraine Nov 16 '17 at 09:55
  • That would mean passing a dictionary mapping column names to dtypes, ie mapping all the object columns to string. Two doubts: 1) it doesn't seem very different from converting to str in pandas itself, before exporting. 2) what else could an object column contain (which I wouldn't want to convert to string)? I suppose dates, right? – Pythonista anonymous Nov 16 '17 at 10:04
  • I don't understand what is the problem. You can't have mixed data types (for one column) in the SQL DB, so you either cast your values yourself or it will be done for you implicitly... – MaxU - stand with Ukraine Nov 16 '17 at 10:09
  • The 'done implicitly' part doesn't work - see the overflow error above. More generally, I'd like to understand if there is a quick way to identify columns with mixed types, so that I convert only those, and don't needlessly convert object columns which contain strings only. – Pythonista anonymous Nov 16 '17 at 10:15
  • piRSquared has provided you a way to identify them, but IMO it would be much more efficient to specify correct dtypes instead of scanning the whole DF to find mixed dtypes – MaxU - stand with Ukraine Nov 16 '17 at 10:17
  • In theory I agree. In practice, however, there are situations where I receive big tables with, say, 60 columns; initially I only need 10, so, while I need to import everything, it would not be the best use of my times to cleanse and specify correct dtypes for all 60 columns. My compromise is to cleanse only those that I otherwise couldn't import, and worry about the others later, if and when I'll need to work on them. – Pythonista anonymous Nov 16 '17 at 10:24
  • I'm afraid you will have to do it anyway (assuming those tables have at least one string/varchar column) because of: [reason 1](https://stackoverflow.com/questions/42727990/speed-up-to-sql-when-writing-pandas-dataframe-to-oracle-database-using-sqlalch/42769557?s=1|55.3166#42769557), [reason 2](https://stackoverflow.com/questions/39504351/save-pandas-string-object-column-as-varchar-in-oracle-db-instead-of-clob-defa/39514888?s=2|47.2155#39514888). – MaxU - stand with Ukraine Nov 16 '17 at 10:26
  • I solved the puzzle, but I'd heed @MaxU's advice. He is smart, informed, and always helpful. – piRSquared Nov 16 '17 at 10:29
  • Ah, interesting. Do you know if the same applies when exporting to Microsoft SQL Server? – Pythonista anonymous Nov 16 '17 at 10:31
  • @piRSquared, thank you! I wish I were as smart as you :-D – MaxU - stand with Ukraine Nov 16 '17 at 10:34
  • @Pythonistaanonymous, unfortunately I don't have SQL Server to play with, but you will see it immediately after importing data from Pandas... – MaxU - stand with Ukraine Nov 16 '17 at 10:35
  • There's a problem with converting object columns to strings: NaN becomes the string 'nan'. Yet another proof that pandas absolutely sucks at dealing with missing values! E.g. groupbys also get rid of missing values, without even a warning. – Pythonista anonymous Nov 16 '17 at 14:21
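For reference, a minimal sketch of the `dtype`-mapping approach suggested in the comments, assuming SQLAlchemy is installed; an in-memory SQLite engine stands in here for SQL Server, and the table name `example` is made up:

```python
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('sqlite://')  # stand-in for SQL Server

df = pd.DataFrame({'a': [1, 'test', 3], 'b': [0.1, 0.2, 0.3]})
df['a'] = df['a'].astype(str)  # cast the mixed column before exporting

# Map the object column to an explicit SQL type via the dtype parameter
df.to_sql('example', engine, index=False,
          dtype={'a': sqlalchemy.types.String(50)})

print(pd.read_sql('SELECT * FROM example', engine))
```

Note the caveat raised above: `astype(str)` turns NaN into the literal string 'nan', so missing values need handling first.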

1 Answer


Setup

df = pd.DataFrame(np.ones((3, 3)), columns=list('WXY')).assign(Z='c')
df.iloc[0, 0] = 'a'
df.iloc[1, 2] = 'b'

df

   W    X  Y  Z
0  a  1.0  1  c
1  1  1.0  b  c
2  1  1.0  1  c

Solution
Find all types and count how many unique ones per column.

df.loc[:, df.applymap(type).nunique().gt(1)]

   W  Y
0  a  1
1  1  b
2  1  1
piRSquared
  • Just thinking, this doesn't distinguish between a column that contains all strings and some nans (which are float), and one which contains, say, strings and integers – Pythonista anonymous Nov 17 '17 at 09:15
  • I do wonder how people can use pandas for data science if it's so incredibly bad at dealing with missing values! – Pythonista anonymous Nov 17 '17 at 12:43
  • That’s like saying “I wonder why surgeons use scalpels if they’re so bad at controlling the bleeding.” – piRSquared Nov 17 '17 at 13:40
  • No, no relevance whatsoever. Missing values are a fact of life. Dealing with them is incredibly painful in Pandas, because the same tasks that are straightforward and natural in other tools (eg any database) require you to jump through many hoops. Note that this is not a result of an explicit decision by the Pandas developers, but more the fact that Pandas was built on top of numpy. Why does pandas groupby remove nans? Why can't a column of integers or strings have nulls? Why can nulls only be floats? Please, someone explain why any of this makes any sense! – Pythonista anonymous Nov 17 '17 at 14:25
  • @Pythonistaanonymous There are actually two types of nulls in pandas: `float('nan')`, and native python `None`. `None` can only be used with the Object dtype, though. – T.C. Proctor Dec 06 '17 at 18:42
  • The fact remains that a column of int cannot contain nulls – Pythonista anonymous Dec 07 '17 at 21:55
  • Just to clarify one little confusion: what does Columns (15,16) mean? Index numbers for columns starting with zero? Does it include anything between 15 and 16? Say I get (23,25), does it include 24 or just the two columns? – Meet Mar 16 '21 at 13:37