0

I am using pandas and I have a column that has numbers but when I check for datatype I get the column is an object. I think one of the rows in that column is actually a string. How can I find out which row is the string? For example:

Name    A    B
John    0    1
Rich    1    0
Jim     O    1

Jim has the letter "O" instead of zero on column A. what can I use in pandas to find which row has the string instead of the number if I have thousands of rows? In this example I used the letter O, but it could be any letter, really.

user7146708
  • 187
  • 4
  • 12
  • `object` means that it holds generic Python-typed values, which can be any type Python knows—an `int`, a `str`, a `list` of `set`s of some custom `namedtuple` type that you created, whatever. And you can just call methods on those objects (e.g., via Pandas' `apply`) the same way you do with any other Python variables. – abarnert May 25 '18 at 21:58
  • Related: [Strings in a DataFrame, but dtype is object](https://stackoverflow.com/questions/21018654/strings-in-a-dataframe-but-dtype-is-object) – jpp May 25 '18 at 22:03

4 Answers4

0

You can use pandas.to_numeric to see what doesn't get converted to a number. Then with .isnull() you can subset your original df to see exactly which rows are the problematic ones.

import pandas as pd
df[pd.to_numeric(df.A, errors='coerce').isnull()]
#  Name  A  B
#2  Jim  O  1

If you're not sure which column is problematic, you could so something like (assuming you want to check everything other than the 1st name column):

df2 = pd.DataFrame()
for col in df.columns[1::]:
    df2[col] = pd.to_numeric(df[col], errors='coerce')

df[df2.isnull().sum(axis=1).astype(bool)]
#  Name  A  B
#2  Jim  O  1
ALollz
  • 57,915
  • 7
  • 66
  • 89
0

The dtype object means that the column holds generic Python-typed values.

Those values can be any type Python knows—an int, a str, a list of sets of some custom namedtuple type that you created, whatever.

And you can just call normal Python functions or methods on those objects (e.g., by accessing them directly, or via Pandas' apply) the same way you do with any other Python variables.

And that includes the type function, the isinstance function, etc.:

>>> df = pd.DataFrame({'A': [0, 1, 'O'], 'B': [1, 0, 1]})
>>> df.A
0    0
1    1
2    O
Name: A, dtype: object
>>> df.A.apply(type)
0    <class 'int'>
1    <class 'int'>
2    <class 'str'>
Name: A, dtype: object
>>> df.A.apply(lambda x: isinstance(x, str))
0    False
1    False
2     True
Name: A, dtype: bool
>>>  df.A.apply(repr)
0      0
1      1
2    'O'
Name: A, dtype: object

… and so on.

abarnert
  • 354,177
  • 51
  • 601
  • 671
0

I'd like to add another very short and concise solution which would be a combination of ALollz and abarnert.

First let's find all columns that are of type object with cols = (df.dtypes == 'object').nonzero()[0]. Let us filter those out using iloc and apply pd.to_numeric (and let us also not include the name column using a slice of the col variable [1:]). Then we check for na-values and if any(1) (row-wise) then we return back the index of that row.

Full example:

import pandas as pd

data = '''\
Name    A    B   C
John    0    1   O
Rich    1    0   2
Jim     O    1   O'''

df = pd.read_csv(pd.compat.StringIO(data), sep='\s+')

cols = (df.dtypes == 'object').nonzero()[0]
rows = df.iloc[:,cols[1:]].apply(pd.to_numeric, errors='coerce').isna().any(1).nonzero()[0]
print(rows)

Returns:

[0 2]  # <-- Means that row 0 and 2 contain N/A-values in at least 1 column

This answers your question: what can I use in pandas to find which row has the string instead of the number but for all columns looking for strings by assuming they can't be converted to numbers with pd.to_numeric.

Anton vBR
  • 18,287
  • 5
  • 40
  • 46
0
types = list(df['A'].apply(lambda x: type(x)))
names = list(df['Name'])
d = dict(zip(names, types))

This will give you a dictionary of {name:type} so you know which name has a string value in column A. Alternatively, if you just want to find the row the string is on, use this:

types = list(df['A'].apply(lambda x: type(x)))
rows = df.index.tolist()
d = dict(zip(rows, types))
# to get only the rows that have string values in column A
d = {k:v for k,v in d.items() if v == str}
Camo
  • 3
  • 3
PyRsquared
  • 6,970
  • 11
  • 50
  • 86