I import some data from a parquet file into a DataFrame and want to check the data types. One of the data types I expect is strings. To do this, I have something like the following:

import pandas as pd
col = pd.Series([None, 'b', 'c', None, 'e'])
assert((col.dtype == object) and (isinstance(col[0], str)))

But, as you can see, this does not work if I accidentally have a None value at the beginning.

Does anybody have an idea how to do that efficiently (preferably without having to check each element of the series)?

Raphael

3 Answers

You can use first_valid_index to retrieve and check the first non-NA item:

isinstance(col.loc[col.first_valid_index()], str)
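
A minimal sketch of this spot-check wrapped in a helper. The name first_valid_is_str is hypothetical, and returning False for a column with no valid values is an assumption, not something stated above. It looks the value up with .loc because first_valid_index returns an index label (or None when every value is NA), not a position:

```python
import pandas as pd


def first_valid_is_str(col: pd.Series) -> bool:
    """Spot-check only: is the first non-NA value a str? (hypothetical helper)"""
    idx = col.first_valid_index()  # index label of the first non-NA value, or None
    if idx is None:
        # Assumption: an all-NA column cannot be confirmed as a string column.
        return False
    return isinstance(col.loc[idx], str)


col = pd.Series([None, 'b', 'c', None, 'e'])
print(first_valid_is_str(col))
```

This inspects a single element, so it stays cheap on long series, but it cannot detect a non-string value later in the column.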
Stef

As of Pandas 1.0.0 there's a StringDtype, which you can use to check if the pd.Series contains only either NaN or string values:

try:
    col.astype('string')
except ValueError as e:
    raise e

If you try with a column containing an int:

col = pd.Series([None, 2, 'c', None, 'e'])

try:
    col.astype('string')
except ValueError as e:
    raise e

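On pandas 1.0.x you'd get a ValueError (from pandas 1.1 onward, astype('string') converts non-string values to strings instead of raising):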

ValueError: StringArray requires a sequence of strings or pandas.NA
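
For a check over the whole column that does not convert anything, one alternative (not part of the approach above) is pandas.api.types.infer_dtype, which scans the values and reports the inferred kind. This is a sketch under the assumption that missing values should be ignored; the helper name is_all_str is hypothetical:

```python
import pandas as pd
from pandas.api.types import infer_dtype


def is_all_str(col: pd.Series) -> bool:
    """True if every non-missing value is a str (hypothetical helper)."""
    # skipna=True makes the inference ignore None/NaN entries.
    return infer_dtype(col, skipna=True) == "string"


strings_only = pd.Series([None, 'b', 'c', None, 'e'])
mixed = pd.Series([None, 2, 'c', None, 'e'])

print(is_all_str(strings_only))
print(is_all_str(mixed))
```

The scan still visits the values, but it runs in compiled code inside pandas rather than in a Python-level loop.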

yatu
  • I like that solution, but since my Series is quite long, I'd go with the spot-check proposed by @Stef – Raphael Jun 15 '20 at 16:12
  • That depends on the purpose. If you have an object dtype and want to ensure the entire column contains no numerical data, I'd go with this one. If that is not necessary, then Stef's approach should do @raphael – yatu Jun 15 '20 at 17:25

You can convert all values of the series to str as follows:

col = col.astype(str)

None values will become string values.

Narendra Prasath