I'm converting PySpark data frames to pandas data frames using `toPandas()`. However, because some data types don't line up, pandas casts certain columns in the data frame, such as decimal fields, to `object`.
I'd like to run `.str` methods on the columns that actually hold strings, but can't seem to get it to work (without explicitly finding which columns to convert first). I run into:

    AttributeError: Can only use .str accessor with string values!
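For reference, a minimal sketch of the situation (the column names and values are made up, but the dtypes match what I see after `toPandas()`):

    import pandas as pd
    from decimal import Decimal

    # Stand-in for the result of toPandas(): decimal fields land in object columns
    df = pd.DataFrame({
        "price": [Decimal("1.50"), Decimal("2.75")],  # object dtype, not float64
        "label": ["Red_Apple", "Orange"],             # object dtype, real strings
    })
    print(df.dtypes)  # both columns report object

    df["price"].str.split("_")  # AttributeError: Can only use .str accessor with string values!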
I've tried `df.fillna(0)` and `df.infer_objects()`, to no avail. I can't seem to get the objects to register as `int64` or `float64`, so I can't do:

    for col in df.columns:
        if df[col].dtype == object:
            # insert logic here

beforehand.
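Concretely (continuing the made-up example above), this is what those attempts look like, and the dtypes come back unchanged:

    # Neither call coerces the decimal columns to a numeric dtype
    df = df.fillna(0)
    df = df.infer_objects()
    print(df.dtypes)  # the decimal columns still report object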
I also can't use `.str.contains`, because even though the columns with numeric values are dtype `object`, using `.str` on them errors out just the same. (For reference, what I'm trying to do is: if a column in the data frame actually has string values, do a `str.split()` on it.)
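In other words, the shape of what I'm after is something like this, where `is_actually_string` is a hypothetical placeholder for exactly the test I can't figure out how to write:

    def is_actually_string(series):
        # Hypothetical placeholder -- this test is the missing piece
        ...

    for col in df.columns:
        if is_actually_string(df[col]):
            df[col] = df[col].str.split("_")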
Any ideas?
Note: I am curious about an answer on the pandas side, without having to explicitly identify which columns actually have strings beforehand. One possible solution is to get the list of string columns on the PySpark side and pass those as the columns to run `.str` methods on, as in the sketch below.
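Something like this, assuming the Spark data frame is called `spark_df` (the name and the `_` delimiter are just for illustration):

    from pyspark.sql.types import StringType

    # Read the string columns off the Spark schema before converting
    string_cols = [f.name for f in spark_df.schema.fields
                   if isinstance(f.dataType, StringType)]

    pdf = spark_df.toPandas()
    for col in string_cols:
        pdf[col] = pdf[col].str.split("_")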
I also tried `astype(str)`, but it won't work because some objects are arrays. I.e. if I wanted to split on `_` and I had an array like `['Red_Apple', 'Orange']` in a column, doing `astype(str).str.split` on that column would stringify the whole array and then split it, producing something like `['Red', 'Apple', 'Orange']`, which doesn't make sense. I only want to split string columns, not turn arrays into strings and split those too.
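To make that concrete with a made-up single-column example:

    import pandas as pd

    s = pd.Series([["Red_Apple", "Orange"]])  # one cell holding a Python list
    print(s.astype(str).str.split("_"))
    # astype(str) turns the cell into the string "['Red_Apple', 'Orange']",
    # so the split produces bracket-and-quote fragments rather than real tokens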