
I'm converting pyspark data frames to pandas data frames using toPandas(). However, because some data types don't line up, pandas casts certain columns in the data frame, such as decimal fields, to object.

I'd like to run .str on the columns that contain actual strings, but I can't seem to get it to work without first explicitly identifying which columns to convert.

I run into:

AttributeError: Can only use .str accessor with string values!

I've tried df.fillna(0) and df.infer_objects(), to no avail. I can't seem to get the objects to register as int64 or float64, so I can't do:

for col in df.columns:
    if df[col].dtype == object:  # np.object was removed in NumPy 1.24; use the builtin object
        # insert logic here

beforehand.

I also cannot use .str.contains: even though the columns with numeric values have dtype object, calling .str on them raises the error above. (For reference, what I'm trying to do is run str.split() on a column only if it actually holds string values.)

Any ideas?

Note: I am curious for an answer on the pandas side, without having to explicitly identify which columns actually have strings beforehand. One possible solution is to get the list of columns of strings on the pyspark side, and pass those as the columns to run .str methods on.

I also tried astype(str) but it won't work because some objects are arrays. I.e. if I wanted to split on _, and I had an array like ['Red_Apple', 'Orange'] in a column, doing astype(str).str.split on this column would return ['Red', 'Apple', 'Orange'], which doesn't make sense. I only want to split string columns, not turn arrays into strings and split them too.

accdias
L. Chu
  • Not sure what you're trying to do here, but why don't you just cast the data to string with str()? – Capie Jun 22 '20 at 20:54
  • @L. Chu What did you edit? – Red Jun 22 '20 at 21:00
  • @AnnZen Sorry looks like the edit didn't go through, it is in response to CalebCourtney's answer. – L. Chu Jun 24 '20 at 19:59
  • Stuff like this saddens me with python all the time. The definitive way to convert non-us locale floats from csv files (some countries use ',' as decimal sep) to something that can be interpreted as floats by the dataframe is to use str.replace(). Now you can't really tell beforehand if a column is string or already float, because it depends on the contents: Integers are processed just fine (no decimal) but anything else is still a string. So the only applicable answer is to just convert everything back to string, replace, then convert back? – antipattern Jul 20 '22 at 08:35

3 Answers


You can use isinstance():

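var = 'hello world'

# isinstance() is True only when the value itself is a Python str
if isinstance(var, str):
    print(var.split())  # do something string-specific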
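
Applied to a whole data frame, the same check could look like the sketch below. It treats a column as a string column only when every non-null value is a str, so Decimal values and arrays are left alone. The helper name is_string_column and the sample data are made up for illustration; they are not from the question.

```python
from decimal import Decimal

import pandas as pd


def is_string_column(series: pd.Series) -> bool:
    """Return True if the column has at least one non-null value and all of them are str."""
    non_null = series.dropna()
    return len(non_null) > 0 and all(isinstance(v, str) for v in non_null)


# Hypothetical frame resembling what toPandas() might hand back
df = pd.DataFrame({
    "name": ["Red_Apple", "Orange", None],
    "price": [Decimal("1.50"), Decimal("2.00"), Decimal("0.75")],
    "tags": [["Red_Apple", "Orange"], ["Pear"], ["Plum"]],
})

string_cols = [col for col in df.columns if is_string_column(df[col])]

# Split only the columns that really hold strings
for col in string_cols:
    df[col] = df[col].str.split("_")
```

This scans every value, so it can be slow on very large frames; checking only a sample of rows would be faster but could misclassify mixed columns.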
Red

A couple of ideas here:

  1. Convert the column to string anyways using astype: df[col_name].astype(str).str.split().
  2. Check the column types with df.dtypes (an attribute, not a method), and only run str.split() on columns whose dtype is already object.

Which one you implement is up to you, but if you want to treat the column as a str anyway, I would go with option 1.
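
If every column comes back as object, the dtype alone can't tell them apart. One possible refinement of option 2 is to inspect the values with pandas.api.types.infer_dtype, which reports "string" only when the non-null values are all strings. This is a sketch under that assumption; the sample data is invented for illustration.

```python
from decimal import Decimal

import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical frame: a string column, a Decimal column, and a column of arrays
df = pd.DataFrame({
    "fruit": ["Red_Apple", "Orange"],
    "price": [Decimal("1.50"), Decimal("2.00")],
    "tags": [["Red_Apple", "Orange"], ["Pear"]],
})

# infer_dtype looks at the values themselves rather than the column dtype
kinds = {col: infer_dtype(df[col], skipna=True) for col in df.columns}

# Keep only the columns whose values were all inferred as strings
string_cols = [col for col, kind in kinds.items() if kind == "string"]

for col in string_cols:
    df[col] = df[col].str.split("_")
```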

Caleb Courtney
  • For 1: I added an edit shortly after you answered: astype won't work due to some objects being arrays. I.e. if I wanted to split on "_" and I had an array [Red_Apple, Orange], doing astype(str).str.split on this column would return [Red_, Apple, Orange] , which doesn't make sense. For 2. In my case, all dtypes are objects in the dataframe upon converting from Pyspark to pandas df, so this wouldn't work either. – L. Chu Jun 22 '20 at 21:13

Hope I got you right. You can use .select_dtypes:

import pandas as pd

df = pd.DataFrame({'A': ['9', '3', '7'], 'b': ['11.0', '8.0', '9'], 'c': [2, 5, 9]})  # sample DataFrame
print(df.dtypes)  # check the dtypes


A    object
b    object
c     int64
dtype: object

df2 = df.select_dtypes(include='object')  # isolate object dtype columns
df3 = df.select_dtypes(exclude='object')  # isolate non-object dtype columns
df2 = df2.astype('float')  # convert object columns to float
res = df3.join(df2)  # rejoin the dataframes
res.dtypes  # recheck the dtypes

c      int64
A    float64
b    float64
dtype: object
wwnde
  • This won't work when there is string column that can't be cast to float. ex: 'd': ['Apple', 'Pear', 'Tree'] will error out on .astype('float') . – L. Chu Jun 24 '20 at 19:55
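
One way to handle the case raised in that comment is to attempt the conversion with pd.to_numeric(errors="coerce") and keep the result only when no value failed to parse. The sketch below assumes the object columns hold numeric strings or plain text, as in the answer's example plus the 'd' column from the comment; it has not been checked against Decimal or array columns.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "A": ["9", "3", "7"],
    "b": ["11.0", "8.0", "9"],
    "c": [2, 5, 9],
    "d": ["Apple", "Pear", "Tree"],  # column from the comment; cannot be cast to float
})

for col in df.columns:
    if is_numeric_dtype(df[col]):
        continue  # already numeric, nothing to do
    # Unparseable values become NaN instead of raising an error
    converted = pd.to_numeric(df[col], errors="coerce")
    # Accept the conversion only if it introduced no new missing values
    if converted.notna().sum() == df[col].notna().sum():
        df[col] = converted

# Columns that are still non-numeric are the ones holding real text
text_cols = [col for col in df.columns if not is_numeric_dtype(df[col])]
```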