I am reading a dataframe from an Azure Databricks cluster and converting it into a pandas dataframe. Pandas reports the dtype as object for all features instead of int64.

The only solution I have found is to use astype and convert each column individually, but I have 122 columns...

pd_train = df_train.toPandas() 
pd_test = df_test.toPandas()

pd_train.dtypes

pd_train holds the pandas dataframe for the training set and pd_test holds the pandas dataframe for the testing set. Both were created from Spark dataframes (df_train and df_test).

Pandas datatype

Image of spark dataframe

Chandra

I think this post may help; this is a possible duplicate of [pandas: to_numeric for multiple columns](https://stackoverflow.com/questions/36814100/pandas-to-numeric-for-multiple-columns). You could pass all of the columns you want to convert as the `[cols]` list used in that post's code. – MattR Jun 14 '19 at 14:44

1 Answer

Here is one way of doing it.

First, get all of the column names:

# Get column names
columns = pd_train.columns

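Next, use pd.to_numeric with those column names to convert every column to a numeric dtype. A column becomes int64 when all of its values parse as integers. With errors='coerce', any value that cannot be parsed becomes NaN, and a column containing NaN ends up as float64 rather than int64.

# Convert to numeric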
pd_train[columns] = pd_train[columns].apply(pd.to_numeric, errors='coerce')

You could then repeat this process for the pd_test dataframe.
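
To check the result before running this on all 122 columns, here is a minimal self-contained sketch. The small dataframe and its column names (a, b, c) are hypothetical stand-ins for the output of df_train.toPandas(); the only assumption is that the values arrive as strings or Python objects, so pandas reports object.

```python
import pandas as pd

# Hypothetical stand-in for pd_train = df_train.toPandas():
# numeric values stored as Python objects, so every dtype is "object".
pd_train = pd.DataFrame(
    {
        "a": ["1", "2", "3"],
        "b": ["4", "5", "6"],
        "c": ["7", "x", "9"],  # "x" cannot be parsed as a number
    },
    dtype="object",
)

# Convert every column in one call; extra keyword arguments given to
# DataFrame.apply are forwarded to pd.to_numeric.
converted = pd_train.apply(pd.to_numeric, errors="coerce")

# Inspect the resulting dtypes and count the values that were coerced to NaN.
print(converted.dtypes)
print(converted.isna().sum())
```

Columns a and b come out as int64, while c becomes float64 because the unparseable value is replaced by NaN. Checking isna().sum() after the conversion is a cheap way to spot columns where errors='coerce' silently dropped data.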

BKlassen