
I have a Parquet file with 400+ columns. When I read it, the default datatype attached to a lot of the columns is String (possibly due to the schema specified by someone else).

I was not able to find a parameter similar to

inferSchema=True  # present for spark.read.csv, but not for spark.read.parquet

I tried setting

mergeSchema=True  # but it doesn't improve the results

To manually cast all columns to float, I used

from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))

This runs without error, but it converts all of the genuinely string-valued columns to null. I can't wrap it in a try/except block because no error is thrown.
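As a minimal illustration of that silent behaviour (a toy example, not data from the real file): casting a non-numeric string to float just yields null instead of raising an exception.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
demo = spark.createDataFrame([("abc",), ("1.5",)], ["c"])
demo.select(col("c").cast("float").alias("c")).show()
# the "abc" row becomes null, the "1.5" row becomes 1.5 -- no exception is raised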

Is there a way to check whether a column contains only integer/float values and to selectively cast those columns to float?


2 Answers


Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.

Is there a way to check whether a column contains only integer/float values and to selectively cast those columns to float?

You can use the same logic as Spark's CSV reader: define a preferred type hierarchy and attempt to cast each column until you find the most restrictive type that parses every value in the column.
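A minimal sketch of that idea in PySpark (my own illustration; the file path is a placeholder): for every string column, count how many non-null values would turn into null under a float cast, and cast only the columns where nothing is lost.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/file.parquet")  # hypothetical path

string_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]

# One aggregation pass: for each string column, count values that are
# non-null as strings but become null after casting to float.
lost = df.agg(*[
    count(when(col(c).isNotNull() & col(c).cast("float").isNull(), 1)).alias(c)
    for c in string_cols
]).collect()[0].asDict()

castable = [c for c, bad in lost.items() if bad == 0]

# Cast only the columns that pass the check; keep the rest untouched.
df_casted = df.select(*[
    col(c).cast("float").alias(c) if c in castable else col(c)
    for c in df.columns
])

Because the check is a single aggregation, it stays one job over the data even with 400+ columns, rather than one job per column.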


There's no easy way to do this currently. There's an existing GitHub issue that can be referred to:

https://github.com/databricks/spark-csv/issues/264

Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala exists for Scala; an equivalent could be created for PySpark.
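As a rough sketch of what such a step could look like in plain Python (my own analogue of the tryParse cascade in CSVInferSchema, not an existing PySpark API): walk a small type hierarchy over a sample of a column's values and keep the narrowest type that accepts all of them.

def infer_type(values):
    """Return 'int', 'float' or 'string' for a sample of column values."""
    def parses(value, caster):
        try:
            caster(value)
            return True
        except (TypeError, ValueError):
            return False

    non_null = [v for v in values if v is not None]
    if all(parses(v, int) for v in non_null):
        return "int"
    if all(parses(v, float) for v in non_null):
        return "float"
    return "string"

# Hypothetical usage on a small per-column sample:
# sample = [row[c] for row in df.select(c).limit(1000).collect()]
# print(c, infer_type(sample))

Note that inferring from a sample can misclassify a column if the non-numeric values only appear later in the data, which is why the cast-and-check approach in the other answer scans every value.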
