
I have a Parquet file with 400+ columns. When I read it, the default datatype attached to a lot of the columns is String (possibly due to the schema specified by someone else).

I was not able to find a parameter similar to

inferSchema=True  # present for spark.read.csv, but not for spark.read.parquet

I tried setting

mergeSchema=True  # but it doesn't improve the results

To manually cast all columns to float, I used

from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))

This runs without error, but it converts all of the genuinely string-valued columns to null. I can't wrap it in a try/except block because no error is thrown.
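As a minimal illustration of that silent behaviour (a toy example, not data from the real file): casting a non-numeric string to float just yields null instead of raising an exception.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
demo = spark.createDataFrame([("abc",), ("1.5",)], ["c"])
demo.select(col("c").cast("float").alias("c")).show()
# the "abc" row becomes null, the "1.5" row becomes 1.5 -- no exception is raised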

Is there a way to check whether a column contains only integer/float values and to selectively cast those columns to float?


2 Answers


Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.

Is there a way to check whether a column contains only integer/float values and to selectively cast those columns to float?

You can use the same logic as Spark's CSV reader: define a preferred type hierarchy and attempt to cast each column until you find the most restrictive type that parses every value in the column.
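A minimal sketch of that idea in PySpark (my own illustration; the file path is a placeholder): for every string column, count how many non-null values would turn into null under a float cast, and cast only the columns where nothing is lost.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/file.parquet")  # hypothetical path

string_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]

# One aggregation pass: for each string column, count values that are
# non-null as strings but become null after casting to float.
lost = df.agg(*[
    count(when(col(c).isNotNull() & col(c).cast("float").isNull(), 1)).alias(c)
    for c in string_cols
]).collect()[0].asDict()

castable = [c for c, bad in lost.items() if bad == 0]

# Cast only the columns that pass the check; keep the rest untouched.
df_casted = df.select(*[
    col(c).cast("float").alias(c) if c in castable else col(c)
    for c in df.columns
])

Because the check is a single aggregation, it stays one job over the data even with 400+ columns, rather than one job per column.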


There's no easy way to do this currently. There's an existing GitHub issue that can be referred to:

https://github.com/databricks/spark-csv/issues/264

Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala exists for Scala; an equivalent could be created for PySpark.
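As a rough sketch of what such a step could look like in plain Python (my own analogue of the tryParse cascade in CSVInferSchema, not an existing PySpark API): walk a small type hierarchy over a sample of a column's values and keep the narrowest type that accepts all of them.

def infer_type(values):
    """Return 'int', 'float' or 'string' for a sample of column values."""
    def parses(value, caster):
        try:
            caster(value)
            return True
        except (TypeError, ValueError):
            return False

    non_null = [v for v in values if v is not None]
    if all(parses(v, int) for v in non_null):
        return "int"
    if all(parses(v, float) for v in non_null):
        return "float"
    return "string"

# Hypothetical usage on a small per-column sample:
# sample = [row[c] for row in df.select(c).limit(1000).collect()]
# print(c, infer_type(sample))

Note that inferring from a sample can misclassify a column if the non-numeric values only appear later in the data, which is why the cast-and-check approach in the other answer scans every value.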
