I create a table on a Hadoop cluster using PySpark SQL:
spark.sql("CREATE TABLE my_table (...) PARTITIONED BY (...) STORED AS Parquet")
and load some data with:
spark.sql("INSERT INTO my_table SELECT * FROM my_other_table")
However, the resulting files do not seem to be Parquet files: they are missing the ".snappy.parquet" extension.
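For reference, a stripped-down, runnable version of that SQL path is below; the schema and partition column are placeholders (my real DDL is longer), and the dynamic-partition settings are an assumption about what a partitioned INSERT ... SELECT may require on a given cluster:

```python
# Minimal sketch of the SQL path; the schema and partition column are
# placeholders, and the two SET statements are an assumption (they may or
# may not be needed, depending on cluster defaults).
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE my_table (id BIGINT, value STRING)
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
""")

# Dynamic partitioning is typically needed for a partitioned INSERT ... SELECT
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("INSERT INTO my_table SELECT * FROM my_other_table")
```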
The same problem occurs when repeating those steps in Hive.
But surprisingly, when I create the table with the PySpark DataFrame API:
df.write.partitionBy("my_column").saveAsTable(name="my_table", format="Parquet")
everything works just fine.
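For comparison, a sketch of the working DataFrame path with the same placeholder names; it is equivalent to the saveAsTable call above:

```python
# DataFrame path that does produce *.snappy.parquet files (placeholder names).
df = spark.table("my_other_table")   # source data, placeholder
(df.write
   .format("parquet")                # same as format="Parquet" above
   .partitionBy("my_column")
   .saveAsTable("my_table"))
```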
So, my question is: what's wrong with the SQL way of creating and populating a Parquet table?
Spark version 2.4.5, Hive version 3.1.2.
Update (27 Dec 2022, after @mazaneicha's answer): Unfortunately, there is no parquet-tools on the cluster I'm working with, so the best I could do was to check the content of the files with hdfs dfs -tail (and -head); a sketch of that check is included after the table below. In all cases there is "PAR1" both at the beginning and at the end of each file, and the footer even shows the Parquet version (implementation):
| Method | # of files | Total size | Parquet version | File name |
| --- | --- | --- | --- | --- |
| Hive Insert | 8 | 34.7 G | Jparquet-mr version 1.10.0 | xxxxxx_x |
| PySpark SQL Insert | 8 | 10.4 G | Iparquet-mr version 1.6.0 | part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000 |
| PySpark DF insertInto | 8 | 10.9 G | Iparquet-mr version 1.6.0 | part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000 |
| PySpark DF saveAsTable | 8 | 11.5 G | Jparquet-mr version 1.10.1 | part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-c000.snappy.parquet |
(To create the same number of files, I used repartition with the DataFrame and DISTRIBUTE BY with SQL.)
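This is roughly how I did the check: a small wrapper around the same hdfs dfs -head / -tail calls (the path at the bottom is a placeholder; -head and -tail only show about 1 KB of the file, which is enough to see the magic bytes and the writer string):

```python
# Check the Parquet magic bytes and writer string without parquet-tools.
# A valid Parquet file starts and ends with the 4-byte magic "PAR1", and its
# footer contains a created_by string such as "parquet-mr version 1.10.0".
import subprocess

def peek(path):
    head = subprocess.run(["hdfs", "dfs", "-head", path],
                          capture_output=True, check=True).stdout
    tail = subprocess.run(["hdfs", "dfs", "-tail", path],
                          capture_output=True, check=True).stdout
    print(path)
    print("  starts with PAR1:", head[:4] == b"PAR1")
    print("  ends with PAR1:  ", tail[-4:] == b"PAR1")
    print("  footer bytes:    ", tail[-200:])   # writer version shows up here

peek("/warehouse/my_db.db/my_table/part-00000")   # placeholder path
```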
So, considering the above, it is still not clear:
- Why is there no file extension in 3 out of 4 cases?
- Why are the files created with Hive so big? (No compression, I suppose.)
- Why do the PySpark SQL and PySpark DataFrame paths use different versions/implementations of parquet-mr, and how can I set them explicitly?