
I create a table on a Hadoop cluster using PySpark SQL: spark.sql("CREATE TABLE my_table (...) PARTITIONED BY (...) STORED AS Parquet"), and load some data with spark.sql("INSERT INTO my_table SELECT * FROM my_other_table"). However, the resulting files do not seem to be Parquet files: they are missing the ".snappy.parquet" extension.
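
For reference, a minimal sketch of the SQL-based flow; the table and column names (id, value, dt) are hypothetical stand-ins for the elided ones:

    # Sketch of the SQL approach with hypothetical columns id, value and partition column dt
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("""
        CREATE TABLE my_table (id BIGINT, value STRING)
        PARTITIONED BY (dt STRING)
        STORED AS PARQUET
    """)
    # Dynamic-partition inserts may additionally require
    # hive.exec.dynamic.partition.mode=nonstrict, depending on the cluster defaults
    spark.sql("INSERT INTO my_table SELECT * FROM my_other_table")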

The same problem occurs when repeating those steps in Hive.

But, surprisingly, when I create the table using the PySpark DataFrame API: df.write.partitionBy("my_column").saveAsTable(name="my_table", format="Parquet"), everything works just fine.
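
Again for reference, a minimal sketch of the DataFrame-based flow, continuing the sketch above (my_other_table and my_column as used in the question):

    # Sketch of the DataFrame approach: read the source table and write a
    # new partitioned Parquet table through the DataFrame API
    df = spark.table("my_other_table")
    (df.write
       .partitionBy("my_column")
       .format("parquet")
       .saveAsTable("my_table"))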

So, my question is: what's wrong with the SQL way of creating and populating a Parquet table?

Spark version 2.4.5, Hive version 3.1.2.

Update (27 Dec 2022, after @mazaneicha's answer): Unfortunately, there are no parquet-tools on the cluster I'm working with, so the best I could do was to check the contents of the files with hdfs dfs -tail (and -head). In all cases the magic bytes "PAR1" are present both at the beginning and at the end of each file, and so is the metadata with the Parquet version (implementation):

Method                      # of files      Total size  Parquet version                 File name

Hive Insert                 8               34.7 G      Jparquet-mr version 1.10.0      xxxxxx_x
PySpark SQL Insert          8               10.4 G      Iparquet-mr version 1.6.0       part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF insertInto       8               10.9 G      Iparquet-mr version 1.6.0       part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF saveAsTable      8               11.5 G      Jparquet-mr version 1.10.1      part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-c000.snappy.parquet

(To produce the same number of files I used repartition with the DataFrame and DISTRIBUTE BY with SQL; a sketch follows below.)
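
For completeness, a sketch of how the file counts were equalized, continuing the sketches above (8 is the target used in the table; id is a hypothetical column):

    # DataFrame side: force 8 output files per write
    # (column order must match the target table, partition column last)
    df.repartition(8).write.insertInto("my_table")

    # SQL side: DISTRIBUTE BY limits the number of non-empty shuffle
    # partitions, and therefore the number of output files, to at most 8
    spark.sql("""
        INSERT INTO my_table
        SELECT * FROM my_other_table
        DISTRIBUTE BY pmod(hash(id), 8)
    """)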

So, considering the above, it's still not clear:

  1. Why is there no file extension in 3 out of 4 cases?
  2. Why are the files created with Hive so big? (No compression, I suppose.)
  3. Why do the PySpark SQL and PySpark DataFrame writes use different Parquet versions/implementations, and how can I set them explicitly?

1 Answer


The file format is not defined by the extension but rather by the contents. You can quickly check whether the format is Parquet by looking for the magic bytes PAR1 at the very beginning and the very end of a file.
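
A quick sketch of such a check, assuming one of the output files has been copied locally (e.g. with hdfs dfs -get); the file name is hypothetical:

    # Check for the Parquet magic bytes at both ends of a local file copy
    def looks_like_parquet(path):
        with open(path, "rb") as f:
            head = f.read(4)
            f.seek(-4, 2)   # seek to 4 bytes before the end of the file
            tail = f.read(4)
        return head == b"PAR1" and tail == b"PAR1"

    print(looks_like_parquet("some_local_copy"))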

For in-depth format, metadata and consistency checking, try opening a file with parquet-tools.
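
If parquet-tools is not available (as in the update above), pyarrow can read the same footer metadata from a local copy of a file, assuming pyarrow is installed; this is an alternative, not the method from the answer:

    # Inspect Parquet footer metadata with pyarrow instead of parquet-tools
    import pyarrow.parquet as pq

    meta = pq.ParquetFile("some_local_copy").metadata
    print(meta.created_by)        # writer string, e.g. "parquet-mr version 1.10.1"
    print(meta.num_row_groups, meta.num_rows)
    print(meta.row_group(0).column(0).compression)  # codec, e.g. SNAPPY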

Update:
As mentioned in the online docs, Parquet is supported by Spark as one of many data sources via its common DataSource framework, so it doesn't have to rely on Hive: "When reading from Hive metastore Parquet tables and writing to non-partitioned Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance..."
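
A small sketch for checking the settings that govern this behaviour; these are standard Spark SQL configs, though whether they fully explain the differences observed in the question is an assumption:

    # Does Spark use its own Parquet writer instead of Hive SerDe,
    # and which compression codec does its native writer apply?
    print(spark.conf.get("spark.sql.hive.convertMetastoreParquet", "true"))
    print(spark.conf.get("spark.sql.parquet.compression.codec", "snappy"))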

You can find and review this implementation in the Spark git repo (it's open source! :))

mazaneicha
  • Actually, the quote you're referring to is not exactly what I'm talking about. As shown in my test cases, Spark itself behaves a bit differently with DF and SQL, which is confusing. And I can't agree more - the most solid way to get all the answers is to dive into the source code, but it's not always the easiest one :) – Pronator Teres Dec 29 '22 at 12:12
  • I hope by now your question (clearly stated in the subject line) was answered. Adding extra questions into the body is frowned upon, and even punished, by MetaStack "police". So please accept the answer, and create a new question if something is still unclear. – mazaneicha Dec 29 '22 at 14:30