Let's go step-by-step.
In the case of df.write.saveAsTable("mytable"), the table is actually written to storage (HDFS/S3) and registered in the metastore. It is a Spark action.
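For illustration, a minimal PySpark sketch (the table name, format, and toy data are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.range(10)   # toy DataFrame standing in for the real one

    # saveAsTable triggers the job: data files are written to the warehouse
    # location (HDFS/S3) and the table is registered in the metastore.
    df.write.mode("overwrite").format("parquet").saveAsTable("mytable")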
On the other hand, df.createOrReplaceTempView("my_temp_table") is lazy: it merely registers a name for the DAG (logical plan) of df. Nothing is actually stored in memory or on disk.
spark.sql("drop table if exists " + my_temp_table)
drops the table.
spark.sql("create table mytable as select * from my_temp_table")
creates mytable
on storage. createOrReplaceTempView
creates tables in global_temp
database.
It would be best to modify the query to:
create table mytable as select * from global_temp.my_temp_table
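A sketch of the difference, continuing the PySpark example above (the table and view names are placeholders):

    # Session-scoped view: visible only within this SparkSession; nothing is written.
    df.createOrReplaceTempView("my_temp_table")
    spark.sql("create table mytable_from_view as select * from my_temp_table")

    # Global temp view: registered in the global_temp database, visible to other
    # sessions of the same application, and must be qualified with global_temp.
    df.createGlobalTempView("my_global_temp_table")
    spark.sql("create table mytable_from_global as select * from global_temp.my_global_temp_table")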
createOrReplaceTempView only registers the DataFrame (already in memory) so it can be accessed through Hive queries, without actually persisting it, is that correct?
Yes, for large DAGs Spark will automatically cache data depending on the spark.memory.fraction setting. Check this page.
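If you want to adjust that setting, it has to be supplied at application startup rather than changed at runtime; a sketch (0.6 is just the documented default, not a recommendation):

    from pyspark.sql import SparkSession

    # spark.memory.fraction controls the share of heap used for execution and
    # storage (caching). Set it before the session is created.
    spark = (SparkSession.builder
             .config("spark.memory.fraction", "0.6")   # placeholder: the default value
             .getOrCreate())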
I have to join hundreds of tables and I hit an OutOfMemory issue. In terms of efficiency, what would be the best way?

1. df.persist() and df.join(..).join(..).join(..)... # hundreds of joins
2. createOrReplaceTempView, then join with spark.sql()
3. saveAsTable (? not sure of the next step)
4. Write to disk with create table, then join with spark.sql()?
persist would store some data in cached form, depending on the available memory, and for an end table that is generated by joining hundreds of tables, this is probably not the best approach.
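If you do go the persist route, an explicit storage level that can spill to disk keeps memory pressure down; a sketch, assuming the SparkSession and DataFrame from the earlier example:

    from pyspark import StorageLevel

    # MEMORY_AND_DISK caches the partitions that fit in memory and spills the
    # rest to disk, so the cached data does not have to fit entirely in RAM.
    df = df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()   # an action to materialize the cache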
It is not possible to say which approach will work best for you, but here are some general patterns:
If the writes fail with OOM and the default spark.sql.shuffle.partitions value is used, then the starting point is to increase the shuffle partition count so that each partition is sized appropriately for the memory available to its executor.
The spark.sql.shuffle.partitions setting can be changed between different joins; it doesn't need to be constant across the Spark job.
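A sketch of tuning this between stages (the partition counts, DataFrames, and paths are hypothetical placeholders):

    from pyspark.sql import functions as F

    # Hypothetical toy DataFrames standing in for the real large tables.
    big_a = spark.range(1_000_000).withColumn("a", F.rand())
    big_b = spark.range(1_000_000).withColumn("b", F.rand())

    # The setting is read when a query actually executes, so change it between
    # actions (e.g. intermediate writes), not between purely lazy transformations.
    spark.conf.set("spark.sql.shuffle.partitions", "2000")   # for the joins of very large tables
    big_a.join(big_b, "id").write.mode("overwrite").parquet("/tmp/wide_stage")  # placeholder path

    spark.conf.set("spark.sql.shuffle.partitions", "200")    # smaller value for later, smaller joins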
Calculating partition sizes becomes difficult when multiple tables are involved. In that case, writing intermediate results to disk and reading them back before joining the large tables is a good idea.
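Continuing the hypothetical sketch above, the remaining joins then start from a fresh on-disk dataset instead of a lineage of hundreds of joins:

    small_c = spark.range(1_000).withColumn("c", F.rand())   # hypothetical small table

    # Re-read the materialized stage and continue joining from there.
    wide = spark.read.parquet("/tmp/wide_stage")
    wide.join(small_c, "id").write.mode("overwrite").saveAsTable("final_table")  # placeholder name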
For small tables (less than 2 GB), broadcasting is a possibility. The default limit (spark.sql.autoBroadcastJoinThreshold) is 10 MB, but it can be increased.
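A sketch (the threshold value and DataFrame names are illustrative):

    from pyspark.sql.functions import broadcast

    # Raise the automatic broadcast threshold from the 10 MB default (value in bytes)...
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(200 * 1024 * 1024))

    # ...or explicitly hint that a specific small DataFrame should be broadcast,
    # regardless of the threshold.
    big_df = spark.range(10_000_000)   # hypothetical large table
    small_df = spark.range(100)        # hypothetical small table
    joined = big_df.join(broadcast(small_df), "id")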
It is best to store the final table on disk rather than serve Thrift clients through temp tables.
Good luck!