
Just wondering if anyone is aware of this warning:

18/01/10 19:52:56 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance

I've seen this a lot when trying to load a big DataFrame with many partitions from S3 into Spark.

It never really causes any issues for the job; I just wonder what that config property is used for and how to tune it properly.

Thanks

seiya

1 Answer


In answer to your question, this is a Spark/Hive-specific config property which, when nonzero, enables caching of partition file metadata in memory. All tables share a cache that can use up to the specified number of bytes for file metadata. This conf only has an effect when Hive filesource partition management is enabled.

In the Spark source code it is defined as follows. The default size is 250 * 1024 * 1024 bytes (250 MB), which you can override via the SparkConf object in your code or via the spark-submit command.

Spark Source Code

val HIVE_FILESOURCE_PARTITION_FILE_CACHE_SIZE =
    buildConf("spark.sql.hive.filesourcePartitionFileCacheSize")
      .doc("When nonzero, enable caching of partition file metadata in memory. All tables share " +
           "a cache that can use up to specified num bytes for file metadata. This conf only " +
           "has an effect when hive filesource partition management is enabled.")
      .longConf
      .createWithDefault(250 * 1024 * 1024)
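
If the eviction warning shows up frequently, you can raise the cache size. Below is a minimal sketch, assuming a Hive-enabled SparkSession and an illustrative 1 GB value (the property is specified in bytes; the app name is hypothetical):

import org.apache.spark.sql.SparkSession

// Raise the partition-file-metadata cache from the 250 MB default to 1 GB.
// The value is in bytes; 1 GB here is only an illustrative choice.
val spark = SparkSession.builder()
  .appName("partition-cache-tuning")  // hypothetical app name
  .config("spark.sql.hive.filesourcePartitionFileCacheSize", 1024L * 1024 * 1024)
  .enableHiveSupport()  // the setting only matters when Hive filesource partition management is enabled
  .getOrCreate()

The same value can be passed on the command line with spark-submit, e.g. --conf spark.sql.hive.filesourcePartitionFileCacheSize=1073741824. Setting the property to 0 disables the cache.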
Gourav Dutta
  • Thanks, @Gourav Dutta, for the quick answer. Now my question is: is there any way to make loading the partitions faster? The DataFrame data on S3 has many partitions (it's partitioned by year, month, day and hour), so the loading process takes over 20 minutes. – seiya Jan 12 '18 at 17:19
  • @seiya To load highly partitioned data more quickly, try creating a metadata store using an AWS Glue Crawler. – Nick Jan 23 '18 at 12:59
  • Thanks @Nick for the suggestion, an AWS Glue Crawler is definitely worth looking at. Will try it and see how it goes. – seiya Jan 23 '18 at 20:37
  • @seiya In order to make it faster, use the fastest method you have to get the filenames from S3 and then create a DataFrame from that file list (see the sketch below). – Alex B Sep 06 '20 at 00:53
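
Below is a minimal sketch of the file-list approach from the last comment, assuming the data is Parquet under a hypothetical s3a://my-bucket/events/ prefix and that spark is an s3a-configured SparkSession; it lists the leaf files with the Hadoop FileSystem API and passes the explicit paths to the reader instead of relying on Spark's partition discovery.

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer

// Hypothetical location; replace with your own bucket and prefix.
val root = "s3a://my-bucket/events/"

// List all Parquet files under the prefix with the Hadoop FileSystem API.
val fs = FileSystem.get(new URI(root), spark.sparkContext.hadoopConfiguration)
val files = ArrayBuffer[String]()
val it = fs.listFiles(new Path(root), /* recursive = */ true)
while (it.hasNext) {
  val status = it.next()
  if (status.getPath.getName.endsWith(".parquet")) files += status.getPath.toString
}

// Build the DataFrame from the explicit file list; basePath keeps the
// year=/month=/day=/hour= directories available as partition columns.
val df = spark.read
  .option("basePath", root)
  .parquet(files.toSeq: _*)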