I created a Spark application that computes a lot of new columns (aggregations like sum, avg, even UDFs) over a Window. I can't easily paste its code here because it is huge.
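
For illustration only, the pipeline looks roughly like this minimal sketch (the column names key, ts, value, the UDF and the input DataFrame df are hypothetical placeholders; the real job defines far more derived columns):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  // Hypothetical window spec; the real application aggregates over it many times
  val w = Window.partitionBy("key").orderBy("ts")
  val myUdf = udf((x: Double) => x * 2.0)

  val result = df
    .withColumn("value_sum", sum("value").over(w))
    .withColumn("value_avg", avg("value").over(w))
    .withColumn("value_derived", myUdf(col("value_sum")))
    // ... dozens more columns in the same style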

I noticed a strange behavior: with all the processing enabled, I get a lot of log lines like:

INFO CodeGenerator: Code generated in 10.686188 ms

When I test locally, I get about 50 of these log lines in between the usual task logs:

19/10/25 09:22:33 INFO Executor: Finished task 35.0 in stage 3.0 (TID 232). 9013 bytes result sent to driver
19/10/25 09:22:33 INFO TaskSetManager: Starting task 36.0 in stage 3.0 (TID 233, localhost, executor driver, partition 36, PROCESS_LOCAL, 7743 bytes)
19/10/25 09:22:33 INFO Executor: Running task 36.0 in stage 3.0 (TID 233)
19/10/25 09:22:33 INFO TaskSetManager: Finished task 35.0 in stage 3.0 (TID 232) in 1404 ms on localhost (executor driver) (31/200)
19/10/25 09:22:33 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 200 blocks
19/10/25 09:22:33 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

If I remove some of the processing (I tested many cases; it isn't tied to any particular aggregation or UDF), these logs disappear. I even found a borderline case where disabling a single sum() removes these logs and the job runs about 2x faster.

I could also make these logs disappear by saving to a temporary Parquet file between some of the processing steps.
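
For reference, the intermediate save meant here is just a write followed by a re-read (a sketch; the path, the DataFrame intermediateDf and the SparkSession spark are assumptions), which cuts the lineage so the remaining processing starts from a fresh, much smaller plan:

  // Materialize the intermediate result, then continue from the re-read DataFrame
  intermediateDf.write.mode("overwrite").parquet("/tmp/intermediate_step")
  val resumedDf = spark.read.parquet("/tmp/intermediate_step")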

My feeling is that, with a huge number of columns or aggregations, the execution plan runs into some kind of problem and regenerates code at each stage, where this could be avoided. This is a real performance issue for me.
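
One way to see how large the plan actually gets is to print it (a sketch, assuming the final DataFrame is called result):

  // Prints the parsed, analyzed, optimized and physical plans; a very long physical
  // plan is a hint that whole-stage code generation has a lot of code to produce
  result.explain(true)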

Does this sound plausible? How can I avoid it?

Rolintocour

1 Answer

I am sure that with this you will not get any of these logs:

  import org.apache.log4j.{Level, Logger}

  // Silence Spark's and Akka's internal logging
  Logger.getLogger("org").setLevel(Level.OFF)
  Logger.getLogger("akka").setLevel(Level.OFF)
  // Turning off the root logger is more reliable than getLogger("INFO"),
  // which only targets a logger literally named "INFO"
  Logger.getRootLogger.setLevel(Level.OFF)
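
The log level can also be lowered at runtime through Spark itself (a minimal sketch, assuming an existing SparkSession named spark):

  // "WARN" hides the INFO chatter while keeping warnings and errors
  spark.sparkContext.setLogLevel("WARN")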
chlebek
  • I already tried setting the log level to WARN; the logs are hidden but the performance issue is still there. So I really think this is related to my data and to Spark code generation. – Rolintocour Oct 25 '19 at 09:00
  • In case it helps: I improved test performance using `.config("spark.sql.shuffle.partitions", "1")` and it removed all these logs. – Rolintocour Nov 18 '19 at 10:02
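
A minimal sketch of applying that setting when building a local test session (the app name is hypothetical):

  import org.apache.spark.sql.SparkSession

  // With a single shuffle partition the local test runs far fewer tiny tasks;
  // per the comment above, this also removed the repeated codegen log lines
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("window-aggregation-test")
    .config("spark.sql.shuffle.partitions", "1")
    .getOrCreate()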