I created a Spark application that computes a lot of new columns (aggregations like sum, avg, and even UDFs) using a Window. I can't easily copy-paste its code here because it is huge.
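A simplified sketch of what the processing looks like (toy data and hypothetical column names, not my real code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("window-aggs").getOrCreate()
import spark.implicits._

// Toy input standing in for the real data
val df = Seq(
  ("k1", 1L, 1.0, 2.0, 3.0),
  ("k1", 2L, 4.0, 5.0, 6.0),
  ("k2", 1L, 7.0, 8.0, 9.0)
).toDF("key", "ts", "a", "b", "c")

val w = Window.partitionBy("key").orderBy("ts")

// A stand-in for the real UDFs
val square = udf((x: Double) => x * x)

// The real job chains dozens of derived columns like these
val result = df
  .withColumn("sum_a", sum($"a").over(w))
  .withColumn("avg_b", avg($"b").over(w))
  .withColumn("sq_c", square($"c"))
```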
I noticed a strange behavior: with all the processing enabled, I get a lot of logs like:
INFO CodeGenerator: Code generated in 10.686188 ms
When I test locally, I get about 50 of these logs in between the usual task logs:
19/10/25 09:22:33 INFO Executor: Finished task 35.0 in stage 3.0 (TID 232). 9013 bytes result sent to driver
19/10/25 09:22:33 INFO TaskSetManager: Starting task 36.0 in stage 3.0 (TID 233, localhost, executor driver, partition 36, PROCESS_LOCAL, 7743 bytes)
19/10/25 09:22:33 INFO Executor: Running task 36.0 in stage 3.0 (TID 233)
19/10/25 09:22:33 INFO TaskSetManager: Finished task 35.0 in stage 3.0 (TID 232) in 1404 ms on localhost (executor driver) (31/200)
19/10/25 09:22:33 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 200 blocks
19/10/25 09:22:33 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
If I remove some of the processing (I tested many cases; it isn't one particular aggregation or UDF), these logs disappear. I even found a borderline case where disabling a single sum() removed these logs and made the job about 2x faster.
I can also make these logs disappear by saving to a temporary parquet file between some of the processing steps.
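The workaround looks roughly like this (the path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("parquet-break").getOrCreate()

// Stand-in for the output of the first half of the processing
val intermediate = spark.range(100).toDF("id")

// Writing out and reading back truncates the lineage, so the second half
// of the job starts from a short, fresh plan instead of the accumulated one
intermediate.write.mode("overwrite").parquet("/tmp/intermediate")
val resumed = spark.read.parquet("/tmp/intermediate")
// ... remaining processing continues on `resumed` ...
```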
My feeling is that, with a huge number of columns or aggregations, the execution plan hits some kind of problem and regenerates code at each stage, where this could be avoided. This is a real performance issue for me.
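A toy example that seems to reproduce the suspected plan growth (hypothetical, not my real code); explain(true) prints the parsed, analyzed, optimized, and physical plans, and they grow with each chained withColumn:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("plan-growth").getOrCreate()
val base = spark.range(10).toDF("value")

// Chain many derived columns; the physical plan (and the code Spark
// generates for it) grows with each withColumn
val grown = (1 to 50).foldLeft(base) { (d, i) =>
  d.withColumn(s"c$i", col("value") + i)
}
grown.explain(true)
```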
Does this sound plausible? How can I avoid it?