What if you created a class that formats dates using a fixed-size pool of pre-created SimpleDateFormat objects, handed out in round-robin fashion? Given that uncontended synchronization is cheap, each call could synchronize on the SimpleDateFormat object it picks, spreading collisions across the whole set.
So there might be 50 formatters, each used in turn; a collision, and therefore lock contention, would occur only if 51 dates were being formatted simultaneously.
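A minimal sketch of the idea, not the code from my website: the class name DateFormatPool, its constructor parameters, and the use of an AtomicInteger as the round-robin counter are all assumptions made for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin pool of pre-created SimpleDateFormat objects.
final class DateFormatPool {
    private final SimpleDateFormat[] formatters;
    private final AtomicInteger next = new AtomicInteger();

    DateFormatPool(String pattern, int size) {
        formatters = new SimpleDateFormat[size];
        for (int i = 0; i < size; i++) {
            formatters[i] = new SimpleDateFormat(pattern);
        }
    }

    String format(Date date) {
        // Mask off the sign bit so the index stays non-negative
        // after the counter overflows.
        int index = (next.getAndIncrement() & 0x7fffffff) % formatters.length;
        SimpleDateFormat formatter = formatters[index];
        // Lock only the chosen formatter; threads that picked other
        // formatters proceed without waiting.
        synchronized (formatter) {
            return formatter.format(date);
        }
    }
}
```

Usage would be something like new DateFormatPool("yyyy-MM-dd HH:mm:ss.SSS", 50).format(new Date()), with one pool shared by all threads.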
EDIT 2011-02-19 (PST)
I implemented a fixed pool as suggested above; the code (including the test) is available on my website.
Following are the results on a quad-core AMD Phenom II 965 BE, running the Java SE 6 client JVM:
2011-02-19 15:28:13.039 : Threads=10, Iterations=1,000,000
2011-02-19 15:28:13.039 : Test 1:
2011-02-19 15:28:25.450 : Sync : 12,411 ms
2011-02-19 15:28:37.380 : Create : 10,862 ms
2011-02-19 15:28:42.673 : Clone : 4,221 ms
2011-02-19 15:28:47.842 : Pool : 4,097 ms
2011-02-19 15:28:48.915 : Test 2:
2011-02-19 15:29:00.099 : Sync : 11,184 ms
2011-02-19 15:29:11.685 : Create : 10,536 ms
2011-02-19 15:29:16.930 : Clone : 4,184 ms
2011-02-19 15:29:21.970 : Pool : 3,969 ms
2011-02-19 15:29:23.038 : Test 3:
2011-02-19 15:29:33.915 : Sync : 10,877 ms
2011-02-19 15:29:45.180 : Create : 10,195 ms
2011-02-19 15:29:50.320 : Clone : 4,067 ms
2011-02-19 15:29:55.403 : Pool : 4,013 ms
Notably, cloning and pooling were very close together. In repeated runs, cloning was faster than pooling about as often as it was slower. The test, of course, was deliberately designed for extreme contention.
In the specific case of the SimpleDateFormat, I think I might be tempted to just create a template and clone it on demand. In the more general case, I might be tempted to use this pool for such things.
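For comparison, the clone-on-demand approach might look roughly like the following. This is a sketch under assumptions: the class name ClonedDateFormatter is hypothetical, and it relies on the template never being used for formatting or otherwise modified after construction, so that concurrent clone() calls only read from it.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical wrapper that clones a shared template for every call.
final class ClonedDateFormatter {
    // Never used for formatting directly; only cloned.
    private final SimpleDateFormat template;

    ClonedDateFormatter(String pattern) {
        template = new SimpleDateFormat(pattern);
    }

    String format(Date date) {
        // The clone carries its own internal state, so no locking is
        // needed; the cost is one short-lived object per call.
        SimpleDateFormat copy = (SimpleDateFormat) template.clone();
        return copy.format(date);
    }
}
```

The trade-off against the pool is garbage per call instead of a lock per call.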
Before making a final decision one way or the other, I would want to test thoroughly on a variety of JVMs and versions, and with a variety of these kinds of objects. Older JVMs, and those on small devices like handhelds and phones, might have much more overhead in object creation and garbage collection. Conversely, they might have more overhead in uncontended synchronization.
FWIW, from my review of the code, it seemed that SimpleDateFormat would most likely have the most work to do in being cloned.
EDIT 2011-02-19 (PST)
Also interesting are the uncontended single-threaded results. In this case the pool performs on par with a single synchronized object. This would imply that the pool is the best alternative overall, since it delivers excellent performance both when contended and when uncontended. A little surprising is that cloning is noticeably slower than both when single-threaded.
2011-02-20 13:26:58.169 : Threads=1, Iterations=10,000,000
2011-02-20 13:26:58.169 : Test 1:
2011-02-20 13:27:07.193 : Sync : 9,024 ms
2011-02-20 13:27:40.320 : Create : 32,060 ms
2011-02-20 13:27:53.777 : Clone : 12,388 ms
2011-02-20 13:28:02.286 : Pool : 7,440 ms
2011-02-20 13:28:03.354 : Test 2:
2011-02-20 13:28:10.777 : Sync : 7,423 ms
2011-02-20 13:28:43.774 : Create : 31,931 ms
2011-02-20 13:28:57.244 : Clone : 12,400 ms
2011-02-20 13:29:05.734 : Pool : 7,417 ms
2011-02-20 13:29:06.802 : Test 3:
2011-02-20 13:29:14.233 : Sync : 7,431 ms
2011-02-20 13:29:47.117 : Create : 31,816 ms
2011-02-20 13:30:00.567 : Clone : 12,382 ms
2011-02-20 13:30:09.079 : Pool : 7,444 ms