
I have a Java 1.5 web application that converts arbitrary PDF files to images. It takes too long to process all pages of even a single PDF in one shot, so I want to process pages on demand.

I've read that I can use an ExecutorService to launch/queue the image generation operation in a new thread as the HTTP requests for particular pages arrive. How do I ensure that I'm not queueing duplicate operations (e.g., two users request the same page from the same PDF) without resorting to a single thread executor? How can I use something like a synchronized list to track which images the worker threads are processing (or, what type of synchronization mechanism can help me track this)?

quietmint
  • That's a nice question; there are a couple of bad approaches to this, but the right one is a bit tricky to come up with. For example, you probably can't simply use synchronization as it would probably introduce a severe performance hit. Let me think about that and see if I can help you... – Powerslave Apr 14 '13 at 01:20
  • *"For example, you probably can't simply use synchronization as it would probably introduce a severe performance hit."* - You would only get a significant performance hit if there were lots of people requesting PDFs at the same time. The chances are that this won't be the case. It is better to do the synchronization conservatively ... and worry about the potential bottlenecks later. – Stephen C Apr 14 '13 at 01:42
  • @StephenC In terms of *premature optimization* I have to agree with you. Still, Java's built-in concurrent classes do this better than sticking to the `synchronized` keyword. – Powerslave Apr 14 '13 at 01:45

2 Answers


You can use a ConcurrentSkipListSet or ConcurrentHashMap to track which PDFs have already been processed (and are presumably cached) or are currently being processed. Use a ConcurrentLinkedQueue for your PDF-to-image requests. When a worker thread pulls a request off the queue, it tries to add it to the Set/Map: if the add succeeds, the thread processes the request; if the add fails, the request was already in the container and can be skipped.
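As a rough illustration of the claim-then-process idea, here is a minimal sketch. The class name, the `pdfId#page` key format, and the omitted conversion step are all assumptions made for the example. It relies only on `ConcurrentHashMap.putIfAbsent`, which returns `null` when the key was not yet present, so the check and the insert happen as one atomic step.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: names and the "pdfId#page" key format are assumptions.
class PageClaimSketch {
    // Keys that are currently being processed or have already been processed.
    private final ConcurrentMap<String, Boolean> claimed =
            new ConcurrentHashMap<String, Boolean>();
    // Pending page requests; duplicates may be enqueued freely.
    final Queue<String> requests = new ConcurrentLinkedQueue<String>();

    /** Atomically claims a key; returns true only for the first caller. */
    boolean tryClaim(String key) {
        return claimed.putIfAbsent(key, Boolean.TRUE) == null;
    }

    /** Worker loop body: drains the queue and counts the requests it really handled. */
    int drain() {
        int processed = 0;
        String key;
        while ((key = requests.poll()) != null) {
            if (tryClaim(key)) {
                // The real page conversion would go here (omitted in this sketch).
                processed++;
            }
            // Otherwise another request already claimed this page, so skip it.
        }
        return processed;
    }

    public static void main(String[] args) {
        PageClaimSketch sketch = new PageClaimSketch();
        sketch.requests.add("report.pdf#3");
        sketch.requests.add("report.pdf#3"); // duplicate request
        sketch.requests.add("report.pdf#4");
        System.out.println(sketch.drain()); // prints 2
    }
}
```

Several worker threads can call `drain()` at once; because the claim is a single atomic call, there is no window between a separate `contains` check and an `add` in which two threads could both win.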

Zim-Zam O'Pootertoot
  • Why would you explicitly synchronize when `ConcurrentXYZ` is already thread safe (and faster)? Also, between a call to `contains` and `add` can be a concurrent write from another thread; `add` returns a `boolean` value indicating whether the contents changed (the addition happened) or not (the item was already present) so one is better off without using `contains` here - the *compare and set* must be atomic. – Powerslave Apr 14 '13 at 01:47

You could use a ConcurrentHashMap<String, Future<String>> with a PDF identifier (e.g., the file path) as the key and a task representing the conversion operation itself as the value.

The putIfAbsent method of ConcurrentHashMap performs the check-and-insert as a single atomic operation, and the isDone method of Future indicates whether the conversion has finished.

When putIfAbsent returns null, it means that the conversion task for a given PDF did not yet exist, thus you need to invoke ExecutorService.submit(Callable<T> task) to fire up your newly created conversion task; otherwise you omit this step and wait for the already existing task to finish.

Mockup:

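Future<String> conversionTask = ... // the new, not-yet-started task for this PDF
// Atomically register the task unless one is already registered for this PDF.
Future<String> existingTask = conversions.putIfAbsent(pdfId, conversionTask);
if (existingTask != null) {
    // Another request got there first; share its task instead of ours.
    conversionTask = existingTask;
}
// Either way, conversionTask now refers to the one task registered for this PDF.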
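The mockup above leaves open how the task is created before it is submitted. One possible way to fill that gap, sketched here under assumptions, is to wrap the work in a `FutureTask`, register it with `putIfAbsent`, and hand it to the executor only if the registration won. The `ConversionRegistry` class, the `started` counter, and the placeholder "conversion" that just appends `.png` are hypothetical and exist only for the demonstration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: class name, counter, and fake conversion are assumptions.
class ConversionRegistry {
    private final ConcurrentMap<String, Future<String>> conversions =
            new ConcurrentHashMap<String, Future<String>>();
    private final ExecutorService executor;
    // Counts how many conversions actually ran (for the demo only).
    final AtomicInteger started = new AtomicInteger();

    ConversionRegistry(ExecutorService executor) {
        this.executor = executor;
    }

    /** Returns the single shared task for pdfId, scheduling it on first request. */
    Future<String> convert(final String pdfId) {
        FutureTask<String> newTask = new FutureTask<String>(new Callable<String>() {
            public String call() {
                started.incrementAndGet();
                return pdfId + ".png"; // placeholder for the real conversion
            }
        });
        Future<String> existing = conversions.putIfAbsent(pdfId, newTask);
        if (existing != null) {
            return existing; // someone else registered first; newTask is never run
        }
        executor.execute(newTask); // we won the registration, so schedule the work
        return newTask;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        ConversionRegistry registry = new ConversionRegistry(pool);
        Future<String> first = registry.convert("report.pdf");
        Future<String> second = registry.convert("report.pdf");
        System.out.println(first == second);        // prints true
        System.out.println(first.get());            // prints report.pdf.png
        System.out.println(registry.started.get()); // prints 1
        pool.shutdown();
    }
}
```

In this sketch a failed or cancelled task stays in the map, so a real application would probably want to remove the entry when the conversion throws, letting a later request retry.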

The ExecutorService takes care of queueing your conversion requests.

Once a conversion completes, you can retrieve the result via the Future<V>.get() method, which blocks until the result is available.
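Since an HTTP request thread may not want to wait indefinitely, the timed overload `Future.get(long, TimeUnit)` can be used instead. The following is only a sketch: the `awaitImage` helper, the choice to return `null` on timeout, and the stand-in conversion are assumptions, not part of the original answer.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: helper name and null-on-timeout convention are assumptions.
class AwaitResultSketch {
    /** Waits up to timeoutMillis for the result; returns null if it is not ready yet. */
    static String awaitImage(Future<String> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        try {
            return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null; // still converting; the caller could respond with "try again later"
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> task = pool.submit(new Callable<String>() {
            public String call() {
                return "page-1.png"; // stand-in for a real conversion
            }
        });
        System.out.println(awaitImage(task, 5000));
        pool.shutdown();
    }
}
```

An `ExecutionException` from `get` means the conversion itself threw, and its cause carries the original exception.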

Please note that spawning threads within a Java EE application is not permitted by the specification. A common approach is to separate your asynchronous processing as a JMS service - Apache Camel can help you here.

Arjan Tijms
Powerslave