I wrote a program that needs to process a very large dataset, and I'm planning to run it with multiple threads on a high-end machine.

I'm a beginner in Clojure and I'm lost in the myriad of tools at my disposal: agents, futures, core.async (and Quartzite?). I would like to know which one is best suited for this job.

The following describes my situation:

  1. I have a function which transforms some data and stores it in a database.
  2. The argument to this function is popped from a Redis set.
  3. I want to run the function in several separate threads for as long as there is a value in the Redis set.
vettipayyan

3 Answers

For simplicity, futures can't be beat. A future runs a body of code on another thread and hands back a reference you can deref to get the result. However, you often need more fine-grained control than they provide.
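As a minimal sketch of the futures approach (the `process-item` function is a made-up stand-in for your real transform-and-store step, not something from your code):

```clojure
;; Hypothetical stand-in for the real "transform and store" function.
(defn process-item [x]
  (* x x))

(def results
  ;; mapv is eager, so all four futures are started before any deref.
  (let [futs (mapv #(future (process-item %)) [1 2 3 4])]
    ;; deref blocks until the corresponding future has finished.
    (mapv deref futs)))

(println results)

;; In a standalone script, call (shutdown-agents) at the very end,
;; otherwise the JVM may linger while the future threads idle out.
```

Note that this starts one future per item, which is fine for a handful of items but gives you no cap on how many run at once.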

The core.async library has nice support for parallelism (via pipeline), and it also gives you back-pressure. You need some way to control the flow of data so that no worker is starved of work or buried under too much of it. core.async channels are always bounded (either unbuffered or with a fixed-size buffer), which helps with this problem. It's also a pretty natural model of your problem: take a value from a source, transform it (maybe using a transducer?) with some given parallelism, and then put the result into your database.
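Here is a rough sketch of that model, assuming the `org.clojure/core.async` dependency is on your classpath. The `transform` function, the buffer sizes, and the parallelism of 4 are all invented for illustration, and the `range` stands in for values popped from Redis. I used `pipeline-blocking` rather than `pipeline` on the assumption that the real work does blocking I/O.

```clojure
(require '[clojure.core.async :as a])

;; Hypothetical stand-in for the real transformation.
(defn transform [x]
  (inc x))

;; Bounded channels: a producer waits once 8 items are queued.
(def in  (a/chan 8))
(def out (a/chan 8))

;; Run up to 4 transformations at a time, moving values from `in` to `out`.
;; `out` is closed automatically once `in` is closed and drained.
(a/pipeline-blocking 4 out (map transform) in)

;; Producer: stands in for popping values from the Redis set.
(a/thread
  (doseq [x (range 10)]
    (a/>!! in x))
  (a/close! in))

;; Consumer: stands in for the database writes; here we just collect.
(def results (a/<!! (a/into [] out)))

(println results)
```

The pipeline keeps the output in the same order as the input, so the slowest item in flight can hold up later results. That may or may not matter for your case.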

You can also go the manual route and use Java's excellent java.util.concurrent library. It has low-level primitives as well as thread-management tools such as thread pools. All of this is accessible from Clojure.
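A small sketch of the manual route with a fixed-size thread pool. Again, `process-item` and the pool size of 4 are placeholders I made up:

```clojure
(import '(java.util.concurrent Executors))

;; Hypothetical stand-in for the real work.
(defn process-item [x]
  (* 2 x))

(def results
  (let [pool (Executors/newFixedThreadPool 4)]
    (try
      ;; Clojure functions implement Callable, so a vector of
      ;; zero-argument functions can be handed to invokeAll directly.
      (let [tasks (mapv (fn [x] (fn [] (process-item x))) (range 5))
            futs  (.invokeAll pool tasks)]
        ;; invokeAll returns the futures in task order; .get blocks
        ;; until each one is done.
        (mapv #(.get %) futs))
      (finally
        ;; Always shut the pool down so its threads don't keep the JVM alive.
        (.shutdown pool)))))

(println results)
```

Unlike bare futures, the pool caps how many tasks run at the same time.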

From a design standpoint, it comes down to whether you are more CPU-bound or I/O-bound. This affects decisions such as whether or not to perform parallel reads from Redis and parallel writes to your database. If you are CPU-bound, so that the computation is your bottleneck, then it wouldn't make much sense to parallelize your reads from Redis or your writes to your database, would it? These are the types of things to consider.

You really have two problems to solve: (1) your familiarity with Clojure's/Java's concurrency mechanisms, and (2) your approach to this problem (i.e., how would you approach it irrespective of the language you're using?). Once you solve #2, you will have a much better idea of which of the tools mentioned above to use, and how to use them.

Josh

Sounds like you may have a nice embarrassingly parallel problem to solve. In that case, you could start simply by coding up your processing as a top-level function that processes a single datum. Once that's working, wrap it in a map to handle all of the data sequentially (serially, one at a time).

You might want to start tackling the bigger problem with just a few items from your data set. That will make your testing smoother and faster.

After you have the map working, it's time to just add a p (parallel) to your code to make it a pmap. This is a very rewarding way to heat up your machine. Here is a discussion about the number of threads pmap uses.
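To make the map-to-pmap step concrete, here is a toy sketch. The `process-item` function and the small `data` range are placeholders for your real function and dataset:

```clojure
;; Hypothetical stand-in for the real per-item processing.
(defn process-item [x]
  (* x x))

;; A few items only, as suggested above for quick testing.
(def data (range 1 6))

;; Serial version. map is lazy, so doall forces the work to happen.
(def serial (doall (map process-item data)))

;; Parallel version: same call, one extra letter. pmap is also lazy,
;; and it returns results in the same order as the input.
(def parallel (doall (pmap process-item data)))

(println (= serial parallel))
```

pmap only pays off when each item takes long enough to outweigh the coordination overhead, so a trivial function like this one will not actually run faster.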


The above is the simplest approach. If you need finer control over the concurrency, this concurrency screencast explores the use cases.

Micah Elliott

It is hard to be precise w/o knowing the details of your problem. There are several choices as you mention:

  • Plain Java threads & threadpools. If your problem is similar to a pre-existing Java solution, this may be the most straightforward.
  • Simple Clojure threading with future et al. Kicking off a thread with future and getting the result back by dereferencing it is very easy.
  • Replace map with pmap (parallel map). This can help in simple cases that are primarily map/reduce oriented.
  • The Claypoole library: Lots of tools to make multithreading simpler and easier. Please see their GitHub project and the Clojure/West talk.
Alan Thompson