Scheduled Job Task

Question

Subject:

I’m trying to implement basic job scheduling in Java to handle recurring, persisted scheduled tasks (for a personal learning project). I don’t want to use any ready-to-use libraries like Quartz/Obsidian/Cron4J/etc.

Objective:

Jobs have to be persistent (to survive a server shutdown)
Job execution can take up to ~2-5 min
Manage a large number of jobs
Multithreaded
Light and fast ;)

All my jobs are in a MySQL database.

JOB_TABLE (id, name, nextExecution, lastExecution, status (IDLE, PENDING, RUNNING))

Step by step:

Retrieve each job from “JOB_TABLE” where “nextExecution > now” AND “status = IDLE”. This step is executed every 10 min by a single thread.
For each job retrieved, I put a new thread in a ThreadPoolExecutor, then I update the job status to “PENDING” in my “JOB_TABLE”.
When the job thread is running, I update the job status to “RUNNING”.
When the job is finished, I update the lastExecution with current time, I set a new nextExecution time and I change the job status to “IDLE”.

When the server starts, I put each PENDING/RUNNING job back in the ThreadPoolExecutor.

Question/Observation:

Step 2: Will the ThreadPoolExecutor handle a large number of threads (~20000)?
Should I use a NoSQL solution instead of MySQL?
Is this the best solution for such a use case?

This is a draft; there is no code behind it. I’m open to suggestions, comments and criticism!

"I don’t want to use any (ready-to-use) libraries like Quartz/Obsidian/Cron4J/etc" - why not ? — Brian Agnew, Feb 25 '14 at 10:37
Because it's a study project to improve my Java knowledge! ;) — user3350705Ol, Feb 25 '14 at 10:40
If so, then analyse the code of those libs. You can learn more that way than by reinventing the wheel. — Damian Leszczyński - Vash, Feb 25 '14 at 10:51
@Vash: I will not necessarily code it; what really matters to me is understanding how to handle such a case (its issues, the architecture, and the solutions for it). — user3350705Ol, Feb 25 '14 at 11:16

Ivaylo Slavov · Accepted Answer · 2014-02-25T11:18:36.433

I have done something similar to your task on a real project, but in .NET. Here is what I can recall regarding your questions:

Step 2: Will the ThreadPoolExecutor handle a large number of threads (~20000)?

We discovered that .NET's built-in thread pool was the worst approach, as the project was a web application. The reason: the web application relies on the built-in thread pool (which is static and thus shared by everything within the running process) to run each request in a separate thread while maintaining effective recycling of threads. Using the same thread pool for our internal processing would have exhausted it and left no free threads for user requests, or degraded their performance, which was unacceptable.

As you seem to be running quite a lot of jobs (20k is a lot for a single machine), you should definitely look for a custom thread pool. There is no need to write your own, though: I bet there are ready-made solutions, and writing one is far beyond what your study project would require (see the comments), if I understand correctly that you are doing a school or university project.
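
In Java terms, a dedicated pool could look like the minimal sketch below. It assumes a small fixed number of worker threads and a bounded queue big enough for the backlog; the class name DedicatedJobPool and the sizes are illustrative, not taken from the question. The point it shows is that 20,000 jobs do not need 20,000 threads: a handful of workers can drain a queue.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class: a pool used only for scheduled jobs,
// separate from whatever pool the server uses for user requests.
class DedicatedJobPool {

    // Runs jobCount trivial jobs on a fixed number of workers
    // and returns how many of them completed.
    static int runJobs(int jobCount, int workers) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers,                      // fixed size: core == max
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(jobCount));  // bounded queue for waiting jobs
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < jobCount; i++) {
            pool.execute(completed::incrementAndGet);  // stand-in for real job logic
        }
        pool.shutdown(); // accept no new jobs, finish the queued ones
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("jobs did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runJobs(20_000, 8) + " jobs completed on 8 threads");
    }
}
```

With real jobs that take minutes each, the number of workers decides how long the backlog takes to clear, so it is worth measuring rather than guessing.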

Should I use a NoSQL solution instead of MySQL?

Depends. You obviously need to update the job status concurrently, so you will have simultaneous access to a single table from multiple threads. Databases can scale pretty well for that, assuming you do things right. Here is what I mean by doing it right:

Design your code so that each job affects only its own subset of rows in the database (this includes other tables). If you are able to do so, you will not need any explicit locks at the database level (in the form of transaction isolation levels). You can even use a permissive isolation level that may allow dirty or phantom reads, which will perform faster. But beware: you must carefully ensure that no jobs compete for the same rows. This is hard to achieve in real-life projects, so you should probably also look into alternative approaches to database locking.
Use an appropriate transaction isolation level. The isolation level defines the locking behavior at the database level. Depending on the level, the database may effectively lock the entire table, only the rows you affect, or nothing at all. Use it wisely, as any misuse could affect data consistency and integrity, as well as the stability of the entire application or database server.
I am not familiar with NoSQL databases, so I can only advise you to research their concurrency capabilities and map them to your scenario. You could end up with a really suitable solution, but you have to check it against your needs. From your description, you will have to support simultaneous data operations over the same type of objects (whatever the analog of a table is).
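
One way to keep each job confined to its own row is a conditional update: change the status only if it still has the expected value, then check whether the update actually applied. In SQL this could be something like UPDATE JOB_TABLE SET status = 'PENDING' WHERE id = ? AND status = 'IDLE', treating an affected-row count of 1 as a successful claim. The sketch below imitates that pattern in memory with a ConcurrentHashMap so that it runs without a database; JobClaims and its methods are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// In-memory stand-in for JOB_TABLE: job id -> status.
class JobClaims {
    enum Status { IDLE, PENDING, RUNNING }

    private final ConcurrentMap<Long, Status> table = new ConcurrentHashMap<>();

    void addIdleJob(long id) {
        table.put(id, Status.IDLE);
    }

    // Analog of: UPDATE ... SET status = 'PENDING' WHERE id = ? AND status = 'IDLE'
    // Returns true only for the one caller whose update actually applied.
    boolean tryClaim(long id) {
        return table.replace(id, Status.IDLE, Status.PENDING);
    }

    // Lets several threads race to claim the same job and returns the number of winners.
    static int raceForJob(int contenders) {
        JobClaims claims = new JobClaims();
        claims.addIdleJob(1L);
        AtomicInteger winners = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[contenders];
        for (int i = 0; i < contenders; i++) {
            threads[i] = new Thread(() -> {
                try {
                    start.await(); // all contenders start together
                    if (claims.tryClaim(1L)) {
                        winners.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        start.countDown();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return winners.get();
    }

    public static void main(String[] args) {
        System.out.println("winners: " + raceForJob(16));
    }
}
```

However many threads compete, only one of them sees the claim succeed, which is the property the database version relies on as well.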

Is this the best solution for such a use case?

Yes and No.

Yes, as you will encounter one of the difficult tasks developers face in the real world. I have worked with colleagues who had more than three times my experience and were more reluctant to take on multi-threading tasks than I was; they really hated them. If you find this area interesting, play with it, learn, and improve as much as you need to.
No, because if you are working on a real-life project, you need something reliable. If you have this many questions, you will obviously need time to mature before you can produce a stable solution for such a task. Multi-threading is a difficult topic for many reasons:
- It is hard to debug.
- It introduces many points of failure, and you need to be aware of all of them.
- It can be a pain for other developers to assist with or work on your code, unless you have stuck to commonly accepted rules.
- Error handling can be tricky.
- Behavior is unpredictable / non-deterministic.
There are existing solutions with a high level of maturity and reliability, and they are the preferred approach for real projects. The drawback is that you will have to learn them and examine how customizable they are for your needs.

Anyway, if you need to do it your way and then port your work to a real project, or a project of your own, I can advise you to do this in a pluggable way. Use abstraction, programming to interfaces, and other practices to decouple your specific implementation from the logic that sets up the scheduled jobs. That way, you can adapt your API to an existing solution if your own implementation becomes a problem.
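
A minimal sketch of that decoupling, assuming a hypothetical JobScheduler interface (all names here are illustrative and not from any library). Calling code depends only on the interface, so the home-grown implementation could later be replaced by an adapter around an existing scheduler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical abstraction: callers only ever see this interface.
interface JobScheduler {
    void scheduleRecurring(String name, Runnable job, long period, TimeUnit unit);
    void shutdown();
}

// One possible implementation backed by the JDK. An adapter around
// another scheduling solution could implement the same interface.
class SimpleJobScheduler implements JobScheduler {
    private final ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);

    @Override
    public void scheduleRecurring(String name, Runnable job, long period, TimeUnit unit) {
        // The name is unused in this sketch; a persistent version would store it.
        executor.scheduleAtFixedRate(job, 0, period, unit);
    }

    @Override
    public void shutdown() {
        executor.shutdownNow();
    }
}

class SchedulerDemo {
    // Schedules a job every 10 ms and reports whether it ran three times within 5 s.
    static boolean runsRepeatedly() {
        JobScheduler scheduler = new SimpleJobScheduler(); // the only line tied to an implementation
        CountDownLatch threeRuns = new CountDownLatch(3);
        scheduler.scheduleRecurring("demo", threeRuns::countDown, 10, TimeUnit.MILLISECONDS);
        try {
            return threeRuns.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("ran repeatedly: " + runsRepeatedly());
    }
}
```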

And last, but not least, I did not see any error-handling provisions on your side. Think about and research what to do if a job fails. At the very least, add a 'FAILED' status or something similar to persist in such a case. Error handling is tricky when it comes to threads, so be thorough in your research and practices.
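
A minimal sketch of recording a failure, assuming a FAILED status is added next to the existing ones. The map stands in for the status column of JOB_TABLE, and FailureTracking and tracked are hypothetical names. The wrapper matters because an exception thrown inside a pool thread is otherwise easy to lose.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class FailureTracking {
    enum Status { IDLE, RUNNING, FAILED }

    // Wraps a job so that its outcome is always written back to the status store.
    static Runnable tracked(Map<String, Status> statuses, String jobName, Runnable job) {
        return () -> {
            statuses.put(jobName, Status.RUNNING);
            try {
                job.run();
                statuses.put(jobName, Status.IDLE);   // finished normally
            } catch (RuntimeException e) {
                statuses.put(jobName, Status.FAILED); // failure is persisted, not lost
            }
        };
    }

    // Runs one succeeding and one failing job and returns the final statuses.
    static Map<String, Status> runDemo() {
        Map<String, Status> statuses = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(tracked(statuses, "ok-job", () -> { }));
        pool.execute(tracked(statuses, "bad-job", () -> {
            throw new IllegalStateException("simulated failure");
        }));
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return statuses;
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

A fuller version would probably also store the error message and decide whether a failed job should be retried, which is part of the research suggested above.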

Good luck

score 1 · Answer 2 · edited May 23 '17 at 11:44

You can declare the maximum pool size with ThreadPoolExecutor#setMaximumPoolSize(int). As Integer.MAX_VALUE is larger than 20000, technically yes, it can.

The other question is whether your machine would support running so many threads. You will have to provide enough RAM for each thread to allocate its stack.

There should be no problem running ~20,000 threads on a modern desktop or laptop, but on a mobile device it could be an issue.
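
As a rough worked estimate, assume a stack size of 1 MiB per thread. That is a common default on 64-bit JVMs and can be changed with the -Xss option, but the actual value depends on the platform and JVM, so treat it as an assumption. The helper below (a hypothetical name) just does the multiplication.

```java
class StackEstimate {
    // Total stack reservation in MiB for the given thread count,
    // with the per-thread stack size given in KiB.
    static long totalStackMiB(int threads, long stackKiBPerThread) {
        return threads * stackKiBPerThread / 1024;
    }

    public static void main(String[] args) {
        long mib = totalStackMiB(20_000, 1024); // 20,000 threads at an assumed 1 MiB each
        System.out.println(mib + " MiB, about " + (mib / 1024.0) + " GiB");
    }
}
```

Under that assumption, 20,000 threads reserve about 20,000 MiB (roughly 19.5 GiB) for stacks alone. Much of that is typically reserved address space rather than memory in use right away, but it shows why the thread count is worth bounding.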

From doc:

Core and maximum pool sizes

A ThreadPoolExecutor will automatically adjust the pool size (see getPoolSize()) according to the bounds set by corePoolSize (see getCorePoolSize()) and maximumPoolSize (see getMaximumPoolSize()). When a new task is submitted in method execute(java.lang.Runnable), and fewer than corePoolSize threads are running, a new thread is created to handle the request, even if other worker threads are idle. If there are more than corePoolSize but less than maximumPoolSize threads running, a new thread will be created only if the queue is full. By setting corePoolSize and maximumPoolSize the same, you create a fixed-size thread pool. By setting maximumPoolSize to an essentially unbounded value such as Integer.MAX_VALUE, you allow the pool to accommodate an arbitrary number of concurrent tasks. Most typically, core and maximum pool sizes are set only upon construction, but they may also be changed dynamically using setCorePoolSize(int) and setMaximumPoolSize(int).

More
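
The quoted rules can be observed directly. The sketch below, with the hypothetical class name PoolGrowthDemo, submits tasks that block until released and then reads the pool size and queue length. With corePoolSize 2, maximumPoolSize 4 and a queue capacity of 2, six tasks give two core threads, two queued tasks, and two extra threads created only once the queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolGrowthDemo {
    // Submits blocking tasks and returns {pool size, queued task count}.
    static int[] observe(int core, int max, int queueCapacity, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 30, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // keep the worker busy until measured
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int[] result = { pool.getPoolSize(), pool.getQueue().size() };
        release.countDown(); // let every task finish
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) {
        int[] r = observe(2, 4, 2, 6);
        System.out.println("pool size = " + r[0] + ", queued = " + r[1]); // pool size = 4, queued = 2
    }
}
```

A seventh task in this configuration would be rejected with the default handler. With an unbounded queue the pool would never grow past corePoolSize, which is why the queue choice matters as much as the maximum size.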

About the DB: create a solution that does not depend on the DB structure. Then you can set up two environments and measure them. Start with the technology that you know, but stay open to other solutions. At the beginning, a relational DB should keep up with the performance requirements, and if you manage it properly, it should not be an issue later. NoSQL solutions are used to work with really big data. But the best option for you is to build both and run some performance tests.
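
A minimal sketch of such a storage-independent design, assuming a hypothetical JobRepository interface. The in-memory class is only there to make the example runnable; a MySQL-backed and a NoSQL-backed class could implement the same interface and be measured side by side. The sketch also assumes that a job is due once its nextExecution is not after the current time.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical storage abstraction: the scheduler talks only to this.
interface JobRepository {
    void save(String name, Instant nextExecution);
    List<String> findDue(Instant now);
}

// Stand-in implementation used here instead of a real database.
class InMemoryJobRepository implements JobRepository {
    private final Map<String, Instant> jobs = new ConcurrentHashMap<>();

    @Override
    public void save(String name, Instant nextExecution) {
        jobs.put(name, nextExecution);
    }

    @Override
    public List<String> findDue(Instant now) {
        List<String> due = new ArrayList<>();
        jobs.forEach((name, next) -> {
            if (!next.isAfter(now)) { // assumption: due means nextExecution <= now
                due.add(name);
            }
        });
        Collections.sort(due); // stable order for callers and tests
        return due;
    }
}

class RepositoryDemo {
    public static void main(String[] args) {
        JobRepository repo = new InMemoryJobRepository();
        Instant now = Instant.now();
        repo.save("report", now.minusSeconds(60));
        repo.save("cleanup", now.plusSeconds(600));
        System.out.println("due: " + repo.findDue(now));
    }
}
```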

In terms of memory, will I have issues with 1M jobs in the ThreadPool queue? You are right about database abstraction, but based on your experience, will I run into problems with such an architecture (on a small VM)? — user3350705Ol, Feb 25 '14 at 11:05
See the edit. But the bottom line is that it depends. Regarding the design, I will say the same: it depends on how you tune the DB. At this point you should start with the idea you have, then measure it and try to improve it in percentage terms. The main goal of your project is to execute the task at the right moment; how it is done is just a technical detail. Read up on thread management and then start building a prototype. — Damian Leszczyński - Vash, Feb 25 '14 at 12:36
