You are right to consider concurrency. Unless you have only one PHP thread responding to client requests, there's really nothing to stop each of them from handing out the same row from data
to be processed - in fact, since they each run the same query, they'll almost certainly hand out the same row.
The easiest way to solve that problem is locking, as suggested in the accepted answer. That may work if the time each PHP server thread spends running the SELECT ... FOR UPDATE
or LOCK TABLES ... UNLOCK TABLES
(non-transactional) is minimal, so that the other threads can afford to wait while each thread runs this code (it's still wasteful, as they could be processing some other data row, but more on that later).
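For completeness, the transactional locking version would look roughly like this - a sketch only, assuming InnoDB and a table like the data table defined just below; the @claimed_id session variable is just illustrative:
START TRANSACTION;
-- lock the next unprocessed row so no other transaction can claim it
SELECT @claimed_id := data_id
FROM data
WHERE status = 0
ORDER BY data_id
LIMIT 1
FOR UPDATE;
-- mark it as taken before releasing the lock
UPDATE data SET status = 1 WHERE data_id = @claimed_id;
COMMIT;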
There is a better solution, though it requires a schema change. Imagine you have a table such as this:
CREATE TABLE `data` (
`data_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`data` blob,
`status` tinyint(1) DEFAULT '0',
PRIMARY KEY (`data_id`)
) ENGINE=InnoDB;
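With this schema, each PHP thread has little choice but to do something like the following, and nothing stops two threads from interleaving between the SELECT and the UPDATE and grabbing the same row (a sketch; the literal data_id is just illustrative):
SELECT data_id, data FROM data WHERE status = 0 LIMIT 1;
-- another thread can run the same SELECT right here and get the same row
UPDATE data SET status = 1 WHERE data_id = 123;  -- 123 = the data_id the SELECT returned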
You don't have any way to transactionally claim "the next record to be processed", because the only field you have to update is status. But imagine your table looks more like this:
CREATE TABLE `data` (
`data_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`data` blob,
`status` tinyint(1) DEFAULT '0',
`processing_id` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`data_id`)
) ENGINE=InnoDB;
Then you can write a query something like this to mark the "next" row to be processed with your 'processing id':
UPDATE data
SET processing_id = @unique_processing_id
WHERE processing_id IS NULL AND status = 0 LIMIT 1;
And any SQL engine worth a darn will make sure you don't have 2 distinct processing IDs accounting for the same record to be processed at the same time. Then at your leisure, you can
SELECT * FROM data WHERE processing_id = @unique_processing_id;
and know that you're getting a unique record every time.
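To make that concrete, if two workers race, each one's UPDATE claims a different row - a sketch where the processing IDs 7 and 8 (and the data_ids) are just illustrative values:
-- worker A
SET @unique_processing_id = 7;
UPDATE data SET processing_id = @unique_processing_id
WHERE processing_id IS NULL AND status = 0 LIMIT 1;  -- claims, say, data_id 101

-- worker B, running at the same time
SET @unique_processing_id = 8;
UPDATE data SET processing_id = @unique_processing_id
WHERE processing_id IS NULL AND status = 0 LIMIT 1;  -- can no longer see data_id 101 as unclaimed; claims the next row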
This approach also lends itself well to durability concerns; you're basically identifying the batch processing run per data
row, meaning you can account for each batch job, whereas before you were potentially only accounting for the data rows.
I would probably implement the @unique_processing_id by adding a second table for this metadata (the auto-increment key is the real trick here, but other data processing metadata could be added):
CREATE TABLE `data_processing` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`data_id` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
and using that as a source for your unique IDs, you might end up with something like:
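-- start a new processing run; its auto-increment id becomes the unique processing id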
INSERT INTO data_processing SET date=NOW();
SET @unique_processing_id = (SELECT LAST_INSERT_ID());
-- claim the next unclaimed, unprocessed row for this run
UPDATE data
SET processing_id = @unique_processing_id
WHERE processing_id IS NULL AND status = 0 LIMIT 1;
-- record which data row this processing run claimed
UPDATE data
JOIN data_processing ON data_processing.id = data.processing_id
SET data_processing.data_id = data.data_id
WHERE data.processing_id = @unique_processing_id;
SELECT * FROM data WHERE processing_id = @unique_processing_id;
-- you are now ready to marshal the data to the client ... and ...
UPDATE data SET status = 1
WHERE status = 0
AND processing_id = @unique_processing_id
LIMIT 1;
That solves your concurrency problem, and it puts you in better shape to audit for durability as well, depending on how you set up the data_processing
table; you could track thread IDs, processing state, etc. to help verify that the data is really done being processed.
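For example, extra audit columns on data_processing might look like this - the column names are only a suggestion, not part of the schema above:
ALTER TABLE data_processing
ADD COLUMN thread_id int(10) unsigned DEFAULT NULL,  -- e.g. the CONNECTION_ID() of the claiming client
ADD COLUMN state tinyint(1) NOT NULL DEFAULT 0;      -- e.g. 0 = claimed, 1 = delivered, 2 = verified done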
There are other solutions - a message queue might be ideal, allowing you to queue each unprocessed data object's ID to the clients directly (or through a PHP script) and then provide an interface for that data to be retrieved and marked processed separately from the queueing of the "next" data. But as far as "mysql-only" solutions go, the concepts behind what I've shown you here should serve you pretty well.