Concurrency for select query in Redshift

Question

We have a table in Redshift:

people

people_id    people_tele      people_email        role
1            8989898332       john@gmail.com      manager
2            8989898333       steve@gmail.com     manager
3            8989898334       andrew@gmail.com    manager
4            8989898335       george@gmail.com    manager

I have a few users who would query the table like:

select * from people where role = 'manager' limit 1;

The system users are basically phone calling these people for up-selling products. So, when the query return results, it should not return same people ever.

For ex.

If User A executes the query - select * from people where role = 'manager' limit 1;, then he should get the result:

people_id    people_tele      people_email        role
1            8989898332       john@gmail.com      manager

If User B executes the query - select * from people where role = 'manager' limit 1;, then he should get the result:

people_id    people_tele      people_email        role
2            8989898333       steve@gmail.com     manager

APPROACH 1

So, I thought of adding a is_processed column to not return the same results. So, after User A executes the query, the table would look something like:

people_id    people_tele      people_email        role         is_processed
1            8989898332       john@gmail.com      manager      1
2            8989898333       steve@gmail.com     manager      0
3            8989898334       andrew@gmail.com    manager      0
4            8989898335       george@gmail.com    manager      0

APPROACH 2

Another thought was to create another table called - query_history where I have:

query_id   people_id     processed_time
1          1             22 Jan 2020, 4pm
2          2             22 Jan 2020, 5pm

QUESTION

My question is what happens when User A and User B queries at the EXACT same time? The system would return the same people_id at that moment and 2 phone calls would be made to the same person.

How can I solve the concurrency problem?

What happens depends on how you have implemented the update to the `is_processed` column. — Gordon Linoff, Jan 20 '20 at 15:24
@GordonLinoff - what would be the best way to implement this so the concurrency issue is not there? — dang, Jan 20 '20 at 15:25
. . Lock the entire table every time you want to fetch a value. However, that might not be feasible in a high-transaction environment. — Gordon Linoff, Jan 20 '20 at 15:40
Instead of storing `0`/`1` in `is_processed`, you could store the name of the user to whom the record is allocated. If there is a race condition, then the later update will win. That's fine, since it won't show the name of the earlier user. By the way, Amazon Redshift is designed to be a Data Warehouse, not a transaction processing database. Making such minor updates to records has an impact on the efficiency of storage and querying on the tables. — John Rotenstein, Jan 21 '20 at 01:09
why are you using redshift for a transactional application like this? can you convert just this part to e.g. postgres and set up a dms to keep redshift aligned? — Jon Scott, Jan 21 '20 at 07:12
@JonScott what is DMS? Redshift is our data warehouse where we store all people to call. It has over 30 million records. The problem is that we want the data to be picked only once. — dang, Jan 21 '20 at 08:37
Why did you choose a data warehouse for the backend of your transaction system? unless this is 99.9% for data warehouse analytical queries and just has this one exceptional use case for transactional then you have made the wrong choice. DMS (Database Migration Service) is an AWS service that (amongst other use cases) lets you replicate your transactional data (e.g. postgres) in to redshift for your analytics and reporting to then use. I realise this doesn't actually directly address your question but it is the — Jon Scott, Jan 21 '20 at 08:52
If I use Redshift for this transactional use case, what are the disadvantages? — dang, Jan 21 '20 at 08:54
Disadvantages : Redshift is very slow and inefficient for transactional. It has no indexes and has to pre-compile (can be slow). Its row level locking is inferior. There are no enforced constraints in Redshift. Redshift costs a LOT more if you just want transactional use cases!! — Jon Scott, Jan 21 '20 at 08:56
Adding a is_processed is not a good design. You need to have transaction (Redshift is not good at it) and overall the pattern is hard to maintain. I need a bit more information like - 'How long do you not want to contact the same person?' You could use a random function to randomly select a person otherwise — Mangat Rai Modi, Jan 23 '20 at 12:02
Also you would need an external system to handle the concurrency here, as for redshift they are separate independent queries. — Mangat Rai Modi, Jan 23 '20 at 12:03

score 2 · Answer 1 · answered Jan 23 '20 at 12:17

2

You can solve it with your Approach 1 only with adding Randomisers in it

SELECT * FROM people 
WHERE role = 'manager' 
AND is_processed = 0
order by random()
limit 1;

Refer: https://docs.aws.amazon.com/redshift/latest/dg/r_RANDOM.html

answered Jan 23 '20 at 12:17

Ronak Dhoot

2,322
1
12
19

score 0 · Accepted Answer · answered Jan 23 '20 at 17:12

0

Maybe you can solve with transactions ? try some try/catch maneuver.

Transaction MySQL

edit: Sorry, for some reason i thought you are working with MySQL. https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-transaction-management.html

answered Jan 23 '20 at 17:12

Aleksandar

84
1
4

while these links might answer the question it is b est to post more of the answer using the tools on stackoverflow – sao Jan 23 '20 at 17:22

Concurrency for select query in Redshift

2 Answers2