0

i've created a data governance tool, in one module of which i schedule a job that runs a few sql queries. It is now running synchronously but i want to convert it to a way that it runs faster.

the question is, should i use Parallel or Task or Async/Await method in order to achieve what i want?

EDIT:these queries are completely different from each other. They each run a different data quality checks on different columns or tables. These are all select queries, and i use the ExecuteScalar method returning only the count of the rows that breaks my quality rules, such as "how many null values in ABC column" I scheduled them to run at night as soon as DWH is populated

Volkan Yurtseven
  • 425
  • 3
  • 15
  • Whatever you want. – Blue Oct 15 '18 at 08:31
  • 1
    `async` won't make anything run faster, it will *release* a thread that would be blocked otherwise. Individual queries will be *slower* than their sync versions though. Parallel execution will *decrease* performance due to increased locking. Batching queries or passing multiple values in `IN()` or `VALUES()` clauses is a lot faster. SqlBulkCopy for bulk inserts is another fast option – Panagiotis Kanavos Oct 15 '18 at 08:31
  • It depends on what are these SQL queries. Could you share more info on these queries and why do they need to be run on schedule? – Yeldar Kurmangaliyev Oct 15 '18 at 08:33
  • hi, i edited my post, giving details about the queries – Volkan Yurtseven Oct 15 '18 at 08:37
  • @ExcelinEfendisi This doesn't help very much. What are these queries specifically? Select only? Updates? What is "data quality" in terms of database? How are the results of these jobs used then? – Yeldar Kurmangaliyev Oct 15 '18 at 08:38
  • `Parallel` and `async/awai` use Tasks in any case. None of them has anything to do with scheduling. What happens when one job takes 1-2 minutes and you need to start *another* job? You can use `Task.Run` to start each individual job on schedule but the real problem will be scheduling – Panagiotis Kanavos Oct 15 '18 at 08:41
  • @YeldarKurmangaliyev, i put more info now – Volkan Yurtseven Oct 15 '18 at 08:41
  • @ExcelinEfendisi you should either use the database's built-in job scheduler, eg SQL Server Agent, or use a scheduling library like Hangfire or Quartz.NET. In SQL Server you could create one SSIS package with all quality checks and schedule the entire package instead of individual jobs – Panagiotis Kanavos Oct 15 '18 at 08:42
  • @ExcelinEfendisi in short, whether you use Task.Run or not is more of an implementation detail. There are other, more important concerns, especially when you want to run quality checks *after* an ETL job. – Panagiotis Kanavos Oct 15 '18 at 08:45
  • @PanagiotisKanavos, i query Oracle and i write the number of rule violations into Ms Access. And i dont have the access for neither DB schedulers nor Informatica ETL tool. i need to do the scheduling programattically in my own tool. all i want is, all the querying takes 2 hours to run and i want to reduce the duration – Volkan Yurtseven Oct 15 '18 at 09:01
  • @ExcelinEfendisi why not use the OS's task scheduler instead of writing your own? What you ask isn't `all I want`. The types of checks will have a significant impact on performance too - if two checks use the same table, you probably *don't* want to run them concurrently and have them contend with each other. OTOH if one query takes 30 minutes and 5 others take 2 each, it would be a good idea to run 1 big query and a *sequence* of small ones in parallel. – Panagiotis Kanavos Oct 15 '18 at 09:09
  • @PanagiotisKanavos, scheduling isn't the key part here. let me change my question this way, how would achieve this by running these querries now, not by scheduling. i just want to know which one the proper way is, Parallel.For or Task/Async? i am asking this , coz i tried to implement the Task/async way but couldn't manage it. and i am wondering if the Task/async method is not suitable for this goal and should i use the Parallel.For instead, or am i just doing something wrong with Task/async – Volkan Yurtseven Oct 15 '18 at 09:39
  • 1
    @ExcelinEfendisi I understand the question because I also work with data warehouses and need to run quality checks. How you start the queries won't affect total execution time as much as finding *which* queries can be run in parallel and which don't. You can run *all* of them in parallelf with `Parallel.ForEach(jobs,job=>runJob(job));`, `var jobTasks=allJobs.Select(job=>Task.Run(()=>runJob(job))` or `jobTasks =await Task.WhenAll(allJobs.Select(job=>runJobAsync(job)))`. *Should* you do so though? Or would this result in *slower* execution due to blocking? All queries run on the disks, RAM – Panagiotis Kanavos Oct 15 '18 at 10:01
  • @PanagiotisKanavos, thanks for your time. i'll try Parallel.For. Meanwhile, i've been reading on parallelism, asynchronicy and concurrency lately, maybe read more than 200 pages, in case that i have misunderstood the terms. i think, asynchronicy does not reduce the time, whereas parallelism can. – Volkan Yurtseven Oct 15 '18 at 10:58
  • this article tells me what i need, https://cpratt.co/async-not-faster/ – Volkan Yurtseven Oct 15 '18 at 18:36

1 Answers1

0

At the first I suggest you to optimize your SQL queries. After that, you need to use Async/await because you don't need to block thread.

Take a look on the following topic:

Async/await and parallel in C#

mohabbati
  • 1,162
  • 1
  • 13
  • 31
  • actually, my code runs at the background since it is scheduled to run at night. so i dont have the concern of blocking the thread. i just want my code to complete as soon as possbile – Volkan Yurtseven Oct 15 '18 at 10:00
  • To my mind at the first you have to optimize sql queries by creating some indexes and other techniques. – mohabbati Oct 15 '18 at 10:06
  • @MohammadMohabbati this isn't the generic "make my queries run faster" question. The OP wants to run quality checks after an ETL job and asks how to avoid running them sequentially. Those types of queries crunch a lot of data and can end up using a lot of RAM, IO bandwidth. Some of them *could* be run in parallel to reduce the total time, as long as they don't contend with each other for resources. – Panagiotis Kanavos Oct 15 '18 at 10:09
  • @MohammadMohabbati which means, how you run the queries isn't as important as what those queries are and which tables they touch. – Panagiotis Kanavos Oct 15 '18 at 10:10
  • @PanagiotisKanavos you are right. I didn't understand the question quite well. – mohabbati Oct 15 '18 at 11:37