
I have a file-share crawler (it collects permissions and stores them somewhere for a later audit). Currently it starts multiple threads to crawl the same folder tree, to speed up the process.

In C#, each SqlConnection object has its own SqlTransaction, initiated by the SqlConnection.BeginTransaction() call.

Here is the pseudo-code of the current solution (a C# sketch follows the list):

  1. Get list of folders
  2. For each folder get list of sub-folders
  3. For each sub folder start a thread to collect file shares
  4. Each thread will save collected data to database
  5. Run Audit reports on the database
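For concreteness, a minimal sketch of that structure in C#; the folder enumeration is real, but `CollectShares` and `SaveToDatabase` are placeholder names for the actual crawl and persistence logic:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.IO;
using System.Threading.Tasks;

class Crawler
{
    public void Crawl(string root, string connectionString)
    {
        var tasks = new List<Task>();
        foreach (string folder in Directory.EnumerateDirectories(root))
        {
            foreach (string subFolder in Directory.EnumerateDirectories(folder))
            {
                tasks.Add(Task.Run(() =>
                {
                    // Each task opens its own connection, so each writes
                    // independently: if this task fails, data already written
                    // by the other tasks stays in the database.
                    using (var conn = new SqlConnection(connectionString))
                    {
                        conn.Open();
                        foreach (var share in CollectShares(subFolder)) // placeholder
                            SaveToDatabase(conn, share);                // placeholder
                    }
                }));
            }
        }
        Task.WaitAll(tasks.ToArray()); // a failed task only surfaces here, after the fact
    }

    // Placeholders for the actual crawl and persistence logic.
    IEnumerable<object> CollectShares(string path) { yield break; }
    void SaveToDatabase(SqlConnection conn, object share) { }
}
```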

The problem arises when one of the sub-folder threads fails: we end up with a partially scanned folder tree, which is hard to detect. The main reason is that each thread runs on a separate connection.

I would like each folder to be committed in a single transaction, rather than ending up with an incomplete scan (the current situation when some threads fail). No transaction logic is implemented yet; I am evaluating the options.

Based on the comments on this answer, a producer/consumer queue would be an option, but unfortunately memory is a limiting factor (due to the number of threads started). If the queue were spilled to disk to get around the RAM limit, execution time would go up (disk I/O is far slower than memory I/O). I seem to be stuck with a memory/time trade-off. Any other suggestions?

Moslem Ben Dhaou
  • It's not really clear what you're asking - if you don't have enough memory to hold everything you want in a single transaction, I doubt that there's going to be a solution to that. It's also not clear what you mean by "in case the producer/consumer space are committed to disk". – Jon Skeet May 30 '14 at 10:27
  • If you are in such considerations, I would think your transaction would run for a long time as well, thereby possibly locking the data/resources for too long and causing issues. I think you'll need a different approach altogether. – Allan S. Hansen May 30 '14 at 10:29
  • @JonSkeet Please have a look at the edited question. I hope it is clear now – Moslem Ben Dhaou May 30 '14 at 10:35
  • @AllanS.Hansen I am aware of the potential locking problem. I am using a temp table to collect the data before moving it to the reporting table. Concurrent `INSERT`s should not create an issue. I am open to suggestions. – Moslem Ben Dhaou May 30 '14 at 10:38
  • Concurrent inserts will cause an issue if you wrap (custom) transactions around them. And if you're not on top of your transactional logic, you might even deadlock yourself. A lot depends on your situation, but if starting many threads hits a memory limit, I'd consider starting fewer threads to begin with and collecting data in the application layer, giving you fewer updates/inserts into the database and thereby shorter transactions. – Allan S. Hansen May 30 '14 at 10:49

1 Answer


It is possible to share the same transaction across multiple connections with SQL Server using the obsolete transaction-binding feature (sp_getbindtoken/sp_bindsession). I have never used it and I wouldn't base new development on it; it also seems unnecessary here.

Can't you just have all the producers use the same connection and transaction? Put a lock around it. This obviously bottlenecks the process but it might still be fast enough.
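A minimal sketch of that idea, assuming a reporting table `SharePermissions(Path, Permission)` (the table name and `SaveShare` method are illustrative, not from the question):

```csharp
using System.Data.SqlClient;

class SharedTransactionWriter
{
    private readonly object _gate = new object();
    private readonly SqlConnection _conn;
    private readonly SqlTransaction _tx;

    public SharedTransactionWriter(string connectionString)
    {
        _conn = new SqlConnection(connectionString);
        _conn.Open();
        _tx = _conn.BeginTransaction(); // one transaction for all crawl threads
    }

    public void SaveShare(string path, string permission)
    {
        lock (_gate) // SqlConnection is not thread-safe; serialize all access
        {
            using (var cmd = _conn.CreateCommand())
            {
                cmd.Transaction = _tx; // commands must be enlisted explicitly
                cmd.CommandText =
                    "INSERT INTO SharePermissions (Path, Permission) VALUES (@p, @perm)";
                cmd.Parameters.AddWithValue("@p", path);
                cmd.Parameters.AddWithValue("@perm", permission);
                cmd.ExecuteNonQuery();
            }
        }
    }

    public void Commit()   { lock (_gate) _tx.Commit(); }
    public void Rollback() { lock (_gate) _tx.Rollback(); }
}
```

The lock serializes all inserts, so throughput is bounded by single-connection insert speed; combining it with the batching below reduces round trips and keeps the critical section short.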

You say you execute INSERT statements. For bulk inserts you can use the SqlBulkCopy class, which is very much faster. Batch up the rows and only execute a bulk insert once you have well over 1,000 rows buffered.
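A sketch of that batching, reusing the assumed `SharePermissions` table from above; the 5,000-row threshold is an arbitrary illustrative value:

```csharp
using System.Data;
using System.Data.SqlClient;

class BulkWriter
{
    private readonly SqlConnection _conn;
    private readonly SqlTransaction _tx;
    private readonly DataTable _buffer = new DataTable();
    private const int FlushThreshold = 5000; // flush only when well over 1,000 rows

    public BulkWriter(SqlConnection conn, SqlTransaction tx)
    {
        _conn = conn;
        _tx = tx;
        _buffer.Columns.Add("Path", typeof(string));
        _buffer.Columns.Add("Permission", typeof(string));
    }

    public void Add(string path, string permission)
    {
        _buffer.Rows.Add(path, permission);
        if (_buffer.Rows.Count >= FlushThreshold) Flush();
    }

    public void Flush()
    {
        if (_buffer.Rows.Count == 0) return;
        // SqlBulkCopy can enlist in an existing transaction on the same connection.
        using (var bulk = new SqlBulkCopy(_conn, SqlBulkCopyOptions.Default, _tx))
        {
            bulk.DestinationTableName = "SharePermissions"; // assumed table name
            bulk.WriteToServer(_buffer);
        }
        _buffer.Clear();
    }
}
```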

I don't even see the need for producer/consumer here. It would indeed benefit performance by pipelining production with consumption, but it also introduces far more complex threading. If you want to go that route, stream the produced rows into SqlBulkCopy without intermediate buffering by exposing them through an IDataReader implementation (SqlBulkCopy.WriteToServer accepts an IDataReader; an IEnumerable<SqlDataRecord> is what you would pass as a table-valued parameter, not to SqlBulkCopy).
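On the memory concern from the question: a bounded queue caps memory use regardless of how many crawl threads run, because producers block when the queue is full. A sketch using BlockingCollection, with the capacity, batch size, and `SharePermissions` table all illustrative assumptions:

```csharp
using System.Collections.Concurrent;
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

class Pipeline
{
    public static void Run(string connectionString)
    {
        // Bounded capacity caps memory: Add() blocks when 10,000 rows are queued.
        var queue = new BlockingCollection<(string Path, string Permission)>(boundedCapacity: 10000);

        var consumer = Task.Run(() =>
        {
            var batch = NewBatchTable();
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                foreach (var row in queue.GetConsumingEnumerable())
                {
                    batch.Rows.Add(row.Path, row.Permission);
                    if (batch.Rows.Count >= 5000) { BulkInsert(conn, batch); batch.Clear(); }
                }
                if (batch.Rows.Count > 0) BulkInsert(conn, batch); // final partial batch
            }
        });

        // ... crawl threads call queue.Add((path, permission)) here ...
        queue.CompleteAdding(); // signal end of production once all producers finish
        consumer.Wait();
    }

    static DataTable NewBatchTable()
    {
        var t = new DataTable();
        t.Columns.Add("Path", typeof(string));
        t.Columns.Add("Permission", typeof(string));
        return t;
    }

    static void BulkInsert(SqlConnection conn, DataTable batch)
    {
        using (var bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "SharePermissions"; // assumed table name
            bulk.WriteToServer(batch);
        }
    }
}
```

A single consumer on a single connection also makes it trivial to wrap everything in one transaction, which is the original goal.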

usr