
I am running many instances of a webcrawler in parallel.

Each crawler selects a domain from a table, inserts that url and a start time into a log table, and then starts crawling the domain.

Other parallel crawlers check the log table to see what domains are already being crawled before selecting their own domain to crawl.

I need to prevent other crawlers from selecting a domain that has just been selected by another crawler but doesn't have a log entry yet. My best guess at how to do this is to lock the database from all other read/writes while one crawler selects a domain and inserts a row in the log table (two queries).

How the heck does one do this? I'm afraid this is terribly complex and relies on many other things. Please help get me started.


This code seems like a good solution (see the error below, however):

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT companies.id FROM companies
        LEFT OUTER JOIN crawlLog
        ON companies.id = crawlLog.companyId
        WHERE crawlLog.companyId IS NULL
        LIMIT 1
    ),
    now()
)

but I keep getting the following mysql error:

You can't specify target table 'crawlLog' for update in FROM clause

Is there a way to accomplish the same thing without this problem? I've tried a couple different ways. Including this:

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT id
        FROM companies
        WHERE id NOT IN (SELECT companyId FROM crawlLog)
        LIMIT 1
    ),
    now()
)
T. Brian Jones

6 Answers

57

You can lock tables using the MySQL LOCK TABLES command like this:

LOCK TABLES tablename WRITE;

# Do other queries here

UNLOCK TABLES;

See:

http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html
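
Applied to the tables from the question, a minimal sketch might look like this (the lock modes are an assumption: crawlLog needs WRITE because we insert into it, and the session holding that lock can still read it; companies only needs READ):

LOCK TABLES companies READ, crawlLog WRITE;

-- find one company that has no log entry yet
SELECT @companyId := companies.id
FROM companies
LEFT JOIN crawlLog ON companies.id = crawlLog.companyId
WHERE crawlLog.companyId IS NULL
LIMIT 1;

-- skip the INSERT if @companyId is NULL (nothing left to crawl)
INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES (@companyId, NOW());

UNLOCK TABLES;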

qbert220
  • When other processes try to read or write to the table during the lock, they will get a mysql error. – Albert Hendriks Jul 06 '15 at 05:36
  • @AlbertHendriks from the docs: "Lock requests for the table by other sessions block while the WRITE lock is held." – Jonah Oct 11 '17 at 21:50
  • Yes, so if they just try to query they'll get an error, but if they try to acquire a lock it will wait. – Albert Hendriks Oct 12 '17 at 12:41
  • 19
    @AlbertHendriks queries against a table that is locked are not rejected with an error (unless, I suppose, the connection is tied to a web service or something and times out). The lock just puts a hold on any other connections wanting to access that table, until the lock is removed. The connections will automatically resume accessing the table as soon as the table is unlocked. – David Mordigal Jan 07 '18 at 05:18
  • @DavidMordigal, will any insert query that runs against a locked table be automatically executed when the table is unlocked? – R.S.K Aug 11 '21 at 10:20
5

Well, table locks are one way to deal with that; but this makes parallel requests impossible. If the table is InnoDB you could force a row lock instead, using SELECT ... FOR UPDATE within a transaction.

BEGIN;

SELECT ... FROM your_table WHERE domainname = ... FOR UPDATE

# do whatever you have to do

COMMIT;

Please note that you will need an index on domainname (or whatever column you use in the WHERE clause) for this to work, but that makes sense in general and I assume you will have one anyway.
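
For the question's companies/crawlLog tables, a rough sketch could look like the following (the id 42 is only an example value; both SELECTs are locking reads, and the application should only issue the INSERT when the second SELECT returns no row):

BEGIN;

-- lock the candidate company row; another crawler trying the same id waits here
SELECT id FROM companies WHERE id = 42 FOR UPDATE;

-- locking read, so we see the latest committed claims for this company
SELECT companyId FROM crawlLog WHERE companyId = 42 FOR UPDATE;

-- only if the previous SELECT returned no row:
INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES (42, NOW());

COMMIT;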

wonk0
  • I'm a little late to the party, but what if I don't have any update query? When exactly will the lock be released in that case? @wonk0 – Rupesh Bhandari Jan 28 '21 at 08:28
4

You probably don't want to lock the table. If you do that you'll have to worry about trapping errors when the other crawlers try to write to the database - which is what you were thinking when you said "...terribly complex and relies on many other things."

Instead you should probably wrap the group of queries in a MySQL transaction (see http://dev.mysql.com/doc/refman/5.0/en/commit.html) like this:

START TRANSACTION;
SELECT @URL:=url FROM tablewiththeurls WHERE uncrawled=1 ORDER BY somecriterion LIMIT 1;
INSERT INTO loggingtable SET url=@URL;
COMMIT;

Or something close to that.

[edit] I just realized - you could probably do everything you need in a single query and not even have to worry about transactions. Something like this:

INSERT INTO loggingtable (url)
SELECT u.url
FROM tablewithurls u
LEFT JOIN loggingtable l ON l.url = u.url
WHERE {some criterion used to pick the url to work on} AND l.url IS NULL;
ratsbane
  • I don't see how that will help; what will keep two parallel instances from reading the same url? – wonk0 Jul 08 '11 at 07:57
  • You're right - it won't. See the second example I just edited in. It joins the logging table on the URL table, ensuring that there's no record in the logging table (l.url IS NULL). – ratsbane Jul 08 '11 at 08:00
  • 1
    Yep, that's better. But one has to keep in mind that this will definitely not work within a transaction (depending on the isolation level) – wonk0 Jul 08 '11 at 08:23
2

I wouldn't use locking, or transactions.

The easiest way to go is to INSERT a record in the logging table if it's not yet present, and then check for that record.

Assume you have tblcrawlers (cra_id) that is filled with your crawlers, tblurl (url_id) that is filled with the URLs, and a table tbllogging (log_cra_id, log_url_id) for your logfile.

You would run the following query if crawler 1 wants to start crawling url 2:

INSERT INTO tbllogging (log_cra_id, log_url_id)
SELECT 1, url_id FROM tblurl LEFT JOIN tbllogging ON url_id = log_url_id
WHERE url_id = 2 AND log_url_id IS NULL;

The next step is to check whether this record has been inserted.

SELECT * FROM tbllogging WHERE log_url_id=2 AND log_cra_id=1

If you get any results, crawler 1 can crawl this url. If you don't get any results, another crawler inserted the same row first and is already crawling it.
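
Mapped onto the question's tables this might look roughly as follows; note that the crawlerId column is an assumption (the question's crawlLog only has companyId and timeStartCrawling), and the alias on crawlLog in the join sidesteps MySQL's restriction on reusing the insert target:

-- crawler 7 tries to claim company 42 (both ids are example values)
INSERT INTO crawlLog (crawlerId, companyId, timeStartCrawling)
SELECT 7, companies.id, NOW()
FROM companies
LEFT JOIN crawlLog AS claimed ON companies.id = claimed.companyId
WHERE companies.id = 42 AND claimed.companyId IS NULL;

-- if this returns a row, crawler 7 won the claim and may start crawling
SELECT * FROM crawlLog WHERE companyId = 42 AND crawlerId = 7;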

Eljakim
  • I keep getting this error trying to implement what you suggested: "You can't specify target table 'crawlLog' for update in FROM clause" -- see my updated question for my code. – T. Brian Jones Jul 08 '11 at 20:28
  • I am missing part of what you suggested, however, because I'm not quite sure about part of your code. What are the 1 and 2 doing in there? – T. Brian Jones Jul 08 '11 at 20:33
  • The 1 is the id of the crawler, the 2 is the id of the url. They come from the tables tblcrawlers and tblurl. – Eljakim Jul 12 '11 at 09:23
2

I got some inspiration from @Eljakim's answer and started this new thread where I figured out a great trick. It doesn't involve locking anything and is very simple.

INSERT INTO crawlLog (companyId, timeStartCrawling)
SELECT id, now()
FROM companies
WHERE id NOT IN
(
    SELECT companyId
    FROM crawlLog AS crawlLogAlias
)
LIMIT 1
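
To find out which company was actually claimed, one option (assuming crawlLog has an AUTO_INCREMENT primary key named id, which the question doesn't show) is to read the row back, as the comments below discuss:

-- check that the INSERT actually affected a row (ROW_COUNT() = 1)
-- before relying on LAST_INSERT_ID(); 0 rows means nothing was left to claim
SELECT companyId
FROM crawlLog
WHERE id = LAST_INSERT_ID();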
T. Brian Jones
  • 1
    You then do a `SELECT LAST_INSERT_ID();` to get ID of the log entry which has the domain the crawler should crawl, don't you? – tonix Dec 13 '17 at 18:22
  • 2
    I also don't think this is reliable. I have a multithreaded application similarly trying to insert unique jobs on a cron timer. If they fire too close together -- like down to the 3-digit microsecond -- they'll still insert despite the WHERE check. – Rikaelus Mar 19 '18 at 04:44
  • @Rikaelus I have the same exact issue where in a multithreaded application the insert still happens despite the "where not in" clause. How did you go about resolving your issue? – 1Mojojojo1 Aug 29 '18 at 16:26
  • @PramodRoy Difficult to explain in a comment but it was some DBA-fu. I INSERT with a repeatable identifier and a "claiming" column set to 1. Those two columns are a unique key. I wait 5 seconds and then set the "claiming" column to NULL, then do the job. Any other thread/process that tries to use that same identifier and claiming=1 throws an IntegrityError exception (duplicate key) which I catch and just abort. It takes advantage of how NULL is handled in unique keys. This may not work for applications in which that "wait" time isn't feasible. – Rikaelus Aug 29 '18 at 20:44
  • @Rikaelus That sounds like a few changes to the schema. For me however, since my application is not very data intensive, Acquiring a lock on the table was the only solution for me. This does not reduce the throughput by a significant amount for me since I am only acquiring the table lock in specific scenarios and only for one table in the database. – 1Mojojojo1 Aug 30 '18 at 22:42
  • @PramodRoy Things like this, you use what works. I can't remember the particulars now but I had tried locks and they weren't working for my application. Glad you got a solution, though. – Rikaelus Aug 31 '18 at 02:22
  • @Rikaelus why not just introduce a random time delay (between 1 and 5 seconds) just before sending the query from the app to the DB? Every crawler launched by cron would then have a different 'query start' time. – Andrew Aug 12 '20 at 15:14
0

It's better to use a row lock or a transaction-based query so that other parallel requests can still access the table.

Hims1911