
I am writing my own crawler, and my question is about indexes.

I have two columns that are relevant to indexing.

One is the pageurl column and the other is the hashcode column.

The pageurl column is a VARCHAR and the hashcode column is a 64-bit integer.

This is the main query I am executing:

SELECT PageId FROM tblPages WHERE HashCode=biginthashcode AND PageUrl='pageurl'

PageId is an IDENTITY primary key.

This is the index I was using:

CREATE NONCLUSTERED INDEX indexHashCode ON tblPages (HashCode)
INCLUDE (PageUrl, PageId)

But with this index I end up with many duplicate rows, because of the multi-threaded nature of the software - probably caused by the delay between my SQL existence check and the insert.

So I have to either change it to the unique index below:

CREATE UNIQUE NONCLUSTERED INDEX indexHashCode ON tblPages (HashCode, PageUrl)
INCLUDE (PageId)

or somehow make it not add duplicate values in the first place. What counts as a duplicate value?

A duplicate means a row where both hashcode and pageurl are the same. Is that possible with my first index, without creating a unique index like the one above?

I'm using Microsoft SQL Server 2008.

  • If you're getting duplicate rows because of your multi-threading, doesn't that indicate that your multiple threads are all doing the same work as each other, which is rather wasteful of effort? Your setup isn't clear. For example, is the hash code the hash of the page as presented, or the hash of the URL? If you don't want duplicates in the table for the combination of URL and hash, make an index that prevents that from happening - a unique index on hash and URL. – Jonathan Leffler Feb 05 '12 at 16:23
  • Imagine that 10 processes keep crawling pages. They all find hundreds of URLs on the pages, so most of them are the same. Apparently the SQL Server delay causes my software's check of whether a URL already exists to fail, so I have to make SQL Server prevent adding the same URL more than once. – Furkan Gözükara Feb 05 '12 at 16:35
  • Also, the identity column does not necessarily have to be the primary key. You can use the URL as the primary key if you'd like (if unique). – sam yi Feb 05 '12 at 16:37
  • That causes extreme CPU usage; because of that I am using 64-bit hashing. I tried it. – Furkan Gözükara Feb 05 '12 at 16:41

2 Answers


As Jonathan Leffler's comment points out, you are probably trying to solve the problem too late in the process. Why does your multi-threaded code generate duplicates in the first place?

That said, if the duplicate row creation is unavoidable, using a unique index might be reasonable. You still want to check whether a row exists before inserting it, however; usually a stored procedure is used for that. Either way, a transaction that contains both the check and the insert should prevent duplicates.
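
For illustration, here is a minimal sketch of such a procedure, reusing the table and column names from the question (the procedure name, the parameter sizes, and the locking hints are my own assumptions). The UPDLOCK and HOLDLOCK hints make the existence check take and hold a key-range lock until the transaction commits, so two threads cannot both pass the check for the same (HashCode, PageUrl) pair:

CREATE PROCEDURE dbo.InsertPageIfNew
    @HashCode BIGINT,        -- the 64-bit hash from the question
    @PageUrl  VARCHAR(850)   -- assumed size; use your actual column definition
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
    -- UPDLOCK + HOLDLOCK serialize concurrent check-then-insert attempts
    IF NOT EXISTS (SELECT 1 FROM tblPages WITH (UPDLOCK, HOLDLOCK)
                   WHERE HashCode = @HashCode AND PageUrl = @PageUrl)
    BEGIN
        INSERT INTO tblPages (HashCode, PageUrl)
        VALUES (@HashCode, @PageUrl);
    END
    COMMIT TRANSACTION;
END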

Finally, for what it is worth, hash codes are not usually stored in a database. They are not unique, so it is possible for two URLs to have the same hash code.

Chris Shain
  • Thanks for the answer, but it seems you did not understand the issue here. The issue is that I have many crawler processes; they crawl pages and find the same URL many times. I am already checking whether that URL is in the database or not, but apparently the SQL Server delay causes that check to fail. So I have to make SQL Server prevent inserting the same value. – Furkan Gözükara Feb 05 '12 at 16:37
  • No, he did. Your code approach is broken. There are nice ways to handle this with a queue in the middle and avoid double URL insertion to start with. Not exactly black science - just "not junior programmer level". And yes, I run a system doing a lot more than your crawler: 96 threads, no duplicates. I handle them earlier in the system. – TomTom Feb 05 '12 at 16:44
  • Instead of just telling me it should be done that way, showing some good examples would help more. Everybody keeps saying how it should be planned, but there isn't a single proper example of it, so I am building my own solutions. For example, right now I am able to set my thread count to whatever I want, but I could not find even a single example of keeping a chosen number of threads running all the time. Yes, my current system keeps a certain number of threads alive all the time. – Furkan Gözükara Feb 05 '12 at 16:50
  • We are here to help you out, but there is an expectation that you are going to use a search engine and some trial and error to figure out how to do exactly what you want. In this case the problem isn't the number of threads - all non-trivial crawlers are multithreaded - but the fact that you have an expectation (no duplicate rows) that is being violated by *your own code*. All we are saying is that, based on our years of experience with this type of thing, you'll end up with a better result if you prevent the code from generating duplicates than if you try to prevent them in the database. – Chris Shain Feb 05 '12 at 17:03

It looks like you just want to update if (hashcode,pageurl) already exists. In that case take a look here for your options.
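
For example, SQL Server 2008 supports the MERGE statement for this kind of upsert. A minimal sketch, assuming the table and column names from the question and hypothetical @HashCode/@PageUrl variables:

MERGE tblPages WITH (HOLDLOCK) AS target
USING (SELECT @HashCode AS HashCode, @PageUrl AS PageUrl) AS source
   ON target.HashCode = source.HashCode
  AND target.PageUrl  = source.PageUrl
WHEN NOT MATCHED THEN
    INSERT (HashCode, PageUrl)
    VALUES (source.HashCode, source.PageUrl);

The HOLDLOCK hint matters here: without it, two concurrent MERGE statements can still both pass the NOT MATCHED check and insert the same row.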

If you don't want to update, but simply don't want to insert the new value, you also have many options (query and check whether it exists, modify a solution from the link above, or something else - for example, the index option sketched below), but most likely you are doing something wrong in your threads, since many threads are doing the same job and the results of all but one of them will just be discarded.
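
One such option (my suggestion, not something from the question) is to create the unique index with IGNORE_DUP_KEY = ON, which makes SQL Server silently discard duplicate inserts with a warning instead of raising an error. Note that an index key in SQL Server 2008 is limited to 900 bytes, so this only works if pageurl is declared short enough:

CREATE UNIQUE NONCLUSTERED INDEX indexHashCode
    ON tblPages (HashCode, PageUrl)
    INCLUDE (PageId)
    WITH (IGNORE_DUP_KEY = ON);

With this in place, concurrent threads can all attempt the same INSERT, and only the first one actually adds a row.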

Aleksandar Vucetic
  • Actually I am checking whether that hashcode and pageurl combination exists or not, but apparently, due to the SQL Server delay, multiple threads are adding the same hashcode and pageurl combination multiple times. So I want SQL Server to prevent adding the same hashcode & pageurl combination more than once. – Furkan Gözükara Feb 05 '12 at 16:32
  • Do you wrap your "check if exists" + "insert" code in a transaction? If you wrap it in a transaction, everything should be fine. – Aleksandar Vucetic Feb 05 '12 at 16:36
  • Nope, I don't do that. This is my insert command; can you modify it? http://pastebin.com/wYjCqs2y – Furkan Gözükara Feb 05 '12 at 16:38
  • Where do you check if it already exists? Please take a look at the answers in the SO link I pointed to in my answer... but I still think that your threads are doing too much work and you should sync them better :) – Aleksandar Vucetic Feb 05 '12 at 16:42
  • Right now I have added IF NOT EXISTS ... BEGIN INSERT ... END to the query. I hope this way there won't be any duplicates. Now just wait and see :) – Furkan Gözükara Feb 05 '12 at 21:59
  • Your solution solved the problem, though the CPU usage of SQL Server increased :D Actually, I had another function to retrieve the page ID before inserting the URL; I was calling that to check whether the URL had already been inserted, but the delay was apparently causing multiple inserts. Now there are no duplicates. – Furkan Gözükara Feb 06 '12 at 20:46