1

I have a web crawler. The web crawler gathers the links from web pages I give it but when it is retrieving the links, some links are duplicated due to the website. is there a way in MYSQL to overwrite data if a new row is the exact same as an old row.

Say if I have http://www.facebook.com in a link field

I also manage to pick up http://www.facebook.com again, I would like the latter to overwrite the old row. therefore I don't have clashes on my search engine.

Kara
  • 6,115
  • 16
  • 50
  • 57
Noah Smith
  • 203
  • 4
  • 9

4 Answers4

2

I'm assuming that you want to update a last_updated date if the url already exists. Else there is no good reason to do an update.

 INSERT INTO `scrapping_table`
    (`url`)
    VALUES
    ("www.facebook.com")
    ON DUPLICATE KEY UPDATE
    `date_updated` = `datetime.now()`
Pratik Mandrekar
  • 9,362
  • 4
  • 45
  • 65
  • i have the table, 'pages' and can you explain that a bit more – Noah Smith Aug 16 '12 at 17:54
  • So your `pages` table has a column `url` which stores the `facebook.com`. Now you are crawling the web again and your crawler finds `facebook.com` again. Updating the `url` now is a redundant step unless of course you can store some more meta information about this new crawling by storing the `date` you performed this update. – Pratik Mandrekar Aug 16 '12 at 18:04
  • In larger systems like Google, crawling can take a few weeks and storing the `last time I saw this url` information is useful to the search algorithm and so that the crawler does not keep going over the same urls. On a slightly different note, Google uses `inverted indices` which is a good strategy for building a full-text search. – Pratik Mandrekar Aug 16 '12 at 18:06
1

look into ON DUPLICATE KEY actions

http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html

Basically make the columns you're concerned with a unique key write your insert statement and then add

ON DUPLICATE KEY UPDATE col = overwriting value
invertedSpear
  • 10,864
  • 5
  • 39
  • 77
0

If your link field is unique than you can use

 INSERT INTO "mytable" (link_field, x_column, y_column) VALUES ("www.facebook.com",'something new for x','something new for y')
 ON DUPLICATE KEY UPDATE x_column='something new for x', y_column='something new for y'

Just make sure your link field is unique and if you have more unique fields in your column, I suggest use this second method because they suggest avoid using an ON DUPLICATE KEY clause on tables with multiple unique indexes.

  1. set your link field as unique.
  2. before inserting a row try

    Select "primary_id" from mytable where link_field="www.facebook.com"
    
  3. Count the number of returned row from this SQL.

    =>If count>0 then UPDATE the row using the "primary_id" we just grabbed through the SELECT SQL

    => if count==0 , just insert your row

Potheek
  • 991
  • 1
  • 8
  • 17
0

beware!!

while operating a web crawler that probably will find millions of links you want to minimize the query's each "crawl" process fires... do you want to create a unique link table that will feed the bots? or do you want to prevent duplicate search results?

unique url pool table:

  • while crawling the page - you should save url's to an array (or list) and making sure (!in_array()) that the its a unique value array, you will find that each page you crawl includes alot of repeated links - so clean them before using sql.
  • covert the urls to hashes ("simhash" of 32 digits [1,0]).
  • now open a connection to db and check if exists if it does dump them! don't update (its making a second process). you should match the links using the hashes over an indexed table it will be far more faster.

prevent duplicate results search:

  • if you indexed the url in the above methodology you should not find duplicate url's, if you have, it means there is a problem in your crawling operation.

  • even if you have duplicate values in another table and you want to search it but not returning duplicate results you can use DISTINCT in your query.

good luck!

Shlomi Hassid
  • 6,500
  • 3
  • 27
  • 48