Background:
I built a scraper in Python (not sure if that matters). I scrape the website and update my html table. The main table stores autogenerated_id, url, raw_html, date_it_was_scrapped, and last_date_the_page_was_updated (provided by the website). My table has many duplicate urls, which it shouldn't, so I am planning to make url unique in the database.
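Before the unique constraint can be added, the existing duplicates have to be removed. A minimal sketch of that order of operations, using an in-memory SQLite database as a stand-in for the real table (MySQL's DDL syntax differs slightly, and the table contents here are invented for illustration):

```python
import sqlite3

# Stand-in for the real table, seeded with a duplicate url.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE html_table (
    autogenerated_id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT,
    raw_html TEXT,
    date_it_was_scrapped TEXT,
    last_date_the_page_was_updated TEXT
);
INSERT INTO html_table (url, raw_html, date_it_was_scrapped, last_date_the_page_was_updated)
VALUES ('http://example.com/a', '<p>old</p>', '2023-01-01', '2023-01-01'),
       ('http://example.com/a', '<p>new</p>', '2023-02-01', '2023-02-01');
""")

# Remove duplicates, keeping only the most recently inserted row per url...
conn.execute("""
DELETE FROM html_table
WHERE autogenerated_id NOT IN (
    SELECT MAX(autogenerated_id) FROM html_table GROUP BY url
)
""")

# ...then the unique index can be created without a duplicate-key error.
conn.execute("CREATE UNIQUE INDEX idx_html_table_url ON html_table (url)")
conn.commit()

rows = conn.execute("SELECT url, raw_html FROM html_table").fetchall()
print(rows)
```

Doing the delete first matters: creating the unique index on a table that still contains duplicate urls fails outright.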
Desired outcome:
I only want to insert a row if the url doesn't exist, and update the html only if last_date_the_page_was_updated > date_it_was_scrapped.
Solution:
The following Stack Overflow post shows how.
I haven't tested it because of the warning in the accepted answer: an INSERT ... ON DUPLICATE KEY UPDATE statement against a table having more than one unique or primary key is also marked as unsafe.
This is what I plan to do, based on that post:
INSERT INTO html_table (url, raw_html, date_it_was_scrapped, last_date_the_page_was_updated)
VALUES (the data)
ON DUPLICATE KEY UPDATE
url = VALUES(url),
raw_html = VALUES(raw_html),
date_it_was_scrapped = VALUES(date_it_was_scrapped),
last_date_the_page_was_updated = VALUES(last_date_the_page_was_updated)
WHERE last_date_the_page_was_updated > date_it_was_scrapped
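One wrinkle with the statement above, separate from the unsafe warning: MySQL's INSERT ... ON DUPLICATE KEY UPDATE does not accept a WHERE clause at all, so the query as written would be rejected. The usual workaround is to move the condition into the assignments with IF(), e.g. raw_html = IF(VALUES(last_date_the_page_was_updated) > date_it_was_scrapped, VALUES(raw_html), raw_html). As a self-contained sketch of the intended behaviour, here is the same logic in SQLite, whose ON CONFLICT ... DO UPDATE clause does support a trailing WHERE condition (column names as above; dates stored as ISO-8601 strings so plain text comparison orders them correctly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE html_table (
    url TEXT PRIMARY KEY,
    raw_html TEXT,
    date_it_was_scrapped TEXT,
    last_date_the_page_was_updated TEXT
)
""")

# Insert a new url, or update an existing row only when the page was
# updated after we last scraped it.
upsert = """
INSERT INTO html_table (url, raw_html, date_it_was_scrapped, last_date_the_page_was_updated)
VALUES (?, ?, ?, ?)
ON CONFLICT (url) DO UPDATE SET
    raw_html = excluded.raw_html,
    date_it_was_scrapped = excluded.date_it_was_scrapped,
    last_date_the_page_was_updated = excluded.last_date_the_page_was_updated
WHERE excluded.last_date_the_page_was_updated > html_table.date_it_was_scrapped
"""

# First scrape: no conflict, the row is inserted.
conn.execute(upsert, ("http://example.com/a", "<p>v1</p>", "2023-01-10", "2023-01-05"))
# Page unchanged since that scrape: the WHERE condition is false, row kept as-is.
conn.execute(upsert, ("http://example.com/a", "<p>ignored</p>", "2023-01-11", "2023-01-05"))
# Page updated after the last scrape: the condition is true, row is refreshed.
conn.execute(upsert, ("http://example.com/a", "<p>v2</p>", "2023-02-01", "2023-01-20"))

row = conn.execute(
    "SELECT raw_html FROM html_table WHERE url = ?", ("http://example.com/a",)
).fetchone()
print(row)
```

The example urls, html snippets, and dates are made up for illustration; the point is that the condition lives in the upsert itself rather than in an unsupported WHERE clause on ON DUPLICATE KEY UPDATE.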
Question:
What is unsafe about it, and is there a safe way to do it?