
For practice, I'm building a SQLite database filled by a script that scrapes a music review website for each album's name, artist, and rating.

How do I prevent the same data from being duplicated in my table when I run the script multiple times?

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

import urllib.error
import sqlite3

conn = sqlite3.connect('pitchscraper.sqlite')
cur = conn.cursor()

#create table
cur.execute('''
CREATE TABLE IF NOT EXISTS Albums (id INTEGER, rating INTEGER, name TEXT, url TEXT, artist TEXT)''')

#open and read page
req = Request('http://pitchfork.com/reviews/albums/?page=1', headers={'User-Agent': 'Mozilla/5.0'})
pitchpage = urlopen(req).read()

#parse with beautiful soup
soup = BeautifulSoup(pitchpage, "lxml")
albums = soup('h2')
artists = soup.find_all(attrs={"class" : "artist-list"})

print("ALBUMS")
for tag in albums:
    for album in tag:
        print(album)
        # need to fix this so that duplicate rows are not added
        cur.execute('INSERT OR IGNORE INTO Albums (name) VALUES (?)', (album, ))

# save the inserted rows to the database file
conn.commit()
Frank Harb
  • How would you, as a human, notice that the data is already "known" to you? Which values need to be identical to make you think "I know that"? – Yunnosch Apr 25 '17 at 12:23
  • For example, when I run the script, it fills the albums column with the albums 'Hammersmith', 'Midnight', 'Moment'. When I run it a second time, those 3 albums get added below again. This is what I want to prevent. – Frank Harb Apr 25 '17 at 13:12
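The behavior described in the comments follows from how INSERT OR IGNORE works: it only skips a row that would violate a constraint, and the Albums table in the question declares none. A minimal sketch of one possible fix is below. It assumes the album name alone identifies a row (name plus artist may be a safer key in practice), uses an in-memory database instead of pitchscraper.sqlite, and introduces a hypothetical helper, save_albums, that is not part of the original script.

```python
import sqlite3

# In-memory database so the sketch does not touch pitchscraper.sqlite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumption: the album name alone identifies a row. The UNIQUE
# constraint gives INSERT OR IGNORE a conflict it can detect.
cur.execute('''
CREATE TABLE IF NOT EXISTS Albums (
    id INTEGER PRIMARY KEY,
    rating INTEGER,
    name TEXT UNIQUE,
    url TEXT,
    artist TEXT)''')


def save_albums(names):
    """Hypothetical helper: insert each name, skipping names already stored."""
    cur.executemany('INSERT OR IGNORE INTO Albums (name) VALUES (?)',
                    [(name,) for name in names])
    conn.commit()


scraped = ['Hammersmith', 'Midnight', 'Moment']
save_albums(scraped)  # first run inserts three rows
save_albums(scraped)  # second run finds the names present and adds nothing

row_count = cur.execute('SELECT COUNT(*) FROM Albums').fetchone()[0]
stored_names = [row[0] for row in
                cur.execute('SELECT name FROM Albums ORDER BY name')]
print(row_count, stored_names)
```

Note that CREATE TABLE IF NOT EXISTS leaves an already existing table unchanged, so a database file created by the original script would need its table dropped and recreated, or a unique index added after removing the existing duplicates.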
