I am building a simple scraper that connects to an Azure SQL server and executes a SQL statement based on the data collected. I built my pipelines using pymssql, but at this point I get the following error:
pymssql._pymssql.OperationalError: (2714, b"There is already an object named 'idP' in the database.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n")
I think what happens is that the pipeline tries to create the table again for every item scraped. How can I create the table once and keep updating it?
Here's my scraper:
import scrapy
from sqlAzure.items import SqlazureItem


class AzureSpider(scrapy.Spider):
    name = 'azure'
    # allowed_domains = ['x']
    start_urls = ['https://books.toscrape.com']

    def parse(self, response):
        items = SqlazureItem()
        container = response.xpath("//ol[@class = 'row']")
        for data in container:
            for ab in data.xpath('.//p[@class = "price_color"]//text()').getall():
                items = SqlazureItem()
                items['price'] = int(ab.split('£')[-1].split('.')[0])
                yield items
Pipelines:
from itemadapter import ItemAdapter
import pymssql


class SqlazurePipeline:
    def __init__(self):
        self.conn = pymssql.connect("xxx.database.windows.net", "xxx", "xxx", "xxx")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.cursor.execute("CREATE TABLE idP (price int)",
                            (item['price']))
        self.conn.commit()
        return item
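For context on the error: Scrapy calls process_item once per yielded item, so the CREATE TABLE statement above runs on every item, and the second attempt fails with error 2714 ("There is already an object named 'idP'"). One common pattern is to create the table once in open_spider (guarded so it is skipped if the table exists) and only INSERT in process_item. Below is a minimal, hedged sketch of that pattern; it uses sqlite3 in place of pymssql purely so the example is self-contained and runnable, and the SqlazurePipelineSketch class and in-memory database are illustrative assumptions, not the original code:

```python
import sqlite3

# Sketch of the "create once, insert per item" pattern, with sqlite3 as a
# runnable stand-in for pymssql. With pymssql against SQL Server you would
# use %s placeholders instead of ?, and a guarded create such as:
#   IF OBJECT_ID('dbo.idP', 'U') IS NULL CREATE TABLE idP (price int)
class SqlazurePipelineSketch:
    def open_spider(self, spider):
        # Runs once when the spider starts: connect and create the
        # table only if it does not already exist.
        self.conn = sqlite3.connect(":memory:")
        self.cursor = self.conn.cursor()
        self.cursor.execute("CREATE TABLE IF NOT EXISTS idP (price int)")
        self.conn.commit()

    def process_item(self, item, spider):
        # Runs once per item: insert only, never create.
        self.cursor.execute("INSERT INTO idP (price) VALUES (?)",
                            (item['price'],))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes: release the connection.
        self.conn.close()


# Usage with plain dicts standing in for SqlazureItem:
pipeline = SqlazurePipelineSketch()
pipeline.open_spider(None)
for price in (51, 53, 50):
    pipeline.process_item({'price': price}, None)
pipeline.cursor.execute("SELECT COUNT(*) FROM idP")
print(pipeline.cursor.fetchone()[0])  # → 3
pipeline.close_spider(None)
```

The same shape carries over to the pymssql pipeline: move the connection and the guarded CREATE into open_spider, keep process_item down to a single parameterized INSERT, and close the connection in close_spider.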