
I'm creating a web crawler that captures data from a website and inserts it into my database. I'm using Scrapy and MySQL. I wrote the following code:

  • pipelines.py:

    import sys

    import MySQLdb


    class MySQLStorePipeline(object):
        def __init__(self):
            self.conn = MySQLdb.connect(host='localhost', user='root', passwd='',
                                        db='imoveis', charset='utf8',
                                        use_unicode=True)
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            try:
                self.cursor.execute(
                    """INSERT INTO imovel (Titulo, Tipo_Negocio, Preco,
                           Localizacao, Tipo_Imovel, Condicao, Numero_Divisoes,
                           Numero_Quartos, Numero_Casas_Banho,
                           Certificado_Energetico, Ano_Construcao, Area_Util,
                           Area_Bruta, Piso)
                       VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
                           %s, %s)""",
                    (item['Titulo'],
                     item['Tipo_Negocio'],
                     item['Preco'],
                     item['Localizacao'],
                     item['Tipo_Imovel'],
                     item['Condicao'],
                     item['Numero_Divisoes'],
                     item['Numero_Quartos'],
                     item['Numero_Casas_Banho'],
                     item['Certificado_Energetico'],
                     item['Ano_Construcao'],
                     item['Area_Util'],
                     item['Area_Bruta'],
                     item['Piso']))
                self.conn.commit()
            except MySQLdb.Error as e:
                print('Error %d: %s' % (e.args[0], e.args[1]))
                sys.exit(1)
            return item
    
  • settings.py:

    BOT_NAME = 'novo'
    
    SPIDER_MODULES = ['novo.spiders']
    NEWSPIDER_MODULE = 'novo.spiders'
    
    FEED_EXPORT_ENCODING = 'utf-8'
    
    ITEM_PIPELINES = {
      'novo.pipelines.MySQLStorePipeline' : 300
    }
    
    ROBOTSTXT_OBEY = True
    
  • crawler.py:

    import scrapy


    class SapoSpider(scrapy.Spider):
        name = "imoveis"
        allowed_domains = ["maisconsultores.pt"]
        start_urls = ["https://www.maisconsultores.pt/properties?page=%d&s=eedce" % i
                      for i in range(23)]

        def parse(self, response):
            for i in response.css('div.item.col-sm-4'):
                youritem = {
                    'Titulo': i.css('div[class=image] h3::text').extract(),
                    'Tipo_Negocio': i.css('div.price::text').re(r'[^\t\n\r\a]+'),
                }
                subpage_link = i.css('div[class=image] a::attr(href)').extract_first()
                full_url = response.urljoin(subpage_link)
                yield scrapy.Request(full_url, callback=self.parse_subpage,
                                     meta={'item': youritem})

        def parse_subpage(self, response):
            youritem = response.meta.get('item')
            youritem['Tipo_Imovel'] = response.xpath('//ul[@class="amenities"]//li[1]/text()').extract()
            youritem['Condicao'] = response.xpath('//ul[@class="amenities"]//li[2]/text()').extract()
            yield youritem
    

The error that appears when I run Scrapy is this:

_mysql_exceptions.OperationalError: (1241, 'Operand should contain 1 column(s)')

I really don't know or can't see what I am missing here. I would really appreciate it if you could help me out.

Joana
  • @Uueerdo Triple quotes simply allow you to wrap a string over multiple lines. – Barmar Jun 26 '18 at 21:34
  • I don't see how you could be getting that error from the code you posted. The error happens when you use a subquery as an expression, but there are no subqueries in your code. See https://stackoverflow.com/questions/14046838/mysql-operand-should-contain-1-columns for an example. – Barmar Jun 26 '18 at 21:35
  • @Barmar But I did not use any subquery; I simply wanted to insert the data I scraped from the site into my MySQL database through the crawler. I did not see anything in the link you sent that helps me :( – Joana Jun 26 '18 at 21:40
  • Where are you calling `process_item()`? Also, the indentation of `pipelines.py` is all messed up, please fix it. – Barmar Jun 26 '18 at 21:44
  • Are you sure the error is coming from `process_item()`? The format of the error message you're getting is not the same as what it writes in the `except` block. – Barmar Jun 26 '18 at 21:46
  • It prints `Error 1241: ...`, but what you're getting is the raw `OperationalError: (1241, ...)` form. – Barmar Jun 26 '18 at 21:47
  • @Barmar I never said that the error came from `process_item()` ... I do not know where it comes from ... the error just appears when I run the crawler and I can't understand why ... – Joana Jun 26 '18 at 21:53
  • @Barmar The problem probably comes from connecting to the database, because the connection happens before that error appears. – Joana Jun 26 '18 at 21:55
  • Where did you define the pipeline? I don't see any `custom_settings` in SapoSpider – mirhossein Jun 27 '18 at 01:39
  • @mirhossein I did not realize what you meant. Can you explain better? – Joana Jun 27 '18 at 13:09
  • @Joana That error can't come from connecting to the DB. It's complaining about a query. You should try to print a backtrace rather than just exiting the program, so you can see where it's coming from. – Barmar Jun 27 '18 at 15:14
  • @Barmar I tried to rephrase the whole code and now give me this error that I posted in this link: " https://stackoverflow.com/questions/51062223/python-sqlalchemy-exc-operationalerror-unprintable-operationalerror-object". Can you help me? – Joana Jun 27 '18 at 15:41
  • @Joana You must define the desired pipelines in the `custom_settings` variable of SapoSpider, but there is no such variable – mirhossein Jun 27 '18 at 16:30
  • @mirhossein Sorry, but I can not understand what you mean. Can you give me an example? – Joana Jun 27 '18 at 16:34
  • @Joana here https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider at ITEM_PIPELINES section – mirhossein Jun 27 '18 at 18:11
  • 1
    @Joana the problem is that your values are lists, not strings: `extract` returns a list, not a string. To make sure, just replace the item values with string literals and see the result – mirhossein Jun 27 '18 at 19:16
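As the last comment points out, Scrapy's `.extract()` always returns a *list* of matches, even when only one element matched, and MySQLdb renders a list parameter as a parenthesized row value, which is exactly what MySQL error 1241 ("Operand should contain 1 column(s)") complains about. A minimal sketch of the difference, with a made-up scraped value standing in for the real selector output:

```python
# Stand-in for what i.css('div[class=image] h3::text').extract() yields:
# a list, even for a single match.
extracted = ['Apartamento T2']

# Binding the list itself makes MySQLdb emit a row constructor like
# ('Apartamento T2'), which MySQL rejects with error 1241.
bad_param = extracted

# Taking the first element -- what .extract_first() (or .get()) returns --
# binds a plain string, i.e. a single column value. It is None when the
# selector matched nothing.
good_param = extracted[0] if extracted else None

print(type(bad_param).__name__, type(good_param).__name__)
```

Switching every item field in `parse` and `parse_subpage` from `.extract()` to `.extract_first()` should therefore make each bound parameter a single string (or `None`) and clear the 1241 error.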

0 Answers