I'm trying to parse a Russian website (in Cyrillic) and insert the data into a MySQL DB. The parsing works fine, but I can't save the data in the DB because of the Cyrillic letters. Python gives me this error:

Traceback (most recent call last):
  File "/Users/kr/PycharmProjects/education_py/vape_map.py", line 40, in <module>
    print parse_shop_meta()
  File "/Users/kr/PycharmProjects/education_py/vape_map.py", line 35, in parse_shop_meta
    VALUES (%s, %s, %s, %s)""",(shop_title, shop_address, shop_phone, shop_site, shop_desc))
  File "/Library/Python/2.7/site-packages/MySQLdb/cursors.py", line 210, in execute
    query = query % args
TypeError: not all arguments converted during string formatting

My code:

# -*- coding: utf-8 -*-
import requests
from lxml.html import fromstring
import csv
import MySQLdb


db = MySQLdb.connect(host="localhost", user="root", passwd="***", db="vape_map", charset='utf8')

def get_shop_urls():
    i = 1
    all_shop_urls = []
    while i < 2:
        url = requests.get("http://vapemap.ru/shop/?city=%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0&page={}".format(i))
        page_html = fromstring(url.content)
        shop_urls = page_html.xpath('//h3[@class="title"]/a/@href')
        all_shop_urls += shop_urls
        i +=1
    return all_shop_urls

def parse_shop_meta():
    shops_meta = []
    csvfile = open('vape_shops.csv', 'wb')
    writer = csv.writer(csvfile, quotechar='|', quoting=csv.QUOTE_ALL)
    cursor = db.cursor()
    for shop in get_shop_urls():
        url = requests.get("http://vapemap.ru{}".format(shop), 'utf-8')
        page_html = fromstring(url.content)
        shop_title = page_html.xpath('//h1[@class="title"]/text()')
        shop_address = page_html.xpath('//div[@class="address"]/text()')
        shop_phone = page_html.xpath('//div[@class="phone"]/a/text()')
        shop_site = page_html.xpath('//div[@class="site"]/a/text()')
        shop_desc = page_html.xpath('//div[@class="shop-desc"]/text()')
        sql = """INSERT INTO vape_shops(title, address, phone, site, description)
            VALUES (%s, %s, %s, %s)""",(shop_title, shop_address, shop_phone, shop_site, shop_desc)
        cursor.execute(sql, (shop_title[0], shop_address[0], shop_phone[0], shop_site[0], shop_desc[0]))
        db.commit()
    db.close()
    return shops_meta

print parse_shop_meta()
DrakaSAN
Konstantin Rusanov
  • For debugging purposes, I suggest you print `url.encoding` after your `GET` request to see if the guess made by the `requests` module is in fact a good one. Otherwise, you may want to change the encoding with `url.encoding = 'YOUR ENCODING HERE'`. Also, you can try and `decode` the strings you're using in `sql` query. Finally, are you sure you're formatting that `sql` query properly? It looks like you're creating a tuple. – Abdou Aug 12 '16 at 11:10
  • Try `unicode(your_string, encoding='utf-8', errors='ignore')` for all of the things that you're inserting into your db. – gr1zzly be4r Aug 12 '16 at 12:58
  • Abdou, I saw the error about the tuple. My MySQL skills are not so good. Can you help me with this code? I need to insert shops (in Cyrillic) into my DB – Konstantin Rusanov Aug 12 '16 at 13:14
  • @KonstantinRusanov please provide a full stacktrace – Alastair McCormack Aug 12 '16 at 14:35
  • Alastair McCormack, added – Konstantin Rusanov Aug 12 '16 at 15:07
  • @KonstantinRusanov take a look at [this](https://www.dropbox.com/s/lcfcc6v0e8f98dg/vapeshops.py?dl=0). It could be a good starting point. I retouched your original script a little bit. I also use `pymysql` instead of `mysqldb`. But it should not matter much, as long as you're able to connect to your database without any errors. – Abdou Aug 12 '16 at 15:41
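
A note on the traceback and the tuple mentioned in the comments: the `query % args` line in the traceback suggests the failure is a mismatch between placeholders and values (four `%s` for five columns), not the Cyrillic text itself. The sketch below reproduces that message with plain string formatting and then shows one possible shape for the insert. `first_or_none` and `save_shop` are hypothetical helper names introduced here for illustration; they are not part of the original script or of MySQLdb, and the code is written for Python 3.

```python
# Illustration only: mimics the `query % args` step shown in the traceback.
args = ("title", "address", "phone", "site", "desc")  # five values

# Four placeholders for five values, as in the question.
bad_query = "VALUES (%s, %s, %s, %s)"
try:
    bad_query % args
    error = None
except TypeError as exc:
    error = str(exc)  # same message as in the traceback

# One placeholder per column. The query stays a plain string;
# the values are passed separately to cursor.execute().
INSERT_SQL = """INSERT INTO vape_shops (title, address, phone, site, description)
                VALUES (%s, %s, %s, %s, %s)"""


def first_or_none(nodes):
    """Hypothetical helper: xpath() returns a list, which may be empty."""
    return nodes[0].strip() if nodes else None


def save_shop(cursor, title, address, phone, site, desc):
    """Hypothetical helper: insert one shop using driver-side parameters."""
    values = tuple(first_or_none(n) for n in (title, address, phone, site, desc))
    cursor.execute(INSERT_SQL, values)
```

Letting the driver substitute the values also means it handles quoting and encoding, so there should be no need to build the SQL string by hand.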

1 Answer

`%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0` is the percent-encoded UTF-8 form of Москва, so that part looks OK. But you also need to make sure that utf8 is used for the connection to MySQL, and that the target column is declared `CHARACTER SET utf8`.

More details and Python-specifics
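
As a rough sketch of the two checks above, assuming Python 3 and the table name `vape_shops` from the question: the first part decodes the percent-encoded city name, and the rest shows connection arguments and a statement that could be used to convert the table to utf8. Note that the `connect()` call in the question already passes `charset='utf8'`. `ensure_utf8` is a hypothetical helper name, and whether converting the whole table is appropriate depends on the existing data.

```python
from urllib.parse import unquote  # on Python 2: from urllib import unquote

# Percent-decoding the query-string value from the question's URL.
city = unquote("%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0")

# Keyword arguments for MySQLdb.connect(); charset and use_unicode
# ask the driver to talk utf8 and return unicode strings.
CONNECT_KWARGS = dict(
    host="localhost",
    user="root",
    passwd="***",
    db="vape_map",
    charset="utf8",
    use_unicode=True,
)

# Converts every text column of the table to utf8 (assumed table name).
ALTER_SQL = "ALTER TABLE vape_shops CONVERT TO CHARACTER SET utf8"


def ensure_utf8(connection):
    """Hypothetical helper: convert the table, then return its definition."""
    cursor = connection.cursor()
    cursor.execute(ALTER_SQL)
    cursor.execute("SHOW CREATE TABLE vape_shops")
    return cursor.fetchone()
```

With a live connection, this would be used as `ensure_utf8(MySQLdb.connect(**CONNECT_KWARGS))`; the returned row contains the `CREATE TABLE` statement, where the character set can be inspected.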

Community
Rick James