0

I'm trying to store some fields scraped from a webpage in a MySQL table. The script I've created parses the data and stores it in the table. However, because the username is non-English, the table stores the name as ????????? ????????? instead of Αθανάσιος Σουλιώτης.

The script I've tried:

import mysql.connector
import requests
from bs4 import BeautifulSoup

link = 'https://stackoverflow.com/questions/67941060/web-scraper-update'

mydb = mysql.connector.connect(
  host="localhost",
  user="root",
  passwd = "",
  database="mydatabase",
  charset='utf8',
  use_unicode=True
)

mycursor = mydb.cursor()

mycursor.execute("DROP TABLE if exists webdata")
mycursor.execute("CREATE TABLE if not exists webdata (title VARCHAR(255), username VARCHAR(255), reputation VARCHAR(255))")

response = requests.get(link)
soup = BeautifulSoup(response.text,"lxml")
post_title = soup.select_one("h1[itemprop='name'] > a").get_text(strip=True)
username = soup.select_one(".user-details > a").get_text(strip=True)
reputation = soup.select_one("span.reputation-score").get_text(strip=True)

print((post_title,username,reputation))

mycursor.execute(
    "INSERT INTO webdata (title,username,reputation) VALUES (%s,%s,%s)",
    (post_title,username,reputation)
)

mydb.commit()
mydb.close()

This is how the output is printed in the console:

('Web scraper update', 'Αθανάσιος Σουλιώτης', '13')

The database stores the output as:

'Web scraper update', '????????? ?????????', '13'

How can I store a non-English name in a MySQL table correctly?

SMTH
  • https://stackoverflow.com/questions/3008918/how-to-store-non-english-characters I hope this helps. – Jun 12 '21 at 06:49

2 Answers

1

Please read this and try again.

I added three new lines, each marked with a comment.

mydb = mysql.connector.connect(
  host="localhost",
  user="root",
  passwd = "",
  database="mydatabase",
  charset='utf8',
  use_unicode=True
)

mycursor = mydb.cursor()

# 1) add the line below
mycursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % 'mydatabase')

mycursor.execute("DROP TABLE if exists webdata")
mycursor.execute("CREATE TABLE if not exists webdata (title VARCHAR(255), username VARCHAR(255), reputation VARCHAR(255))")

mycursor.execute('SET CHARACTER SET utf8;')             # <--- add this line   2)
mycursor.execute('SET character_set_connection=utf8;')  # <--- add this line   3)

response = requests.get(link)
soup = BeautifulSoup(response.text,"lxml")

I have already tested it.

Please leave a comment after you have checked it.
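For anyone wondering how to fold the three statements into the script from the question, here is a minimal sketch, not a definitive implementation. The helper names `utf8_statements` and `apply_utf8_settings` and the `RecordingCursor` stand-in are hypothetical, introduced only for illustration. The helper accepts any object with an `execute()` method, so it can be dry-run without a MySQL server.

```python
def utf8_statements(database, charset="utf8", collation="utf8_unicode_ci"):
    """Build the three statements from the answer, in execution order.

    The defaults mirror the answer. MySQL's "utf8" stores at most 3 bytes
    per character; "utf8mb4" / "utf8mb4_unicode_ci" could be passed instead
    if 4-byte characters are needed.
    """
    return [
        f"ALTER DATABASE `{database}` CHARACTER SET '{charset}' COLLATE '{collation}'",
        f"SET CHARACTER SET {charset}",
        f"SET character_set_connection={charset}",
    ]


def apply_utf8_settings(cursor, database):
    """Run the charset statements on a cursor (anything with .execute())."""
    for statement in utf8_statements(database):
        cursor.execute(statement)


class RecordingCursor:
    """Hypothetical stand-in for a real cursor; it only records statements."""

    def __init__(self):
        self.statements = []

    def execute(self, statement):
        self.statements.append(statement)


# Dry run: show what would be sent to the server.
demo_cursor = RecordingCursor()
apply_utf8_settings(demo_cursor, "mydatabase")
for sent in demo_cursor.statements:
    print(sent)
```

In the question's script, the call would presumably be `apply_utf8_settings(mycursor, "mydatabase")`, placed right after `mycursor = mydb.cursor()` and before the `DROP TABLE` / `CREATE TABLE` statements, so that the newly created table inherits the database's default character set.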

Victory
  • Thanks for your solution @Victory. How can I implement this within my existing script? – SMTH Jun 12 '21 at 07:12
0

The θ symbol (U+03B8) takes two bytes in UTF-8 (0xCE 0xB8). I am not sure, but UTF-16 might solve this (0x03B8).
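The byte values can be checked in plain Python. The sketch below also shows one way the question marks could arise, on the assumption (not confirmed in the question) that the text was converted somewhere along the way to a character set without Greek letters, such as latin-1. It only illustrates the mechanism on the Python side and does not touch MySQL.

```python
name = "Αθανάσιος Σουλιώτης"
theta = "θ"  # U+03B8, GREEK SMALL LETTER THETA

# Encode the single letter in both encodings and show the raw bytes.
utf8_bytes = theta.encode("utf-8")       # two bytes
utf16_bytes = theta.encode("utf-16-be")  # two bytes, big-endian, no BOM
print(utf8_bytes.hex(), utf16_bytes.hex())

# UTF-8 round-trips the whole name without loss.
round_trip = name.encode("utf-8").decode("utf-8")
print(round_trip == name)

# latin-1 has no Greek letters; with errors="replace" each one becomes "?".
lossy = name.encode("latin-1", errors="replace").decode("latin-1")
print(lossy)
```

Since UTF-8 already round-trips the name, switching to UTF-16 does not look necessary; the loss seems more likely to come from a non-Unicode character set on the table, database, or connection.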