1

I am having trouble getting the encoding/decoding working properly with an application that I am developing with the following components:

  1. Python 3.6
  2. BeautifulSoup
  3. Scraped web page using UTF-8
  4. MySQL
  5. json
  6. Lambda

When I get the data into the front-end (Alexa), it includes the unicode characters (e.g., \u00e2\u0080\u0099) in some cases. Any help would be greatly appreciated!!!

Here are code snippets from the entire pipeline:

The original web page is: Checked document.characterSet in Chrome Developer Tools Webpage Character Encoding

I am scraping with this Python/BeautifulSoup code:

from bs4 import BeautifulSoup
import pymysql
    if page_response.status_code == 200:
        page_content = BeautifulSoup(page_response.content, "html.parser")    
        if str(page_content.find(attrs={'id': 'main'})).find(page_test) != -1:
            for table_row in page_content.select("div#page_filling_chart center table tr"):
                cells = table_row.findAll('td')
                if cells:
                    records += 1
                    bo_entry.title = cells[2].text.strip()

Putting data into the database with this:

connection = pymysql.connect(
        host=rds_host,
        user=name,
        password=password,
        db=db_name
        )
    try:
        with connection.cursor() as cursor:
            # UPSERT: https://chartio.com/resources/tutorials/how-to-insert-if-row-does-not-exist-upsert-in-mysql/
            sql = (
                    f"REPLACE INTO weekend_box_office(weekend_date, market, title_id, title,gross,total_gross,rank_order, previous_rank, distributor, distributor_id, change_pct, theaters, per_theater, week_in_release, gross_num, total_gross_num)"
                    f"VALUE(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s);"
                )
            data = (
                    bo_entry.weekend, bo_entry.market, bo_entry.title_id, bo_entry.title, bo_entry.gross, bo_entry.total_gross, 
                    bo_entry.rank, bo_entry.previous_rank, bo_entry.distributor, bo_entry.distributor_id, bo_entry.change_pct, bo_entry.theaters,
                    bo_entry.per_theater, bo_entry.weeks_in_release, bo_entry.gross_num, bo_entry.total_gross_num
                    )
#            print(sql)

The current database collation and character set are set to: Database Character Set

The MySQL Table Collation that the data is stored in is this: Table Character Encoding

I fetch data from the database using this Python 3.6 code:

connection = pymysql.connect(
        host=rds_host,
        user=name,
        password=password,
        db=db_name
        )

        with connection.cursor() as cursor:
            sql = (
                    f"select weekend_date, title_id, title, gross, gross_num, total_gross, total_gross_num, CONCAT(cast(ROUND(gross_num / total_gross_num * 100,1) as CHAR),'%') as weekend_pct, week_in_release "
                    "from weekend_box_office "
                    f"where weekend_date = '{weekend_text}' "
                    f"order by gross_num desc limit {limit_row_no}; "
                )
            try:
                cursor.execute(sql)
                result = cursor.fetchall()              
                for row in result:
                    title = row[2]

This is what it looks like when I place a breakpoint and exam it in the Variable Explorer of Spyder. Variable Explorer

When I return it, it looks like this: Returned as Str

using this code: response_text += ( f"led by {title} pulling in ${SpeechUtils.spoken_human_format(gross_num)}. " ) return response_text

When I return it from Lambda using the json Python library, it looks like this: Output from JSON Module return { 'statusCode': 200, 'body': json.dumps(speak_top5(BoxOffice.get_previous_friday())), 'headers': { 'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*' },

JasonBub
  • 173
  • 3
  • 15
  • NOTE: I was doing some more research and did read this article and found that adding the parameter "ensure_ascii=False" to json.dumps(x, ensure_ascii=False) help. There is still an issue with the accent characters, but it removes the Unicode symbols. – JasonBub Feb 20 '19 at 02:17

1 Answers1

0

Try after changing the mysql connection charset to charset='utf8'.

connection = pymysql.connect(
    host=rds_host,
    user=name,
    password=password,
    db=db_name,
    charset='utf8'
    )

See the detailed information from here

Saji
  • 1,374
  • 1
  • 12
  • 19