0

I'm trying to parse an xml file using beautifulsoup4. IDE : LICLIPSE Python version: 2.7 xml encoding : utf-8

Sample xml file : http://pastebin.com/RhjvyKDN

Below is the code I used to parse the xml files and write the extracted information to a local mysql database.

from bs4 import BeautifulSoup
import pymysql
import os, os.path

#strips apostrophes from the text and then just adds them at the beginning and end for the query
def apostro(text):

    text= text.replace("'","")
    text= text.replace(",","")
    text = "'"+text+"'"
    return text

#sets up the MYSQL connection
conn = pymysql.connect(host='127.0.0.1',  user='xxxx', passwd='xxxx', db='mysql', port= 3306 )
cur = conn.cursor()

#drop all of the previous values from the database
cur.execute("DELETE FROM db WHERE title is not null")
conn.commit()

#loop through all of the files
for root, _, files in os.walk("C:/usc/xml"):
    for f in files:
        #j is a counter for how many sections we have processed
        j=0
        #fullpath is the location of the file we're parsing
        fullpath = os.path.join(root, f)
        print(fullpath)

        #open file using BeautifulSoup
        soup = BeautifulSoup(open(""+fullpath+""), 'xml')
        sec = soup.find_all("section", {"style" : "-uslm-lc:I80"})
        t = soup.main.title
        t_num = t.num['value']

        #if not clauses are needed in case there is a blank, otherwise an error is thrown
        if not t.heading.text:
            t_head = ''
        else:
            t_head = t.heading.text.encode('ascii', 'ignore').encode("UTF-8")

        for element in sec:
            if not element.num['value']:
                section = ''
            else:
                section = element.num['value'].encode('ascii', 'ignore').encode("UTF-8")

            if not element.heading:
                s_head = ''
            else:
                s_head = element.heading.text.encode('ascii', 'ignore').encode("UTF-8")

            if not element.text:
                s_text = ''
            else:
                s_text = element.text.encode('ascii', 'ignore').encode("UTF-8")

            #inserttest is the sql command that 'cur' executes. counter is printed every time a section is written to let me know the program is still alive
            inserttest = "insert into deadlaws.usc_new (title, t_head, section, s_head, s_text) values (" + t_num + "," + apostro(t_head) + "," + apostro(section) + "," + apostro(s_head) + "," + apostro(s_text) +")"
            j=j+1
            cur.execute( inserttest)
            conn.commit()
            print(fullpath + " " +str(j))

conn.commit()
cur.close()
conn.close()

Everything went well until I noticed that the program ignores the hyphens '-' in the section numbers which makes the entire activity wrong.

I know I have used 'ignore' in the encode statement, but a hyphen '-' is a legitimate character in ascii, right? Shouldn't it be writing the character to the db instead of ignoring it?

I did a lot of reading on SO and elsewhere.

I've tried including from_encoding="utf-8" in the soup statement, 'xmlrefreplace' in the encode() statement and other methods, which have resulted in the below output : it writes this a– (some special unicode character) instead of a hyphen '-' to the database.

Sample output:

The data is huge and I'm afraid there could be other characters like - that are being ignored by the program. It's ok if it ignores special characters from the t_head, s_head and s_text fields as they are text but not the section column.

Any help in resolving this issue would be greatly appreciated.

j-v
  • 43
  • 2
  • 6
  • Why encode to ASCII (ignoring anything that doesn't fit in ASCII), then re-encode to UTF-8? ASCII is a subset of UTF-8 *anyway*. – Martijn Pieters May 26 '14 at 20:16
  • You most likely have EN DASH or EM DASH characters, which are not ASCII characters. – Martijn Pieters May 26 '14 at 20:16
  • [EN DASH](http://codepoints.net/U+2013) encoded to UTF-8 then mis-interpreted as Latin 1 (mojibake) is `–`, [EM DASH](http://codepoints.net/U+2014) misinterpreted is `—` (direction of special quotes differs). – Martijn Pieters May 26 '14 at 20:18
  • @MartijnPieters Thanks for the response. I think the same, because in one of the errors I got during trial and error of tweaking the encoding code, it said Latin-1 encoding error. How can I fix this? – j-v May 26 '14 at 20:20
  • You should be using SQL parameters instead, and have the MySQL driver handle the encoding here. It could also be that your query tool is not using the correct encoding to *read* the data. – Martijn Pieters May 26 '14 at 20:21
  • Regarding the encoding and re-encoding : I honestly am not an expert in this.. I just used it after I read it on the internet and it seemed to work. Hence went on with it. – j-v May 26 '14 at 20:22
  • I have set the database encoding to utf-8 default. Is there something else I have to do – j-v May 26 '14 at 20:23
  • There is a lot of misinformation on Unicode and encoding on the internet. You basically want to leave the decoding and encoding to your libraries here; BeautifulSoup handles parsing the XML to Unicode values, leave the encoding to the MySQL library. – Martijn Pieters May 26 '14 at 20:24
  • You want to use [SQL parameters](http://stackoverflow.com/questions/775296/python-mysql-with-variables), pass in *Unicode*, not encoded values. – Martijn Pieters May 26 '14 at 20:25
  • Also see [Python & MySql: Unicode and Encoding](http://stackoverflow.com/q/8365660) – Martijn Pieters May 26 '14 at 20:25
  • @MartijnPieters adding to the mysql parameters list did the trick for me. Thanks a lot for the above references. You rock! I wish I had enough reputation to upvote your comments. – j-v May 26 '14 at 20:39
  • You don't need to upvote my comments; I've formulated an answer below instead. – Martijn Pieters May 26 '14 at 20:40

1 Answers1

0

Don't encode, the MySQL library is perfectly capable of inserting Unicode text into the database directly. Use SQL parameters, not string interpolation, and specify the character set to use when connecting to the database:

conn = pymysql.connect(host='127.0.0.1', user='xxxx', passwd='xxxx',
                       db='mysql', port=3306,
                       charset='utf8')

Don't encode:

t_head = t.heading.text or ''

for element in sec:
    if not element.num['value']:
        section = ''
    else:
    section = element.num.get('value', '')
    s_head = element.heading.text or ''
    s_text = element.text or ''

    inserttest = "insert into deadlaws.usc_new (title, t_head, section, s_head, s_text) values (?, ?, ?, ?)"
    cur.execute(inserttest, (t_num, t_head, section, s_head, s_text))
Community
  • 1
  • 1
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343