2

I'm trying to create a news app for a school project where I pull information from the RSS feeds of my local newspapers, in order to combine multiple newspapers into one.

I'm running into problems when I try to insert the collected data into my MySQL database.

When I simply print my data (example: print urlnzz.entries[0].description), there is no problem with German characters such as ü ä ö é à.

When I try to insert the data into the MySQL database, however, I get "UnicodeEncodeError: 'ascii' codec can't encode character..". The weird thing is that this only happens for .title and .description, not for .category (even though there are also ü etc. in there).

I've been looking for an answer for quite some time now. I changed the encoding of the variables with

t = urlbernerz.entries[i].title


print t.encode('utf-8')

changed the charset to utf-8 when I connect to the database, and even tried Python's try/except statement, yet nothing seems to work.

I've checked the type of each entry with type(u['entries'].title) and they are all unicode; now I need to encode them in a way that lets me put them into my MySQL database.

The RSS websites state that the feeds are already encoded as utf-8, and even though I explicitly tell Python to encode as utf-8 as well, it still gives me the error: 'ascii' codec can't encode character u'\xf6'

I've tried many answers on this subject already, such as using str() or chardet, but nothing seems to work. Here's my code:

import MySQLdb
import feedparser
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

db = MySQLdb.connect(host="127.0.0.1", 
                     user="root",
                      passwd="",
                      db="FeedStuff",
                     charset='UTF8')
db.charset="utf8"
cur = db.cursor()




urllistnzz =['international', 'wirtschaft', 'sport']
urllistbernerz =['kultur', 'wissen', 'leben']


for u in range (len(urllistbernerz)):
    urlbernerz = feedparser.parse('http://www.bernerzeitung.ch/'+urllistbernerz[u]+'/rss.html')
    k = len(urlbernerz['entries'])
    for i in range (k):
        cur.execute("INSERT INTO articles (title, description, date, category, link, source) VALUES (' "+ str(urlbernerz.entries[i].title)+"  ', ' " + str(urlbernerz.entries[i].description)+ " ', ' " + urlbernerz.entries[i].published + " ', ' " + urlbernerz.entries[i].category + " ', ' " + urlbernerz.entries[i].link + " ',' Berner Zeitung')")

for a in range (len(urllistnzz)):
    urlnzz = feedparser.parse('http://www.nzz.ch/'+urllistnzz[a]+'.rss')
    k = len(urlnzz['entries'])
    for i in range (k):
        cur.execute("INSERT INTO articles (title, description, date, category, link, source) VALUES (' "+str(urlnzz.entries[i].title)+" ', ' " + str(urlnzz.entries[i].description)+ " ', ' " + urlnzz.entries[i].published + " ', ' " + urlnzz.entries[i].category + " ', ' " + urlnzz.entries[i].link + " ', 'NZZ')")



db.commit()

cur.close()
db.close()
Alastair McCormack
Sascha
  • unrelated: don't hardcode the encoding of outside environment (terminal) inside your script, print Unicode instead: `print t` – jfs Sep 23 '15 at 22:38
  • have you tried the `use_unicode=True` connect() parameter? Again, don't encode, pass a Unicode string -- let the db driver encode using the correct encoding (specified via the `charset` parameter earlier). – jfs Sep 23 '15 at 22:39
  • unrelated: don't use string formatting to insert sql values, use parametrized queries instead. – jfs Oct 20 '15 at 10:22

3 Answers

1

It is possible that the text from the RSS feeds contains characters in other encodings. First, you can try different encodings in nested try/except blocks. Second, you can add 'ignore' to the encode calls, like:

try:
    s = raw_s.encode('utf-8', 'ignore')
except UnicodeEncodeError:
    try:
        s = raw_s.encode('latin-1', 'ignore')
    except UnicodeEncodeError:
        print raw_s

Hope this helps.
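
One caveat about the 'ignore' handler, shown as a small sketch: it never raises, so the except branches above are not reached, and any character the target codec cannot represent is dropped silently. The sample string is made up for illustration, and the snippet is written to behave the same on Python 2 and 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; any string with non-ASCII characters behaves alike.
title = u"Z\xfcrcher B\xf6rse"   # "Zürcher Börse"

# UTF-8 can represent every character in this text, so nothing is dropped.
as_utf8 = title.encode("utf-8", "ignore")

# ASCII cannot represent ü or ö; 'ignore' removes them without raising.
as_ascii = title.encode("ascii", "ignore")

# 'replace' leaves a visible "?" marker instead of dropping the character.
as_ascii_marked = title.encode("ascii", "replace")
```

With 'ignore' and an ASCII target the umlauts simply vanish from the stored text, which is rarely what a news archive wants.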

haraprasadj
0

Assuming cur.execute() expects a utf-8 encoded string: you need to encode it as utf-8 explicitly when you pass it to MySQL. Just calling str() will attempt to encode it as ascii, which fails and produces your error:

   cur.execute("INSERT INTO articles (title, description, date, \
   category, link, source) VALUES ('"+ \
   urlnzz.entries[i].title.encode('utf-8') +" ', ' " + \
   urlnzz.entries[i].description.encode('utf-8') + " ', ' " +  \
   urlnzz.entries[i].published + " ', ' " +  \
   urlnzz.entries[i].category + " ', ' " + urlnzz.entries[i].link + " ', 'NZZ')")

Being a unicode object is something distinct from being a str in utf-8 encoding. Calling encode('utf-8') on a unicode object produces a utf-8 encoded str (assuming Python 2).

proycon
  • This is wrong. You should pass Unicode strings to `.execute()`. The driver will encode where necessary: http://stackoverflow.com/a/6203782/1554386 – Alastair McCormack Sep 24 '15 at 10:08
0

The major issue is that you're calling str() on Unicode objects. Depending on many factors, this may result in Python trying to encode the Unicode into ASCII, which is not possible with non-ASCII chars.
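
To make the failure concrete, here is a minimal sketch. In Python 2, str() on a unicode object encodes with the default codec, which is normally ASCII; the explicit .encode("ascii") below reproduces that step so the snippet behaves the same on Python 2 and 3. The sample value and the helper name are invented for illustration.

```python
# -*- coding: utf-8 -*-
text = u"K\xf6niz"   # hypothetical feed title containing ö (u'\xf6')

def to_ascii_or_error(value):
    """Mimic what Python 2's str(value) does implicitly for unicode input."""
    try:
        return value.encode("ascii")
    except UnicodeEncodeError as exc:
        return "failed: %s" % exc.reason

result = to_ascii_or_error(text)          # the ASCII codec rejects u'\xf6'
utf8_bytes = text.encode("utf-8")         # an explicit UTF-8 encode succeeds
round_trip = utf8_bytes.decode("utf-8")   # decoding returns the original text
```

This also suggests why .category did not fail in the question's code: str() was only applied to .title and .description.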

You should try to keep Unicode objects as Unicode objects for as long as possible in your code and only convert when it's totally necessary. Fortunately, the MySQL driver is Unicode compliant, so you can pass it Unicode strings and it will encode internally. The only thing you need to do is to tell the driver to use UTF-8. Feedparser is also Unicode compliant and is decoding the rss feed automatically to Unicode strings (strings without encoding).

There are also some parts of your code that would benefit from Python's built-in features, such as `for each in something:`, `str.format()`, and triple quotes (""") for long pieces of text.

Pulling this all together looks like:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import MySQLdb
import feedparser

db = MySQLdb.connect(host="127.0.0.1",
                     user="root",
                     passwd="",
                     db="FeedStuff",
                     charset='UTF8')

urllistnzz =['international', 'wirtschaft', 'sport']
urllistbernerz =['kultur', 'wissen', 'leben']

cur = db.cursor()

for uri in urllistbernerz:
    urlbernerz = feedparser.parse('http://www.bernerzeitung.ch/{uri}/rss.html'.format(uri=uri))

    for entry in urlbernerz.entries:
        insert_sql = u"""INSERT INTO articles (title, description, date, category,
                        link, source) VALUES ("{e.title}", "{e.description}",
                        "{e.published}", "{e.category}", "{e.link}", "Berner Zeitung")
                        """.format(e=entry)

        cur.execute(insert_sql)

for uri in urllistnzz:
    urlnzz = feedparser.parse('http://www.nzz.ch/{uri}.rss'.format(uri=uri))

    for entry in urlnzz.entries:
        insert_sql = u"""INSERT INTO articles (title, description, date, category,
                        link, source) VALUES ("{e.title}", "{e.description}",
                        "{e.published}", "{e.category}", "{e.link}", "NZZ")
                        """.format(e=entry)

        cur.execute(insert_sql)

db.commit()

cur.close()
db.close()
Alastair McCormack
  • This worked! thanks a lot, i'll have to figure out exactly what you changed with the "uri" and .format(uri=uri) because i need to document both the coding and the theoretical background to it in my school work, so i'll do some research now :) – Sascha Sep 24 '15 at 16:31
  • hey, i just had to start using this, and it turns out that the solution you gave me doesn't give me any errors anymore, but it also doesn't show me all the articles i want. it also confuses things such as the link and messes a lot of things up, now that i start to use this in further code... are you sure this is supposed to work? – Sascha Oct 15 '15 at 21:53
  • Yes, this code is supposed to work. You're going to have to be more specific about what isn't working and make sure it's not because your 3rd party website have changed. – Alastair McCormack Oct 15 '15 at 22:28
  • I added some more detail in the question above, it was too big for a comment. even when i put the counter directly in to the code which you suggested, the two numbers don't match, and the mixing of information that i get in my db is really strange. thanks for your time btw :D – Sascha Oct 15 '15 at 23:32
  • When you have this kind of problem, you ought to try to debug your code. Ask yourself "How could I end up with articles attributed to the wrong source?". I quickly found that I left a typo in my code - a second iteration of `for entry in urlbernerz.entries:` instead of `for entry in urlnzz.entries:`. The code above is now fixed. It'd be wise to understand the `for x in iterable` syntax – Alastair McCormack Oct 16 '15 at 09:54
  • don't use string formatting to insert sql values, use parametrized queries instead. – jfs Oct 20 '15 at 10:23
  • I agree @J.F.Sebastian but I wanted to introduce the OP to some Python basics. I'll update the code... – Alastair McCormack Oct 20 '15 at 12:57
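
The parametrized-query advice in the comments can be sketched as follows. This is an illustration under assumptions, not the answer's updated code: it uses the standard library's `sqlite3` with an in-memory database so that it runs without a server, and the sample rows are invented. `sqlite3` uses `?` placeholders; MySQLdb uses `%s` placeholders in the same positions, with the values again passed as a separate tuple.

```python
# -*- coding: utf-8 -*-
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
db = sqlite3.connect(":memory:")
cur = db.cursor()
cur.execute("""CREATE TABLE articles
               (title TEXT, description TEXT, date TEXT,
                category TEXT, link TEXT, source TEXT)""")

# Invented entries; note the quote characters and umlauts in the text.
entries = [
    (u'Z\xfcrich "gewinnt"', u"It's a test", u"2015-09-23", u"sport",
     u"http://example.com/1", u"NZZ"),
    (u"K\xf6niz", u"B\xe4rn", u"2015-09-24", u"kultur",
     u"http://example.com/2", u"Berner Zeitung"),
]

insert_sql = """INSERT INTO articles
                (title, description, date, category, link, source)
                VALUES (?, ?, ?, ?, ?, ?)"""

# The driver binds each value itself; no manual quoting or encoding is needed.
for row in entries:
    cur.execute(insert_sql, row)
db.commit()

cur.execute("SELECT title, source FROM articles ORDER BY date")
stored = cur.fetchall()
db.close()
```

Because the values never become part of the SQL text, a title containing quotes cannot break the statement or shift values into the wrong column, and Unicode strings are passed through unchanged.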