1

I have this code:

#!/usr/local/bin/python
# -*- coding: utf-8 -*-

import re
import urllib2
import BeautifulSoup
import csv

origin_site = 'http://typo3.nimes.fr/index.php?id=annuaire_assos&theme=0&rech=&num_page='

get_url = re.compile(r"""window.open\('(.*)','','toolbar=0,""", re.DOTALL).findall

pages = range(1,2)

for page_no in pages:
    req = ('%s%s' % (origin_site, page_no))
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent' : user_agent }
    try:
        urllib2.urlopen(req)
    except urllib2.URLError, e:
        pass 
    else:
        # do something with the page
        doc = urllib2.urlopen(req)
        soup = BeautifulSoup.BeautifulSoup(doc)
        infoblock = soup.findAll('tr', { "class" : "menu2" })
        for item in infoblock:
            assoc_data = []
            soup = BeautifulSoup.BeautifulSoup(str(item))
            for tag in soup.recursiveChildGenerator():
                if isinstance(tag,BeautifulSoup.Tag) and tag.name in ('td'):
                    if tag.string is not None:
                        assoc_name = (tag.string)
                if isinstance(tag,BeautifulSoup.Tag) and tag.name in ('u'):
                    if tag.string is not None:
                        assoc_theme = (tag.string)

            get_onclick = str(soup('a')[0]['onclick']) # get the 'onclick' attribute
            url = get_url(get_onclick)[0]

            try:
                urllib2.urlopen(url)
            except urllib2.URLError, e:
                pass 
            else:
                assoc_page = urllib2.urlopen(url)
                #print assoc_page, url
                soup_page = BeautifulSoup.BeautifulSoup(assoc_page)
                assoc_desc = soup_page.find('table', { "bgcolor" : "#FFFFFF" })
                #print assoc_desc
                get_address = str(soup_page('td', { "class" : "menu2" }))
                soup_address = BeautifulSoup.BeautifulSoup(get_address)
                for tag in soup_address.recursiveChildGenerator():
                    if isinstance(tag,BeautifulSoup.Tag) and tag.name in ('a'):
                        if tag.string is not None:
                            assoc_email = (tag.string)
                assoc_data.append(assoc_theme)
                assoc_data.append(assoc_name)
                assoc_data.append(assoc_email)
                for tag in soup_address.recursiveChildGenerator():
                    if isinstance(tag,BeautifulSoup.Tag) and tag.name in ('td'):
                        if tag.string is not None:
                            if tag.string != ' ':
                                get_string = BeautifulSoup.BeautifulSoup(tag.string)
                                assoc_data.append(get_string)
                                #data.append(get_string)

            c = csv.writer(open("MYFILE.csv", "wb"))
            for item in assoc_data:
                c.writerow(item)

but get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc7' in position 0: ordinal not in range(128)

How do I write French characters to the MYFILE.csv file? And can I improve the code further?

gnat
khinester

3 Answers

3

It looks like the strings you are writing are Unicode (BeautifulSoup returns Unicode text), but the csv module in Python 2 isn't Unicode compatible; it only handles 8-bit strings.

Instead, you have to encode each string as UTF-8 before you write it, e.g.:

c = csv.writer(open("MYFILE.csv", "wb"))
for item in assoc_data:
    # Ensure item is an object and not an empty unicode string
    if item and item != u'':
        c.writerow([item.encode("UTF-8")])
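
The comments below mention that `assoc_data` can contain None and BeautifulSoup objects as well as Unicode strings. A minimal sketch of normalising every cell before writing could look like this. The helper names `to_utf8` and `encode_row` are hypothetical, and treating existing byte strings as already UTF-8 is an assumption.

```python
# -*- coding: utf-8 -*-
import sys

# Pick the text type for the running interpreter (unicode on 2.x, str on 3.x).
if sys.version_info[0] == 2:
    text_type = unicode  # noqa: F821 (only evaluated on Python 2)
else:
    text_type = str


def to_utf8(value):
    """Return value as a UTF-8 byte string; None becomes an empty byte string."""
    if value is None:
        return b""
    if isinstance(value, bytes):
        # Assumption: byte strings are already UTF-8 encoded.
        return value
    # Covers unicode strings, BeautifulSoup strings and numbers.
    return text_type(value).encode("utf-8")


def encode_row(row):
    """Encode every cell of a row (hypothetical helper, not from the thread)."""
    return [to_utf8(cell) for cell in row]
```

On Python 2, `c.writerow(encode_row(assoc_data))` would then write one association per line instead of one value per line. The byte-string cells are meant for the Python 2 csv module only.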
Alastair McCormack
  • i get this error: File "/usr/local/Cellar/python/2.7.2/lib/python2.7/codecs.py", line 691, in write return self.writer.write(data) File "/usr/local/Cellar/python/2.7.2/lib/python2.7/codecs.py", line 351, in write data, consumed = self.encode(object, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 78: ordinal not in range(128) – khinester Oct 14 '12 at 22:30
  • sorry, this is the error now: Traceback (most recent call last): File "nimes_extract.py", line 74, in <module> c.writerow(item.encode("UTF-8")) TypeError: 'NoneType' object is not callable – khinester Oct 14 '12 at 22:35
  • I also made a mistake, you need to make `item` a list. Bear in mind though in your code, you're only writing one item per line. – Alastair McCormack Oct 14 '12 at 22:59
  • The new error you have is because one of your entries is blank/null/None. To work around this, you have to test that `item` is a valid string and is not None. See my 2nd edit – Alastair McCormack Oct 14 '12 at 23:00
  • @khinester - The objects you're adding to `assoc_data` are actually BeautifulSoup objects and not Unicode strings. The final append seems to add an object that returns a None object when the text is empty. I've changed this line to return a Unicode string instead. The next problem is that the normal `if item` doesn't return false on empty Unicode strings so you have to do `if item != u''` instead. See http://pastie.org/5061056 for updates to your code. You may want to convert all assoc_data objects to Unicode before writing to be on the safe side. – Alastair McCormack Oct 15 '12 at 09:28
3

Scroll to the bottom: http://docs.python.org/library/csv.html

Specifically, use this writer:

import codecs
import csv
import cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

Then, instead of

c = csv.writer(open("MYFILE.csv", "wb"))

use

c = UnicodeWriter(open("MYFILE.csv", "wb"))
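
For comparison, here is a sketch assuming Python 3, which this thread does not use. There the csv module accepts text directly, so opening the file with an explicit encoding is enough. The function name `write_csv_utf8`, the sample row and the temporary path are illustrative only.

```python
import csv
import io
import os
import tempfile


def write_csv_utf8(path, rows):
    """Write rows of text cells to path as UTF-8 (hypothetical helper)."""
    # newline="" lets the csv module control line endings itself.
    with io.open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)


# Sample data only; not fetched from the site.
rows = [[u"Rapatri\xe9s", u"Amicale des oraniens"]]
path = os.path.join(tempfile.mkdtemp(), "nimes_assoc.csv")
write_csv_utf8(path, rows)

# Read the file back as bytes to inspect the encoding that was written.
with open(path, "rb") as handle:
    raw = handle.read()
```

No wrapper class is needed in that case, because encoding happens in the file object rather than in the csv writer.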
Skylar Saveland
3

The issue was that I was not handling Unicode correctly. Here is the latest code:

#!/usr/local/bin/python
# -*- coding: utf-8 -*-

import urllib2
import BeautifulSoup
import csv

origin_site = 'http://typo3.nimes.fr/index.php?id=annuaire_assos&theme=0&rech=&num_page='

pages = range(1,21)

assoc_table = []

for page_no in pages:
    print page_no
    req = ('%s%s' % (origin_site, page_no))
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent' : user_agent }
    try:
        doc = urllib2.urlopen(req)
    except urllib2.URLError, e:
        pass 
    else:
        # do something with the page    
        soup = BeautifulSoup.BeautifulSoup(doc)
        for row in soup.findAll('tr', { "class" : "menu2" }):
            assoc_data = []
            item = row.renderContents()
            soup = BeautifulSoup.BeautifulSoup(item)
            # we get the Thème
            for assoc_theme in soup.findAll('u'):
                assoc_data.append(assoc_theme.renderContents())
            # we get the Nom de l'association
            for assoc_name in soup.findAll('td', { "width": "70%"}):
                assoc_data.append(assoc_name.renderContents())
            # we list all the links to the individual pages
            for i in soup.findAll('a', {'href':'#'}):
                if 'associations' in i.attrMap['onclick']:
                    req = i.attrMap['onclick'].split('\'')[1]
                    try:
                        doc = urllib2.urlopen(req)
                    except urllib2.URLError, e:
                        pass
                    else:
                        soup = BeautifulSoup.BeautifulSoup(doc)
                        emails = []
                        web_sites = []
                        for tag in soup.recursiveChildGenerator():
                            if isinstance(tag,BeautifulSoup.Tag) and tag.name in ('a'):
                                assoc_link = (tag.string)
                                if '@' in str(assoc_link):
                                    print assoc_link
                                    emails.append(assoc_link)
                        if emails != []:
                            assoc_data.append(emails[0])
                        else:
                            assoc_data.append('pas du email')
                        for tag in soup.recursiveChildGenerator():
                            if isinstance(tag,BeautifulSoup.Tag) and tag.name in ('a'):
                                assoc_link = (tag.string)
                                if 'http' in str(assoc_link):
                                    web_sites.append(assoc_link)
                            #
                        if web_sites != []:
                            assoc_data.append(web_sites[0])
                        else:
                            assoc_data.append('pas du site web')
                        assoc_addr = [] 
                        assoc_cont = soup.findAll('td', { "width" : "49%", "class": "menu2" })
                        for i in assoc_cont:
                            assoc_addr.append(i.renderContents())
                        assoc_tels = []
                        for addr in assoc_addr:
                            assoc_data.append(addr)
                        assoc_tel = soup.findAll('td', { "width" : "45%", "class": "menu2" })
                        for i in assoc_tel:
                            assoc_tels.append(i.renderContents())
                        assoc_data.append(assoc_tels[0])
                        print assoc_tels[0]
            assoc_table.append(assoc_data)
            print assoc_data
print assoc_table
c = csv.writer(open("nimes_assoc.csv", "wb"))
for item in assoc_table:
    #print item
    c.writerow(item)

Thanks for all your help, and for the help from the tutor@python.org mailing list.
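
The code above pulls the detail-page URL out of the `onclick` attribute with `split('\'')[1]`, which raises an IndexError if the attribute has no quotes. A sketch of a more defensive alternative follows, assuming the attribute looks like `window.open('URL','','toolbar=0,...')` as the regex in the question suggests. The helper name `url_from_onclick` and the example URL are hypothetical.

```python
import re

# Capture everything between the first pair of single quotes after window.open(
_WINDOW_OPEN = re.compile(r"window\.open\('([^']*)'")


def url_from_onclick(onclick):
    """Return the URL passed to window.open(), or None if there is none."""
    match = _WINDOW_OPEN.search(onclick or "")
    return match.group(1) if match else None
```

Returning None lets the caller skip rows without a usable link instead of failing partway through the scrape.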

khinester
  • I'm glad you've resolved it in your own way but just be careful about the approach you've taken. The reason why it's working is because you're actually avoiding Unicode! renderContents() is returning the – Alastair McCormack Oct 15 '12 at 19:14
  • can you elaborate? i see that when adding a print on the type for assoc_data.append(assoc_theme.renderContents()) i get a <type 'str'> but when i print assoc_table, the list returned have entries like ['Rapatri\xc3\xa9s', 'Amicale des oraniens du....] so the assoc_data.append(assoc_theme.renderContents()) adds the assoc_theme with the correct encoding! – khinester Oct 15 '12 at 21:25
  • doh! Sorry, I started writing that comment but didn't mean to submit it as I realised I was nit picking. BeautifulSoup is properly detecting the char set of the original document (iso-8859-1) and returning UTF8 encoded strings which is fine for your needs. – Alastair McCormack Oct 15 '12 at 22:00
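
The re-encoding described in the last comment can be reproduced by hand. This sketch only illustrates the charset conversion from ISO-8859-1 to UTF-8, not BeautifulSoup's detection logic, and the sample bytes are made up to match the value quoted above.

```python
# -*- coding: utf-8 -*-
# Bytes as they might arrive from an ISO-8859-1 page (sample value only).
latin1_bytes = b"Rapatri\xe9s"

# Decode with the source charset to get Unicode text ...
text = latin1_bytes.decode("iso-8859-1")

# ... then encode as UTF-8, the form seen in the printed list above.
utf8_bytes = text.encode("utf-8")
```

The single byte `\xe9` becomes the two bytes `\xc3\xa9`, which is why the printed list shows 'Rapatri\xc3\xa9s'.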