3

Pardon my ugly newb code, I'm learning. I'm pulling movie data from the OMDB API, but when I write it to CSV I get a UnicodeEncodeError for many films, likely because actor names contain accented characters, for instance. I want to 1) identify which films are problematic, 2) skip them, and/or 3) preferably correct the error. What I have currently just abandons the whole write when an error occurs. I'm looking for a simple fix, since I'm a novice.

import csv
import os
import json
import omdb

movie_list = ['A Good Year', 'A Room with a View', 'Anchorman', 'Amélie', 'Annie Hall', 'Before Sunrise']

data_list = []

textdoc = open('textdoc.txt','w')

for w in movie_list:
    x = omdb.request(t=w, fullplot=True, tomatoes=True, r='json')
    y = x.content
    z = json.loads(y)
    data_list.append([z["Title"], z["Year"], z["Actors"], z["Awards"], z["Director"], z["Genre"], z["Metascore"], z["Plot"], z["Rated"], z["Runtime"], z["Writer"], z["imdbID"], z["imdbRating"], z["imdbVotes"], z["tomatoRating"], z["tomatoReviews"], z["tomatoFresh"], z["tomatoRotten"], z["tomatoConsensus"], z["tomatoUserMeter"], z["tomatoUserRating"], z["tomatoUserReviews"]])

try:
    with open('Films.csv', 'w') as g:
        a = csv.writer(g, delimiter=',')
        a.writerow(["Title", "Year", "Actors", "Awards", "Director", "Genre", "Metascore", "Plot", "Rated", "Runtime", "Writer", "imdbID", "imdbRating", "imdbVotes", "tomatoRating", "tomatoReviews", "tomatoFresh", "tomatoRotten", "tomatoConsensus", "tomatoUserMeter", "tomatoUserRating", "tomatoUserReviews"])
        a.writerows(data_list)
except UnicodeEncodeError:
    print("fail")
A_S00
Kees
  • just a note, if you did `csv_fields = ["Title", "Year", .. etc.]` then your `data_list.append` could be simplified to `data_list.append([z[field] for field in csv_fields])` and the csv headers just `a.writerow(csv_fields)` – Tadhg McDonald-Jensen May 27 '16 at 19:17
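
The simplification suggested in the comment above could look roughly like this. It is only a sketch: the `z` dictionary is a hand-written stand-in for one parsed OMDB response (not real API output), and the field list is shortened for brevity.

```python
# Shortened field list; the real script would name every column it needs.
csv_fields = ["Title", "Year", "Actors", "Director"]

# Hand-written stand-in for one parsed OMDB response (assumption, not API output).
z = {"Title": "Amélie", "Year": "2001", "Actors": "Audrey Tautou",
     "Director": "Jean-Pierre Jeunet", "Plot": "not exported in this sketch"}

data_list = []
# Build one row per film, with values in the same order as the header.
data_list.append([z[field] for field in csv_fields])

# The header row is then simply the field list itself,
# i.e. a.writerow(csv_fields) in the question's code.
header = csv_fields
```

Keeping the column names in one list means the header and the rows cannot drift out of order.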

4 Answers

7

Python 2.x: Instead of `with open("Films.csv", 'w') as g:` you could try using `codecs` to open the CSV output with UTF-8 encoding.

import codecs
with codecs.open('Films.csv', 'w', encoding='UTF-8') as g:
    # rest of code

Python 3.x: try opening g with UTF-8 encoding:

with open('Films.csv', 'w', encoding='UTF-8') as g:
    # rest of code
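
To see why the encoding matters, and to find which entries would trip an ASCII-only writer, here is a small Python 3 sketch. The helper name `non_ascii_titles` is made up for illustration; it is not part of the question's code or of any library. The same check works on any string field, such as actor names.

```python
movie_list = ['A Good Year', 'A Room with a View', 'Anchorman',
              'Amélie', 'Annie Hall', 'Before Sunrise']

def non_ascii_titles(titles):
    """Return the titles that cannot be encoded as ASCII."""
    problematic = []
    for title in titles:
        try:
            title.encode('ascii')
        except UnicodeEncodeError:
            # This is the same error the CSV write runs into.
            problematic.append(title)
    return problematic

flagged = non_ascii_titles(movie_list)
print(flagged)

# The same text encodes without error as UTF-8.
encoded = 'Amélie'.encode('utf-8')
```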
Cory Shay
  • I assume your first example should be using `codecs.open`? – Tadhg McDonald-Jensen May 27 '16 at 18:59
  • Thanks. However I still get the error: 'ascii' codec can't encode character u'\xe9' in position 2: ordinal not in range(128) – Kees May 28 '16 at 03:05
  • @Kees is this occurring with the above code and if so what line? It may be caused by `json.loads` which you should be able to pass `encoding="utf-8"` as a parameter. – Cory Shay May 28 '16 at 03:39
1

Try out `smart_str` from Django:

from django.utils.encoding import smart_str
data_list.append(map(smart_str, [z['element1'], z['element2']]))
a.writerow(map(smart_str, ["Title", "Year", "Actors", "Awards", "Director", "Genre", "Metascore", "Plot", "Rated", "Runtime", "Writer", "imdbID", "imdbRating", "imdbVotes", "tomatoRating", "tomatoReviews", "tomatoFresh", "tomatoRotten", "tomatoConsensus", "tomatoUserMeter", "tomatoUserRating", "tomatoUserReviews"]))
a.writerows(data_list)
minocha
  • `map(lambda x: smart_str(x), ...)` could be replaced with just `map(smart_str, ...)` – Tadhg McDonald-Jensen May 27 '16 at 19:06
  • @TadhgMcDonald-Jensen good catch :) was making it cluttered.. made the suggested changes – minocha May 27 '16 at 19:08
  • @Kees as once suggested to me, Say [Hello to Unicode](https://kos.gd/posts/say-hello-to-unicode/) :) – minocha May 27 '16 at 19:11
  • Thanks, will read up on Unicode. However, oddly I get a syntax error at: `with open('BritAir_100Films.csv', 'w') as g:` – Kees May 28 '16 at 03:10
  • Can you add another `except Exception, e: print e` after your current `except Unicode..` statement and say what the error is? It looks to me like something small is wrong in the code you're running; that line should run perfectly – minocha May 28 '16 at 04:05
  • @minocha Thanks, I've got it working using your code! I was missing a parenthesis (among other things) but ultimately worked. – Kees May 28 '16 at 18:29
  • @minocha, thanks I have. But it says I need 15 rep points before I can affect the publicly displayed score. – Kees May 30 '16 at 04:38
  • @Kees np :) You can still accept my answer with the 'tick' if this solution worked for you. http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work – minocha May 30 '16 at 07:27
0

If using Python 2, the csv module's writer doesn't really support Unicode, but the csv documentation includes an example (the `UnicodeWriter` class) that works around it. An example is in this answer.

If using Python 3, then make the following changes:

y = x.content.decode('utf8')

and

with open('Films.csv', 'w', encoding='utf8', newline='') as g:

With these changes text is decoded to Unicode for processing within the Python script, and encoded back to UTF-8 when written to a file. This is the recommended way to deal with Unicode.

newline='' is the correct way to open a file for csv use. See this answer and the csv docs.

You can remove the try/except as well. It just suppresses useful tracebacks.
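
Putting the Python 3 changes together, a minimal sketch might look like the following. The `raw` bytes are a hand-written stand-in for `x.content` (no real OMDB request is made), the file goes to a temporary directory, and only two columns are written to keep it short.

```python
import csv
import json
import os
import tempfile

# Hand-written stand-in for x.content: UTF-8 encoded JSON bytes (assumption).
raw = '{"Title": "Amélie", "Year": "2001"}'.encode('utf8')

# Decode the bytes to str first, then parse the JSON text.
z = json.loads(raw.decode('utf8'))

csv_fields = ["Title", "Year"]
path = os.path.join(tempfile.mkdtemp(), 'Films.csv')

# Text is encoded back to UTF-8 on write; newline='' lets csv manage line endings.
with open(path, 'w', encoding='utf8', newline='') as g:
    a = csv.writer(g, delimiter=',')
    a.writerow(csv_fields)
    a.writerow([z[field] for field in csv_fields])

# Read the file back to confirm the accented title survived the round trip.
with open(path, encoding='utf8', newline='') as g:
    rows = list(csv.reader(g))
```

There is no try/except here on purpose: if something does go wrong, the full traceback shows the line and the offending character.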

Mark Tolonen
-1

The solution that works for me in Python 2 is to add the following at the beginning of the export procedure:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
Ether