1

I got this error message:

Traceback (most recent call last):
  File "make.py", line 48, in <module>
    json.dump(amazon_review, outfile)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 313, in _iterencode_list
    yield buf + _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 173: invalid continuation byte

from this code:

with open('amazon_review.json', 'w') as outfile:
  json.dump(amazon_review, outfile)

I couldn't figure it out. Any help would be great.

BrenBarn
user2372074
  • 3
    I think we'd need to know a little bit about the contents of `amazon_review` before we can help here too much... – mgilson Jun 03 '14 at 05:12
  • Have a look [at this question](http://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte); I won't mark it duplicate _yet_, but I have a feeling this is your problem. – Burhan Khalid Jun 03 '14 at 05:22

3 Answers

1

In Python 2, json.dump treats every byte string (str) in the data as UTF-8, and it has to decode each one to a Unicode string before it can write it out as ASCII-escaped JSON. Your data contains a non-ASCII byte (0xea) that is not valid UTF-8 at that position, so the decode step fails with a UnicodeDecodeError.

If you just want to get past the error, try this:

with open('amazon_review.json', 'w') as outfile:
    try:
        json.dump(amazon_review, outfile)
    except UnicodeDecodeError:
        # Swallows the error from the traceback; the file may be left incomplete.
        pass
Sarwar
  • That works if you don't need the data that has non-ANSI characters in it. But if you do need it, you would have to parse the file and omit that character. – Sarwar Jun 03 '14 at 05:39
  • But yeah, we need a bit more detail about your amazon_review. – Sarwar Jun 03 '14 at 05:40
1

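We probably need to know a bit more about the data you are passing into json.dump, but I know the Python 2 API supports an encoding kwarg that defaults to utf-8. It tells json.dump how to decode any byte strings (str) it finds in the data, so it has to match the encoding your data is actually in.

Have you tried something like this, with the encoding name swapped for whatever your data uses?

with open('amazon_review.json', 'w') as outfile:
    json.dump(amazon_review, outfile, encoding="iso-8859-1")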

Might be worth it to look at this similar issue

Community
Alex
0

You have a byte string somewhere in the structure of amazon_review. You should make sure you write only Unicode strings to structures you intend to serialise with json.dump, because JSON can only represent Unicode-based strings (there is no concept of a byte string to match Python 2's str).
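To track down where the byte string is hiding, something along these lines might help. It is only a sketch: the helper name find_non_ascii_bytes and the sample data are made up for illustration, and it assumes the structure is built from dicts, lists, tuples and strings. It reports only byte strings containing bytes above 0x7F, since pure-ASCII ones do not trigger the error.

```python
def find_non_ascii_bytes(obj, path="amazon_review"):
    """Yield (path, value) for each byte string that holds a non-ASCII byte."""
    if isinstance(obj, bytes):  # in Python 2, bytes is the same type as str
        if any(byte > 0x7F for byte in bytearray(obj)):
            yield path, obj
    elif isinstance(obj, dict):
        for key, value in obj.items():
            for hit in find_non_ascii_bytes(key, "%s[key %r]" % (path, key)):
                yield hit
            for hit in find_non_ascii_bytes(value, "%s[%r]" % (path, key)):
                yield hit
    elif isinstance(obj, (list, tuple)):
        for index, item in enumerate(obj):
            for hit in find_non_ascii_bytes(item, "%s[%d]" % (path, index)):
                yield hit


# Hypothetical stand-in for the real data.
sample = {"reviews": [u"fine", b"cr\xeape"]}
hits = list(find_non_ascii_bytes(sample))
for where, value in hits:
    print("%s -> %r" % (where, value))
```

Each reported path points at a value that needs decoding before the structure goes to json.dump.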

Python can cope with the mistake as long as the byte string contains only ASCII characters, because it's a good guess that whatever encoding that byte string represents is an ASCII superset. But for top-bit-set bytes like 0xEA it can't guess so you will have to tell it by calling .decode('whatever-encoding-it-is-in') on the byte string before passing the result into amazon_review.
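A small illustration of the difference, using made-up sample strings: an ASCII-only byte string decodes fine under the default UTF-8 assumption, while one containing 0xEA does not, because in UTF-8 that byte announces a multi-byte sequence and the byte after it is not a valid continuation.

```python
ascii_only = b"great product"    # hypothetical review text, ASCII only
with_high_byte = b"cr\xeape"     # hypothetical text with a raw 0xEA byte

# ASCII is a subset of UTF-8, so this decode succeeds.
decoded_ascii = ascii_only.decode("utf-8")

# 0xEA followed by "p" is not valid UTF-8, so this decode fails.
try:
    with_high_byte.decode("utf-8")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

print(decoded_ascii)
print(failure)
```

The exception is the same kind as the one in the traceback, only with a different byte position.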

If in your data 0xEA is supposed to represent U+00EA e-with-circumflex ê then the encoding to try would be either 'iso-8859-1' or 'windows-1252'.
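As a rough sketch of that fix, assuming the data really is ISO-8859-1 and is made up of dicts, lists, tuples and strings, you could decode every byte string in the structure before dumping it. The helper name decode_bytes and the sample data are invented for the example.

```python
import json


def decode_bytes(obj, encoding="iso-8859-1"):
    """Return a copy of obj with every byte string decoded to Unicode."""
    if isinstance(obj, bytes):  # in Python 2, bytes is the same type as str
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return {decode_bytes(key, encoding): decode_bytes(value, encoding)
                for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [decode_bytes(item, encoding) for item in obj]
    return obj


# Hypothetical stand-in for the real data.
sample = {"reviews": [b"cr\xeape", u"plain text"]}
clean = decode_bytes(sample)

# With only Unicode strings left, serialising no longer has to guess.
encoded = json.dumps(clean)
print(encoded)
```

If the decoded text comes out with the wrong characters, the encoding guess is wrong and it is worth trying 'windows-1252' or checking where the data came from.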

bobince