
I have a model class that looks like the following:

class Address(models.Model):
    # taking length of address/city fields from existing UserProfile model
    address_1 = models.CharField(max_length=128,
                                 blank=False,
                                 null=False)

    address_2 = models.CharField(max_length=128,
                                 blank=True,
                                 null=True)

    address_3 = models.CharField(max_length=128,
                                 blank=True,
                                 null=True)

    unit = models.CharField(max_length=10,
                            blank=True,
                            null=True)

    city = models.CharField(max_length=128,
                            blank=False,
                            null=False)

    state_or_province = models.ForeignKey(StateOrProvince)

    postal_code = models.CharField(max_length=20,
                                   blank=False,
                                   null=False)

    phone = models.CharField(max_length=20,
                             blank=True,
                             null=True)

    is_deleted = models.BooleanField(default=False,
                                     null=False)

    def __unicode__(self):
        return u"{}, {} {}, {}".format(
            self.city, self.state_or_province.postal_abbrev, self.postal_code, self.address_1)

The key part is the __unicode__ method. I have a customer model with a foreign key field to this table, and I am doing the following logging:

log.debug(u'Generated customer [{}]'.format(vars(customer)))

This works fine, but if an address_1 field value contains a non-ASCII character, say

57562 Vån Ness Hwy

the system is throwing the following exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 345: ordinal not in range(128)

I tracked this down to a strange method in django/db/models/base.py:

def __repr__(self):
    try:
        u = six.text_type(self)
    except (UnicodeEncodeError, UnicodeDecodeError):
        u = '[Bad Unicode data]'
    return force_str('<%s: %s>' % (self.__class__.__name__, u))

As you can see, the result of this method is passed to force_str, and the non-ASCII data does not seem to be handled correctly there. Is this a bug? If unicode is being called on my object, shouldn't everything be in unicode?

Nathan Tregillus

3 Answers


According to the docs, when a Python object is passed as an argument to '{}'.format(obj),

A general convention is that an empty format string ("") [within the "{}"] produces the same result as if you had called str() on the value.

This means you're effectively calling str(vars(customer)), and vars(customer) returns a dict.

Calling str() on a dict calls repr() on its keys and values, because otherwise the output would be ambiguous: str(1) == str('1') == '1', but repr(1) == '1' and repr('1') == "'1'" (see Difference between __str__ and __repr__ in Python).
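A small sketch of that behaviour, using a made-up `Tagged` class as a stand-in for a model instance (the class and its return values are assumptions for illustration only). It runs the same way on Python 2 and Python 3:

```python
class Tagged(object):
    """Hypothetical stand-in for a model instance."""

    def __str__(self):
        return "str-form"

    def __repr__(self):
        return "<Tagged: repr-form>"


# Formatting the dict with an empty format spec is equivalent to str(dict),
# and str(dict) renders every key and value with repr(), not str().
rendered = "{}".format({"address": Tagged()})
print(rendered)

# Formatting the object directly uses str() instead.
direct = "{}".format(Tagged())
print(direct)
```

So wrapping an object in a dict (as vars() does) is enough to switch the formatting from `__str__` to `__repr__`.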

Therefore repr() is still being called on your Address, which returns a byte string (str in Python 2).
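To see where the 0xc3 in the traceback comes from, here is a minimal sketch that decodes the UTF-8 bytes of the sample address as ASCII. In Python 2 this decode happens implicitly when a byte string is mixed into a unicode format string; the explicit `.decode("ascii")` call below is an assumption made so the example also runs on Python 3:

```python
# The sample address from the question; \u00e5 is the letter "å".
text = u"57562 V\u00e5n Ness Hwy"

# UTF-8 encodes "å" as the two bytes 0xc3 0xa5.
utf8_bytes = text.encode("utf-8")

failure = None
try:
    # ASCII only covers bytes 0..127, so 0xc3 cannot be decoded.
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    failure = exc

print(failure)
```

The position in the error message is the offset of the first non-ASCII byte within the byte string being decoded, which is why the question reports a much larger offset for the whole formatted dict.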

Now, returning unicode from repr() is not allowed in Python 2 (https://stackoverflow.com/a/3627835/648176), so you'll need to either override __str__() in your model so that it handles encoding to ASCII (Django docs), or do something like:

string_dict = {str(k): str(v) for (k, v) in vars(customer).items()}
log.debug(u'Generated customer [{}]'.format(string_dict))
Fush

Try decoding the non-ASCII characters as UTF-8 with:

def __unicode__(self):
    return u"{}, {} {}, {}".format(
        self.city, self.state_or_province.postal_abbrev, self.postal_code, self.address_1.decode('utf-8'))
juliocesar
  • So I guess I am confused about why __repr__ is getting called. Is this a problem with unicode, or a problem with __repr__ expecting a str object? – Nathan Tregillus Aug 31 '15 at 19:08
  • I think __repr__ is called because this is debugging output; read this for the differences: http://stackoverflow.com/q/1436703/2343488 – juliocesar Aug 31 '15 at 19:32
  • Note that __unicode__ is just the name of the method; the string it returns could be encoded with any charset – juliocesar Aug 31 '15 at 19:46
  • I think this is where my knowledge of Python's unicode object is muddying the waters. You can have a unicode object without encoding it? I thought a unicode string had to be UTF-8. When I used the encode('utf-8') function, the resulting object was a str object, which I thought I was not supposed to return from a __unicode__ method – Nathan Tregillus Aug 31 '15 at 22:50

This is more of a hack than a pretty answer, but I'll still throw my two cents onto the pile. Just subclass the logging.Handler you are using and change its emit method (if that is the one raising the exceptions).

Pros

Very easy to set up. After setup, no changes are required to any model or data.

Cons

There will be no UnicodeErrors, but the log file will contain strange-looking strings starting with a backslash wherever there was a non-ASCII character. For example, 🦄 will turn into '\xf0\x9f\xa6\x84'. Perhaps you could use a script to translate the '\xf0\x9f\xa6\x84' back to unicode inside the log file when needed.
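A translation script along those lines might look like the following sketch for Python 3. The function name `unescape_utf8_runs` is made up for illustration, and it assumes the log only needs its `\xNN` runs restored; other escapes that the hack can produce (such as doubled backslashes) are left untouched:

```python
import re

# One or more consecutive literal "\xNN" escapes, as written by the hack.
_ESCAPED_RUN = re.compile(r'(?:\\x[0-9a-fA-F]{2})+')
_HEX_PAIR = re.compile(r'\\x([0-9a-fA-F]{2})')


def unescape_utf8_runs(line):
    """Replace runs of literal \\xNN escapes with the UTF-8 text they encode."""

    def _decode(match):
        # Rebuild the raw bytes from the hex pairs in this run.
        raw = bytes(int(pair, 16) for pair in _HEX_PAIR.findall(match.group(0)))
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8: keep the escaped text exactly as it was.
            return match.group(0)

    return _ESCAPED_RUN.sub(_decode, line)


restored = unescape_utf8_runs(r'Generated customer \xf0\x9f\xa6\x84')
print(restored)
```

Applied line by line to the log file, this restores the original characters while leaving anything that is not a valid UTF-8 sequence as it was.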

The steps are

1) Make a "custom_logging.py", which you can import to your settings.py

from logging import FileHandler

class Utf8FileHandler(FileHandler):
    """
    This is a hack-around version of logging.FileHandler.

    Prevents errors of the type
    UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f984' in position 150: character maps to <undefined>
    """
    def __init__(self, *args, **kwargs):
        FileHandler.__init__(self, *args, **kwargs)

    def emit(self, record):
        """
        Emit a record.

        If a formatter is specified, it is used to format the record.
        The record is then written to the stream with a trailing newline.  If
        exception information is present, it is formatted using
        traceback.print_exception and appended to the stream.  If the stream
        has an 'encoding' attribute, it is used to determine how to do the
        output to the stream.
        """
        try:
            msg = self.format(record)
            stream = self.stream
            stream.write(msg)
            stream.write(self.terminator)
            self.flush()
        except Exception:
            # The hack.
            try:
                stream.write(str(msg.encode('utf-8'))[2:-1])
                stream.write(self.terminator)
                self.flush()
            # End of the hack.
            except Exception:
                self.handleError(record)

2) In your settings.py, use the custom file handler by setting LOGGING['handlers']['file']['class'] to point to the custom_logging module, like this:

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'verbose': {
            'format': '%(levelname)s %(asctime)s %(module)s %(process)d %(thread)d %(message)s'
        },
    },
    'handlers': {
        'file': {
            'level': 'DEBUG',
            'class': 'config.custom_logging.Utf8FileHandler',
            'filename': secrets['DJANGO_LOG_FILE'],
            'formatter': 'verbose',
        },
    },
    'loggers': {
        'django': {
            'handlers': ['file'],
            'level': 'DEBUG',
            'propagate': True,
        },
    },
}
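As a possible alternative on Python 3 (not part of the hack above), `logging.FileHandler` accepts an `encoding` argument, so opening the log file as UTF-8 may avoid the 'charmap' error without escaping anything. A minimal sketch, where the helper names, logger name, format string, and temporary file are all assumptions for illustration; in a LOGGING dict the equivalent would be an extra `'encoding': 'utf-8'` entry in the handler's settings:

```python
import logging
import os
import tempfile


def build_utf8_file_logger(path, name="utf8_demo"):
    """Return a (logger, handler) pair that writes UTF-8 to `path`."""
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger, handler


def demo():
    """Log one non-ASCII message to a temp file and return the file contents."""
    path = os.path.join(tempfile.mkdtemp(), "demo.log")
    logger, handler = build_utf8_file_logger(path)
    logger.debug("unicorn \U0001f984")
    handler.close()
    logger.removeHandler(handler)
    with open(path, encoding="utf-8") as log_file:
        return log_file.read()


print(demo())
```

With this approach the character is stored as-is in the file, so no post-processing of the log is needed.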
Niko Föhr