
Given a function like:

import six

def convert_to_unicode(text):
  """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
  if six.PY3:
    if isinstance(text, str):
      return text
    elif isinstance(text, bytes):
      return text.decode("utf-8", "ignore")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  elif six.PY2:
    if isinstance(text, str):
      return text.decode("utf-8", "ignore")
    elif isinstance(text, unicode):
      return text
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  else:
    raise ValueError("Not running on Python2 or Python 3?")

Since six handles Python 2 and Python 3 compatibility, would the convert_to_unicode(text) function above be equivalent to just six.text_type(text)? I.e.:

def convert_to_unicode(text):
    return six.text_type(text)

Are there cases that the original convert_to_unicode handles but six.text_type doesn't?

alvas

1 Answer


Since six.text_type is just a reference to the text type (str on Python 3, unicode on Python 2), an equivalent function would be this:

def convert_to_unicode(text):
    return six.text_type(text, encoding='utf8', errors='ignore')

But it doesn't behave the same in the corner cases, e.g. it will just happily convert an integer, so you'd have to put some checks there first.

Also, I don't see why you would want to have errors='ignore'. You say you assume UTF-8. But if this assumption is violated, you are silently deleting data. I would strongly suggest using errors='strict'.
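To make the data-loss point concrete, here is a minimal sketch in plain Python 3 (no six needed). The sample bytes are an assumption chosen for illustration: Latin-1 encoded text that is not valid UTF-8.

```python
# Latin-1 encoded bytes; the trailing 0xE9 is not valid UTF-8 on its own.
raw = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

# errors="ignore" drops the undecodable byte without any warning.
lossy = raw.decode("utf-8", "ignore")

# errors="strict" raises instead, so the broken assumption is noticed.
try:
    raw.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(lossy), strict_failed)
```

With 'ignore' the last character is simply gone, while 'strict' reports the problem at the point where it happens.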

EDIT:

I just realised this doesn't work if text is already what you want. Also, it happily raises a TypeError for any non-string input. So how about this:

def convert_to_unicode(text):
    if isinstance(text, six.text_type):
        return text
    return six.text_type(text, encoding='utf8', errors='ignore')

The only corner case not covered here is that of the Python version being neither 2 nor 3. And I still think you should use errors='strict'.
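A quick way to check the three cases is sketched below. It assumes Python 3 and uses str directly as a stand-in for six.text_type (which is str there), so it runs without six installed; the sample inputs are made up for illustration.

```python
def convert_to_unicode(text):
    # Stand-in for the version above: on Python 3, six.text_type is str.
    if isinstance(text, str):
        return text
    return str(text, encoding="utf8", errors="ignore")


# Text input is returned unchanged.
already_text = convert_to_unicode("h\u00e9llo")

# Bytes input is decoded as UTF-8.
from_bytes = convert_to_unicode("h\u00e9llo".encode("utf-8"))

# Non-string input is rejected, but with TypeError rather than the
# ValueError raised by the function in the question.
try:
    convert_to_unicode(42)
    rejected_with = None
except TypeError as exc:
    rejected_with = type(exc).__name__

print(already_text, from_bytes, rejected_with)
```

So the one visible difference for callers is the exception type on unsupported input, which matters only if something catches ValueError specifically.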

lenz
  • Note: `six` version 1.12 has `six.ensure_text()` which I suppose does just what you need. – lenz Oct 04 '19 at 07:41
  • @alvas, is there something you were missing from my answer? Like an explanation why this covers all cases from the function you posted? – lenz Oct 18 '19 at 08:32