I am working on Python 2/3 compatibility and have run into an issue when working with str and bytes types. Here is an example:

# python 2
x = b"%r" % u'hello' # this returns "u'hello'"

# python 3
x = b"%r" % u'hello' # this returns b"'hello'"

Notice how the extra unicode u appears in the final representation of x in python 2? I need to make my code return the same value in python3 and python2. My code can take in str, bytes, or unicode values.

I can coerce the python 3 value to the python 2 value by doing

# Note: six.text_type comes from the six compatibility library. It is unicode on Python 2 and str on Python 3.
new_data = b"%r" % original_input
if isinstance(original_input, six.text_type) and not new_data.startswith(b"u'"):
    new_data = b"u%s" % new_data

This makes the u'hello' case work correctly but breaks the 'hello' case. This is what happens:

# python 2
x = b"%r" % 'hello' # this returns "'hello'"

# python 3
x = b"%r" % 'hello' # this returns b"'hello'"

The problem is that in Python 3, u'hello' is the same as 'hello', so with my code above, both u'hello' and 'hello' produce the same result as u'hello'.

So I need some kind of way to tell if a python 3 input string explicitly has specified the u in front of the string, and only execute my code above if that case is satisfied.

fooiey
  • I think in Python 3 all strings are Unicode, that's why it no longer uses the `u` prefix. – Barmar Dec 19 '21 at 20:26
  • Yeah I think Barmar is right, if you do `type("")` and `type(u"")` in Python3, both give `str`, but in Python2 they give `str` and `unicode`. So maybe you could go the other way and make sure the `u` doesn't show up in Python2 if that's possible with your requirements. – Henry Woody Dec 19 '21 at 20:28
  • This is like trying to get your code to do different things with `f(1+1)` and `f(2)`. Why are you trying to do this? You probably need to change how you're approaching the underlying goal. – user2357112 Dec 19 '21 at 20:29
  • Right, but because in python 2 passing in an explicit unicode string vs a normal string returns different values, I need some way of differentiating this in python 3. Yeah, I suppose another option is removing the support for passing in explicit unicode strings in python 2. – fooiey Dec 19 '21 at 20:30
  • @user2357112supportsMonica, I have legacy code that in theory can take in all 3 types of data (b'hello', 'hello', and u'hello') and in the process of migrating to python3 want to make sure that it doesn't produce a different kind of output given the exact same inputs. – fooiey Dec 19 '21 at 20:32
  • You should find the code that distinguishes between `str` and `unicode`, and see what the str-specific part does in Python 3 when given a string that contains Unicode. – Barmar Dec 19 '21 at 20:45
  • @fooiey: But you don't *have* 3 types of data. You have 3 ways of writing 2 types of data. You need to figure out where `'asdf'` needs to be bytes and where it needs to be Unicode and handle each case appropriately, not try to invent a third data type - and when I say you need to figure this out, I mean an actual human thinking about things in the process of code migration, not some sort of function logic that would handle it automatically. – user2357112 Dec 19 '21 at 20:53
  • Do you need code that can run in python2 or python3 and produce a specific output, regardless of what the input type is, from the three types 'str', 'unicode', 'bytes' -- some of which are only defined in specific versions of python? – Kenny Ostrom Dec 19 '21 at 21:02
  • The short answer is, *you can't tell*, not inside Python code, anyway. The `u` string prefix in Python 3 is a no-op and is there purely to ease migration of Python 2 code. To illustrate this, type `u"hello" is "hello"` at a Python 3 prompt. You will get `True`. To do what you want you are going to have to parse the source code yourself. Though that should be enough to make it clear that your approach needs work. – BoarGules Dec 19 '21 at 22:25
  • FYI, Python 2 is officially not supported (EOL). I have successfully completely abandoned it. – Keith Dec 20 '21 at 01:22

1 Answer

It's a simple matter of knowing what version of python you are currently executing, and looking at the type of the input. Of course, this is just taking what data you have and producing a consistent output. It's not going to recover syntactic sugar from the "original source code" because that's not the data you have to work with. I'm just going for a consistent output like you asked for when you said, "I need to make my code return the same value in python3 and python2."
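
To make the point about the lost prefix concrete, here is a minimal sketch for Python 3.8+ only. It shows that the two literals are indistinguishable at runtime, and that the prefix survives only in the parsed source text. The helper name literal_prefix is hypothetical, made up for this illustration.

```python
import ast

# At runtime, both spellings produce the same type and the same value.
plain = 'hello'
prefixed = u'hello'
same_type = type(plain) is type(prefixed)
same_value = plain == prefixed

def literal_prefix(source):
    """Return the prefix the parser recorded for a string literal, or None.

    Assumes `source` is a single expression. Python 3.8+ stores 'u' in
    ast.Constant.kind for u-prefixed literals and None otherwise.
    """
    node = ast.parse(source, mode='eval').body
    return getattr(node, 'kind', None)

print(same_type, same_value)
print(literal_prefix("u'hello'"), literal_prefix("'hello'"))
```

The second print only works because the function receives source code as text. Once the literal has been evaluated, a function that receives the resulting object has nothing left to inspect.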

In python2 you'll probably be dealing with str and unicode.
In python3 you'll probably be dealing with bytes and str.

Look at the Python version first, because referencing a type name that doesn't exist in that version (for example, unicode on Python 3) raises a NameError before the check even runs.

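import six

# Build sample inputs from the two string types each version actually has.
if six.PY2:
    samples = ['hello', u'hello']  # str (bytes) and unicode
elif six.PY3:
    samples = ['hello', bytes('hello', 'utf-8')]  # str (text) and bytes
else:
    raise ValueError('python version unknown')

def normalize(message):
    # Return the native str type of the running interpreter.
    if six.PY2:
        # unicode only exists on Python 2, so it is referenced only in this branch.
        if isinstance(message, unicode):
            # Encode explicitly; str(message) would fail on non-ASCII text.
            return message.encode('utf-8')
        elif isinstance(message, str):
            return message
        else:
            raise ValueError('expected string type, got ' + message.__class__.__name__)
    elif six.PY3:
        if isinstance(message, bytes):
            return message.decode('utf-8')
        elif isinstance(message, str):
            return message
        else:
            raise ValueError('expected string type, got ' + message.__class__.__name__)
    else:
        raise ValueError('python version unknown')

for message in samples:
    print(normalize(message))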

This is tested on 2.7.5 and 3.9.2
If you have bytes in python2, it's just an alias for str (https://stackoverflow.com/a/5901825/1766544)
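
A possible variation, as a sketch under assumptions rather than a drop-in replacement: if you would rather get the same text type back on both versions (instead of each version's native str), you can dispatch on the type alone. The names text_type, binary_type, and to_text are hypothetical here; the first two mirror what six provides, but are defined with sys.version_info so the snippet has no dependencies.

```python
import sys

# Resolve the text and binary types once, guarded by the version check,
# so the name `unicode` is only evaluated on Python 2.
if sys.version_info[0] == 2:
    text_type = unicode  # noqa: F821
    binary_type = str
else:
    text_type = str
    binary_type = bytes

def to_text(message, encoding='utf-8'):
    """Return message as text (unicode on Python 2, str on Python 3).

    Assumes binary input is encoded as `encoding` (UTF-8 by default).
    """
    if isinstance(message, binary_type):
        return message.decode(encoding)
    if isinstance(message, text_type):
        return message
    raise TypeError('expected string type, got ' + type(message).__name__)

for sample in (b'hello', u'hello'):
    print(to_text(sample))
```

Note that an unprefixed 'hello' is still binary on Python 2 and text on Python 3. This only makes the result uniform; it does not recover which literal spelling was used.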

Kenny Ostrom