6

I'm trying to get a double-precision floating-point score from a UTF-8 encoded string object in Python. The idea is to take the first 8 bytes of the string and create a float from them, so that strings ordered by their score are ordered lexicographically according to their first 8 bytes (or possibly their first 63 bits, after forcing them all to be positive to avoid sign errors).

For example:

get_score(u'aaaaaaa') < get_score(u'aaaaaaab') < get_score(u'zzzzzzzz')

I have tried to compute the score in an integer using bit-shift-left and XOR, but I am not sure of how to translate that into a float value. I am also not sure if there is a better way to do this.
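The shift-based integer approach I tried looks roughly like this (a sketch only; the function name and the right-padding choice are illustrative, not settled):

```python
def int_score(s):
    """Pack the first 8 bytes of a UTF-8 encoded string into one
    unsigned integer so that integer order matches byte order."""
    b = s.encode('utf-8')[:8]
    b = b + b'\0' * (8 - len(b))  # right-pad: shorter strings sort first
    score = 0
    for byte in b:
        score = (score << 8) | byte  # shift left one byte, OR in the next
    return score
```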

How should the score for a string be computed so the condition I specified before is met?

Edit: The string object is UTF-8 encoded (as per @Bakuriu's comment).

Juan Carlos Coto
  • Unicode does **not** have "bytes", hence your question is meaningless. You probably mean a specific encoding of a certain unicode string. In this case you must specify the encoding. – Bakuriu Oct 23 '13 at 18:54
  • 64 bits will be impossible because not all `double` values are orderable; even 63 bits is unlikely. Any possibility of living with 56 bits? – Mark Ransom Oct 23 '13 at 20:15
  • Absolutely. My ultimate goal is getting as much data from the string, preserving order, into the float. That way, the float score will give an approximate "absolute" score of the string. Can totally live with that :) – Juan Carlos Coto Oct 23 '13 at 20:40

2 Answers

3

float won't give you 64 bits of precision. Use integers instead.

import struct

def get_score(s):
  # right-pad with NULs so shorter strings sort first; encode for struct
  return struct.unpack('>Q', (s[:8] + u'\0' * 8)[:8].encode('latin-1'))[0]

In Python 3:

import struct

def get_score(s):
  # right-pad with NULs so shorter strings sort first
  return struct.unpack('>Q', (s[:8] + '\0' * 8)[:8].encode('ascii', 'strict'))[0]
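For example, a self-contained Python 3 version (with right-padding, so that e.g. `'ab'` scores below `'b'`; the test strings are mine) behaves like this:

```python
import struct

def get_score(s):
    # pad on the right with NULs so shorter strings sort first
    return struct.unpack('>Q', (s[:8] + '\0' * 8)[:8].encode('ascii'))[0]

assert get_score('aaaaaaa') < get_score('aaaaaaab') < get_score('zzzzzzzz')
```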

EDIT:

For floats, with 6 characters:

import struct

def get_score(s):
  # the '\0\1' prefix keeps the double positive and subnormal
  return struct.unpack('>d', (u'\0\1' + (s[:6] + u'\0' * 6)[:6]).encode('ascii', 'strict'))[0]
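As a quick check of the float variant (restated here with right-padding; the helper name and test strings are mine): the `b'\x00\x01'` prefix makes every result a positive subnormal double, and for positive doubles the bit-pattern order is the same as the float order.

```python
import struct

def float_score(s):
    # mantissa holds the six characters, so float order = byte order
    padded = (s[:6] + '\0' * 6)[:6]
    return struct.unpack('>d', b'\x00\x01' + padded.encode('ascii'))[0]

assert float_score('aaaaaa') < float_score('aaaaab') < float_score('zzzzzz')
```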
Ignacio Vazquez-Abrams
  • `TypeError: 'str' does not support the buffer interface` – Bakuriu Oct 23 '13 at 18:57
  • @Bakuriu: Are you sure you got that with this code? Which version of Python? – Ignacio Vazquez-Abrams Oct 23 '13 at 18:58
  • Yes, I copy-pasted it in python3. Python2 is much more indulgent when mixing unicode and bytes. In fact I believe it's not safe to use this function across different interpreters, since the value may differ between unicode implementations. You should first encode the string in order to get a "canonical" representation. – Bakuriu Oct 23 '13 at 19:00
  • @Ignacio Ok, thanks. However, I need a double-precision float (using this for a Redis ZSet, which uses 64 bit floats as scores). How would I go around getting that? – Juan Carlos Coto Oct 23 '13 at 19:28
  • 1
    @JuanCarlosCoto: First you chop it down to 6 characters since that's all the precision you can get, unless you can restrict the number of characters used to only 98. Then you do a naive base conversion to a number, and then you append the 53 bits in that number to 11 0s. Then you convert *that* to a `float`. – Ignacio Vazquez-Abrams Oct 23 '13 at 20:29
  • @IgnacioVazquez-Abrams What do you mean by "naive base conversion"? Can you provide an example or amend your answer, please? – Juan Carlos Coto Oct 23 '13 at 21:20
  • 1
    @JuanCarlosCoto: Have you decided what 98 characters you're going to use? – Ignacio Vazquez-Abrams Oct 23 '13 at 21:24
  • @IgnacioVazquez-Abrams: I can't restrict the number of characters to be used, so I'll just have to use 6 characters. – Juan Carlos Coto Oct 23 '13 at 21:29
  • Thanks, @IgnacioVazquez-Abrams. Is there some resource you can recommend so I can understand what is going on there? I am kind of lost with the reversing of the indexes and the Unicode characters in the struct packing. Thanks again! – Juan Carlos Coto Oct 23 '13 at 21:39
  • 1
    @JuanCarlosCoto: There's no reference for it since I came up with it on the fly. I recommend you play around with string slicing in the REPL if you want to understand it more. The rest is [here](http://en.wikipedia.org/wiki/Double-precision_floating-point_format). – Ignacio Vazquez-Abrams Oct 23 '13 at 21:41
1

You will need to set up the entire alphabet and do the conversion by hand, since conversions to bases greater than 36 are not built in. To do that you only need to define the complete alphabet to use. If it were an ASCII string, for instance, you would convert the input string to a long in base 256, using the whole ASCII table as the alphabet.

You can find an example of the full functions to do it here: string to base 62 number

Also, you don't need to worry about negative versus positive numbers when doing this: encoding the string consisting of the first character of the alphabet yields the minimum possible number in the representation, i.e. the negative value with the largest absolute value, in your case -2**63, which is the correct value and still compares properly with < and >.
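A minimal sketch of that by-hand base-256 conversion (the function name is mine, and it assumes the input is encodable one byte per character, e.g. latin-1):

```python
def base256_score(s):
    """Interpret the first 8 bytes of s as a big-endian base-256 number,
    shifted so the minimum (all NULs) maps to -2**63."""
    b = s.encode('latin-1')[:8].ljust(8, b'\0')  # right-pad with NULs
    n = 0
    for byte in b:
        n = n * 256 + byte  # accumulate one base-256 digit at a time
    return n - 2**63
```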

Hope it helps!

Sergio Ayestarán
  • So, to make sure I understand: take the first 8 characters of the string and treat them as a number in base 256 and convert them back to base 10? – Juan Carlos Coto Oct 23 '13 at 21:33
  • Yes, but the base is defined by the amount of different characters you have in your input string; if you only use 128 different characters you can use base 128 instead, for instance. If you don't know in advance, the base is probably higher than 256, up to all the possible unicode characters encoded in utf-8 (base 1024 if I remember correctly). – Sergio Ayestarán Oct 23 '13 at 23:35