How do I represent a string as a number?

Question

I need to represent a string as a number. The string is 8,928,313 characters long and can contain more than just letters of the alphabet, and I have to be able to convert it back efficiently too. My current (too slow) code looks like this:

alpha = 'abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,.?!@()+-=[]/*1234567890^*{}\'"$\\&#;|%<>:`~_'
alphaLeng = len(alpha)
def letterNumber(letters):
    letters = str(letters)
    cof = 1
    nr = 0
    for i in range(len(letters)):
        nr += cof*alpha.find(letters[i])
        cof *= alphaLeng
        print(i,'        ',len(letters))
    return str(nr)

I seriously question why you need a number with about 8 MB of precision. Ultimately, a string and a number are both bytes. A string is the data type you want. — Alex Huszagh, Jun 29 '17 at 19:04
Are you using Python 3 or Python 2? You aren't using *both* I assume? — juanpa.arrivillaga, Jun 29 '17 at 19:11

Alex Huszagh · Answer 1 · 2017-06-29T19:50:34.217

Ok, since other people are giving awful answers, I'm going to step in.

You shouldn't do this.
You shouldn't do this.
An integer and an array of characters are ultimately the same thing: bytes. You can access the values in the same way.
Most number representations cap out at 8 bytes (64 bits). You're looking at 8 MB, or 1 million times the largest integer representation. You shouldn't do this. Really.
You shouldn't do this. Your number will just be a custom, gigantic number type that would be identical under the hood.
If you really want to do this, despite all the reasons above, here's how...

Code

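def lshift(a, b):
    # shift a left by b bytes (8 * b bits)
    return a << (8 * b)

def string_to_int(data):
    # data must be a byte string; in Python 3, encode a str first,
    # e.g. data.encode('utf-8')
    sum_ = 0
    # byte positions counted from the end: the last byte is shifted by 0
    r = range(len(data) - 1, -1, -1)
    for a, b in zip(bytearray(data), r):
        sum_ += lshift(a, b)
    return sum_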

DON'T DO THIS

Explanation

Characters are essentially bytes: they can be encoded in different ways, but ultimately you can treat them within a given encoding as a sequence of bytes. In order to convert them to a number, we can shift each byte left by 8 bits per position in the sequence, creating a unique number. r, the range value, is the position in reverse order: the 4th element from the end needs to go left 24 bits (3*8), etc.

After getting the range and converting our data to 8-bit integers, we can then shift each value and take the sum, giving us our unique identifier. It will be byte-wise identical to the original data (or the same bytes in reverse order), just "as a number". This is entirely futile. Don't do it.
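For readers on Python 3, here is a minimal sketch of the same big-endian conversion using the built-in int.from_bytes and int.to_bytes, which avoids the Python-level loop. The UTF-8 encoding, the names text_to_int and int_to_text, and returning the byte length alongside the number are assumptions made for illustration, not part of the answer above. The length is kept because leading zero bytes would otherwise be lost on the way back.

```python
def text_to_int(text, encoding='utf-8'):
    # Hypothetical helper: encode to bytes, then read the bytes as one
    # big-endian integer (first byte is the most significant).
    data = text.encode(encoding)
    return int.from_bytes(data, byteorder='big'), len(data)


def int_to_text(number, length, encoding='utf-8'):
    # Reverse step: write the integer back out as `length` bytes and decode.
    return number.to_bytes(length, byteorder='big').decode(encoding)


number, length = text_to_int('héllo')
print(int_to_text(number, length))
```

The result matches the shift-and-sum approach byte for byte: for 'ab' it is (97 << 8) + 98.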

Performance

Any performance is going to be outweighed by the fact that you're creating an identical object for no valid reason, but this solution is decently performant.

1,000 elements take ~486 microseconds, 10,000 elements take ~20.5 ms, and 100,000 elements take about 1.5 seconds. It would work, but you shouldn't do it. This suggests it scales as O(n**2), which is likely due to the overhead of reallocating the data each time the integer gets larger. This might take ~4 hours to process all 8e6 elements (14365 seconds, calculated by fitting the lower-order data to ax**2+bx+c). Remember, this is all to get the identical byte representation as the original data.

Futility

Remember, there are ~1e78 to 1e82 atoms in the entire universe, on current estimates. This is ~2^275. Your value will be able to represent 2^71426504, or about 260,000 times as many bits as you need to represent every atom in the universe. You don't need such a number. You never will.
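The arithmetic above can be checked in a few lines. This is only a sketch: the variable names are illustrative, one byte per character is assumed, and the exponent 275 is taken from the answer as written.

```python
n_chars = 8928313           # string length from the question
bits = n_chars * 8          # assuming one byte (8 bits) per character
ratio = bits / 275          # compared with the ~2^275 figure quoted above

# Bits needed to write down 10**82, the upper atom-count estimate quoted above
atom_bits = (10 ** 82).bit_length()

print(bits, round(ratio), atom_bits)
```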

"Characters are essentially 8-bit integers" Mmmm that depends. They could be unicode code points instead. In which case, they are represented by 1-4 bytes. — juanpa.arrivillaga, Jun 29 '17 at 19:33
@juanpa.arrivillaga Which can be represented by 8-bit integers, using UTF-8. I never specified an encoding. Even a code point encoded as UTF-32 would work identically for this sake. The whole idea of the OP is just awful, regardless. — Alex Huszagh, Jun 29 '17 at 19:34
Yes, I understand. I just think that comment is a bit vague and misleading. In other words, it is sticking to the Cish "character is a byte" abstraction. I have no doubt you understand these subtleties, but they may not be clear to others reading your answer. Character encodings are just one of those things that people just figure they can get away with not really understanding. — juanpa.arrivillaga, Jun 29 '17 at 19:36

score 1 · Answer 2 · answered Jun 29 '17 at 19:04


If there are only ASCII characters, you can use the built-in functions ord() and chr().
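A minimal sketch of what this looks like in Python 3, where ord() and chr() map between single characters and their code points. The sample string and variable names are made up for illustration.

```python
text = 'Hi!'

# One integer per character (its Unicode code point)
codes = [ord(ch) for ch in text]

# chr() reverses ord(), so the original string can be rebuilt
restored = ''.join(chr(code) for code in codes)

print(codes)
print(restored)
```

Note that this gives one number per character rather than a single number for the whole string.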


lwshang


`ord` and `chr` work for all characters. But aside from that, I believe this is the most reasonable answer. – juanpa.arrivillaga Jun 29 '17 at 19:12
For Python 3, those functions work for all Unicode characters. But Python 2 only supports ASCII. – lwshang Jun 29 '17 at 19:18
You are right. I'm not that familiar with code support. I just tried `ord('x')` where x is a Chinese word in Python 2 and got an error. But the error is actually from the character literal, which should be `ord(u'x')`. – lwshang Jun 29 '17 at 19:29
Right, because in Python 2, `'x'`, where `x` is some Chinese word, will be some multi-byte sequence, so `'x'` will have length >1. If you prepend `u`, then `u"x"` will be a single-element *unicode code point sequence*, with length 1. – juanpa.arrivillaga Jun 29 '17 at 19:31

score 0 · Answer 3 · answered Jun 29 '17 at 19:10

There are several optimizations you can perform. For example, the find method requires searching through your string for the corresponding letter. A dictionary would be faster. Even faster might be (benchmark!) the chr function (if you're not too picky about the letter ordering) and the ord function to reverse the chr. But if you're not picky about ordering, it might be better if you just left-NULL-padded your string and treated it as a big binary number in memory if you don't need to display the value in any particular format.

You might get some speedup by iterating over characters instead of character indices. If you're using Python 2, a large range will be slow since a list needs to be generated (use xrange instead for Python 2); Python 3's range is lazy, so it's better.

Your print function is going to slow down output a fair bit, especially if you're outputting to a tty.

A big number library may also buy you speed-up: Handling big numbers in code

Thank you, I will try the dictionary. The print is just there for debugging purposes and will be removed shortly. And using Python 3.6, are there any other optimizations possible? — Jean Du Plessis, Jun 29 '17 at 19:41

score 0 · Answer 4 · answered Jun 29 '17 at 19:19

Your alpha.find() function needs to iterate through alpha on each loop.

You can probably speed things up by using a dict, as dictionary lookups are O(1):

alpha = 'abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,.?!@()+-=[]/*1234567890^*{}\'"$\\&#;|%<>:`~_'

alpha_dict = { letter: index for index, letter in enumerate(alpha)}
print(alpha.find('$'))
# 83
print(alpha_dict['$'])
# 83
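Building on the dictionary above, here is a hedged sketch of a full round trip. It keeps the question's scheme (the first character is the lowest digit) and iterates over characters directly. The names letters_to_number and number_to_letters are hypothetical, and the length is passed to the decoder as an assumption, because trailing 'a' characters (index 0) add nothing to the number and would otherwise be dropped.

```python
alpha = 'abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,.?!@()+-=[]/*1234567890^*{}\'"$\\&#;|%<>:`~_'
base = len(alpha)
alpha_dict = {letter: index for index, letter in enumerate(alpha)}


def letters_to_number(letters):
    # Same layout as the question: first character is the lowest digit.
    nr = 0
    cof = 1
    for ch in letters:
        nr += cof * alpha_dict[ch]   # O(1) lookup instead of alpha.find()
        cof *= base
    return nr


def number_to_letters(nr, length):
    # Peel off one base-`base` digit per character, lowest digit first.
    chars = []
    for _ in range(length):
        nr, digit = divmod(nr, base)
        chars.append(alpha[digit])
    return ''.join(chars)


text = 'Hello, world!'
number = letters_to_number(text)
print(number_to_letters(number, len(text)))
```

This only removes the cost of alpha.find(); the big-integer arithmetic itself still gets slower as the number grows.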

score 0 · Answer 5 · answered Nov 30 '22 at 00:20


Store your strings in an array of distinct values; i.e. a string table. In your dataset, use a reference number. A reference number of n corresponds to the nth element of the string table array.
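A minimal sketch of such a string table, assuming 0-based reference numbers. The StringTable class and its add and get methods are hypothetical names chosen for illustration, not from any library.

```python
class StringTable:
    """Stores each distinct string once and hands out integer references."""

    def __init__(self):
        self._strings = []   # reference number -> string
        self._index = {}     # string -> reference number

    def add(self, s):
        # Reuse the existing reference if the string is already stored.
        if s not in self._index:
            self._index[s] = len(self._strings)
            self._strings.append(s)
        return self._index[s]

    def get(self, n):
        # Reference number n is the nth element of the table (0-based here).
        return self._strings[n]


table = StringTable()
ref = table.add('some very long string')
print(ref, table.get(ref))
```

The dataset then stores only the small integer ref, and the original string is recovered with a list lookup.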


Dan Kranz


