1

I am trying to convert strings to integers and sort them based on the integer value. These values should be unique to the string, no other string should be able to produce the same value. And if a string1 is bigger than string2, its integer value should be greater. Ex: since "orange" > "apple", "orange" should have a greater integer value. How can I do this?

I know there are an infinite number of possibilities between just 'a' and 'b' but I am not trying to fit every single possibility into a number. I am just trying to possibly sort, let say 1 million values, not an infinite amount.

I was able to get the values to be unique using the following:

long int order = 0;
for (auto letter : word)
    order = order * 26 + letter - 'a' + 1;
return order;

but this obviously does not work since the value for "apple" will be greater than the value for "z".

This is not a homework assignment or a puzzle, this is something I thought of myself. Your help is appreciated, thank you!

  • 5
    Is this a homework assignment, or an internet puzzle? Or is it just something you thought up? I ask because the question as asked (that is, without conditions) is impossible. There are an unbounded number of possible strings lexicographically between `a` and `b`, so no code value is suitable for `b`. – rici Mar 16 '22 at 04:05
  • I know there are an infinite number of possibilities between just 'a' and 'b' but I am not trying to fit every single possibility into a number. I am just trying to possibly sort, let say 1 million values, not an infinite amount. @rici – Mena Bebawy Mar 16 '22 at 04:18
  • 2
    @mena: then edit your question appropriately. (Even so, it's not enough that you will only need to encode a million values. It's the universe of possible values which needs to be finite.) – rici Mar 16 '22 at 04:25
  • 2
    A string is **already** functionally a number represented in base 256 (assuming a `char` element type). – Karl Knechtel Mar 16 '22 at 04:25
  • @karl: of course. But those numbers do not preserve lexicographical order, which is what OP is looking for. – rici Mar 16 '22 at 04:54
  • 1
    They do for strings of equal length, and for strings of unequal length the longer string is "greater" by this metric. Given these facts it is trivial to write a comparator. – Karl Knechtel Mar 16 '22 at 04:56
  • @karl: quoting the OP: «this obviously does not work since the value for "apple" will be greater than the value for "z".» – rici Mar 16 '22 at 05:48
  • Then I don't actually understand what ordering is desired - or **why**, for that matter. – Karl Knechtel Mar 16 '22 at 06:00
  • I’m not sure I understand what you’re trying to do. A string of characters is a number in base N, where N is the size of the alphabet. You just have to account also for the size of the string. `”z”` is before `”ab”` for it is shorter. – zdf Mar 16 '22 at 06:27
  • 1
    why to heck is this closed as too broad??? the **question contains all the info** that is needed to solve this (constraints included) and even if its homework/assignment the provided code is almost working solution with very good description on what is wrong and only a minor changes are needed to remedy it... Its much much higher quality than questions that should really be closed and are keep open lately ... Voting for reopen – Spektre Mar 16 '22 at 09:40

2 Answers2

3

You are almost there ... just a minor tweaks are needed:

  1. you are multiplying by 26

    however you have letters (a..z) and empty space so you should multiply by 27 instead !!!

  2. Add zeropading

    in order to make starting letter the most significant digit you should zeropad/align the strings to common length... if you are using 32bit integers then max size of string is:

    floor(log27(2^32)) = 6
    floor(32/log2(27)) = 6
    

Here small example:

int lexhash(char *s)
    {
    int i,h;
    for (h=0,i=0;i<6;i++) // process string
        {
        if (s[i]==0) break;
        h*=27;
        h+=s[i]-'a'+1;
        }
    for (;i<6;i++) h*=27; // zeropad missing letters
    return h;
    }

returning these:

  14348907      a
  28697814      b
  43046721      c
 373071582      z
  15470838    abc
 358171551    xyz
  23175774  apple
 224829626 orange

ordered by hash:

  14348907      a
  15470838    abc
  23175774  apple
  28697814      b
  43046721      c
 224829626 orange
 358171551    xyz
 373071582      z

This will handle all lowercase a..z strings up to 6 characters length which is:

26^6 + 26^5 +26^4 + 26^3 + 26^2 + 26^1 = 321272406 possibilities

For more just use bigger bitwidth for the hash. Do not forget to use unsigned type if you use the highest bit of it too (not the case for 32bit)

Spektre
  • 49,595
  • 11
  • 110
  • 380
  • To get round the fixed length issue, you could try with a decimal (256mal?) point after the first letter: a. a.t a.te b.e b.ean b.ee b.roken ... This would mean working in floating point, or some multi-precision float equivalent. I assume GMP has something. – rossum Mar 16 '22 at 18:28
  • @rossum floating point is not a good choice for this because of rounding errors, if you want more then just use biginteger, or just an array of unsigned ints with [ALU](https://stackoverflow.com/a/26603589/2521214) or use base 32 (5 bits per char) instead of 27 to simplify things to just bit operations at the cost of wasted space... anyway even this 6 char limit is bigger than OP wants... (1M < 321M) – Spektre Mar 16 '22 at 19:40
  • That is why I suggested using using a high precision floating point which would be sufficiently accurate for reasonably long words. Perhaps some equivalent of Java's `BigDecimal`. – rossum Mar 16 '22 at 20:29
0

You can use position of char:

std::string s("apple");
int result = 0;
for (size_t i = 0; i < s.size(); ++i)
    result += (s[i] - 'a') * static_cast<int>(i + 1);
return result;

By the way, you are trying to get something very similar to hash function.

teplandr
  • 176
  • 1
  • 9