12

The requirement:

We have values in DB like

Chennai
Baroda
Bangalore
New Delhi
São Paulo, Lisboa
San Jose

etc...

So I want to convert these string to a unique short string. For example

Chennai –> xy67kr

San Jose –> iuj73d

basically something similar to URL shortner.

And the algorithm to convert this should be reversible.. i.e when I pass "xy67kr" to a decode function it should give me back "Chennai".

Looking forward for help.

Oliver Charlesworth
  • 267,707
  • 33
  • 569
  • 680
Taher
  • 732
  • 1
  • 9
  • 26
  • Do the strings need to be of a fixed length? – Chetter Hummin Mar 30 '12 at 08:21
  • 1
    If you have a database, then the processing of reversal should be pretty easy... – Oliver Charlesworth Mar 30 '12 at 08:21
  • 1 - Strings are not fixed length. Max length = 200 characters 2 - I want to avoid DB call. That is the reason I want to generate an algorithm. Which can be used in DB to encode the strings. Same algorithm can be used to decode and get real value in my web application – Taher Mar 30 '12 at 08:22
  • 5
    @taher: There is no function that can shorten arbitrary strings in a way that can be reversed (due to the [pigeonhole principle](http://en.wikipedia.org/wiki/Pigeonhole_principle)). Unless you can place heavy constraints one the values of your input strings, you will have to use some kind of lookup mechanism. – Oliver Charlesworth Mar 30 '12 at 08:31
  • 4
    I don't get it.. you have it *in* DB but you don't want to use the DB? – Karoly Horvath Mar 30 '12 at 08:39
  • what you can do is compress the strings with something like huffman coding for example - but this only makes sense if your strings are quite long. In the case of "Chennai", "Baroda", etc. it will make things worse. – Christian Mar 30 '12 at 08:40
  • Agree with @Oli. There probably are constraints though (eg. characters being a-z thereby limiting range or pigeons) – Om Deshmane Mar 30 '12 at 08:47
  • @KarolyHorvath: We have two systems. One which works in backend and is responsible for encoding the string and storing in DB. We have a SOLR indexer which indexes everything and our front end website uses values from index files (also the codes). To get original value from code, I do not want to go back to DB. – Taher Mar 30 '12 at 08:57
  • @Oil Charlesworth: To avoid lookup mechanism only I need this algorith :) – Taher Mar 30 '12 at 09:38
  • 1
    @taher: But this algorithm doesn't exist... – Oliver Charlesworth Mar 30 '12 at 10:04

4 Answers4

5

As other posters have stated, you cannot have a function that shortens arbitrary strings, that's mathematically impossible. But you can create a custom function that works well with your particular set of strings.

An example approach would be to compute the character frequency in the set, then just code the characters with a prefix code such that the most frequent letters are encoded with short prefixes (i.e. Huffman coding.)

The approach above does not take advantage of the fact that in natural language the next character can be quite accurately predicted from previous ones, so you can extend the above algorithm so that instead of encoding the characters independently, it encodes the next character in an n-gram. This of course requires a larger compression table than the simple approach, since you are effectively having a separate code depending on the prefix. For example if 'e' is very frequent after 'th' then 'e' after 'th' is encoded with a very short prefix. If 'e' is very unfrequent after 'ee', then it can be encoded with a very long prefix in this case. The decoding algorithm obviously needs to look at the currently decompressed prefix to check how to decode the next character.

This general approach assumes that the frequencies do not change, or at least change slowly. If your data set changes than you might have to recompute the statistics and re-code the strings.

Rafał Dowgird
  • 43,216
  • 11
  • 77
  • 90
  • I doubt that this will work well for short input data. It also seems that the OP wants a fixed-length encoding, which clearly is impossible. – Oliver Charlesworth Mar 30 '12 at 13:21
  • @OliCharlesworth On the contrary, this kind of statistical encoding works well even for single character strings, save for the fact that even if the resulting code is say, 6 bits, then you still have to send (or save) at least one byte. I agree that fixed-length encoding is impossible. – Rafał Dowgird Mar 30 '12 at 13:48
  • Ok, in my original question I asked that my input strings can be of variable length. So, suppose that I make them of fixed length by applying padding, i.e --> New York [becomes] --> New York!@!!@! or something like that. Is it possible then to shorten them after encoding? – Taher Apr 02 '12 at 06:10
  • @taher Oli was referring to the length of the string *after* encoding. The pigeonhole principle states that the only way to guarantee that the final string is fixed is to restrict the set of input strings (so that its size is no larger than the number of the fixed length strings.) The only practical way to do this (for an arbitrary set) is to use a DB, just like URL shorteners do. The best you can do without a DB is to use a compression algorithm tuned to your data. This can achieve very good compression - no fixed size output though. – Rafał Dowgird Apr 02 '12 at 07:23
  • Thanks guys, I kind of got my answer. – Taher Apr 04 '12 at 09:51
4

See my answer to similar question, and just rewrite it to PHP:

Encoding:

$encoded = base64_encode(gzdeflate("São Paulo, Lisboa"))

Decoding:

$decoded = gzinflate(base64_decode($encoded))

Note that gzdeflate performs better than gzcompress on short strings.

But anyway the problem with this is that for short strings it makes the string longer. This performs better on longer texts. It would be of course better to use some compression algorithm with a-priori information, like ppm or suffix method with initial suffix tree... then it would work perfectly on short strings also.

Community
  • 1
  • 1
Tomas
  • 57,621
  • 49
  • 238
  • 373
  • It would be of course better to use some **compression algorithm with a-priori information**, like ppm or suffix method with initial suffix tree... then it would work perfectly on short strings also. But then the question is if these methods are accessible within PHP. – Tomas Mar 30 '12 at 10:07
3

You cannot shorten arbitrary length strings to fixed length one.

What you can do is to create those short strings for the unique ID of the row of that specific string in the database. Here are some tips: How to design a sequential hash-like function.

Community
  • 1
  • 1
Karoly Horvath
  • 94,607
  • 11
  • 117
  • 176
1

This isn't necessarily deterministic, but obviously you could use a lookup table. The service would be similar to goo.gl or imgur