2

We're using a file-system/URL-safe variation of base64 encoding in which:

"=" replaced with ""  
"+" replaced with "-"  
"/" replaced with "_"  

We are now using Azure blob storage that does not allow use of "_" within container names.

We are base64 encoding a Guid. If I were to replace the underscore with, say, a "0", am I at risk of collisions?

Update

Not sure why the downvote. But to clarify.

Why not just use a Guid?

  1. The Guid is the id of an entity within my application. Since the paths are public, I don't really like exposing the Id, hence why I'm encoding it.
  2. I want shorter and more friendly looking paths. Contrary to one of the comments below, the base 64 encoding is NOT longer:

    Guid: 5b263cdd-2bc2-485d-83d4-81b96930dc5a
    Base64 Encoded: 3TwmW8IrXUiD1IG5aTDcWg== (even shorter after removing ==)
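For reference, here is a minimal sketch of the encoding described above. The class and method names are made up for illustration; only the three replacements come from the scheme listed at the top of the question.

```csharp
using System;

public static class GuidPath
{
    // Base64-encode the 16 GUID bytes, drop the "==" padding,
    // then swap "+" for "-" and "/" for "_".
    public static string ToUrlSafeBase64(Guid id)
    {
        return Convert.ToBase64String(id.ToByteArray())
            .TrimEnd('=')
            .Replace('+', '-')
            .Replace('/', '_');
    }

    // Reverse the replacements. 16 bytes always encode with exactly
    // two padding characters, so "==" can simply be appended again.
    public static Guid FromUrlSafeBase64(string text)
    {
        string base64 = text.Replace('-', '+').Replace('_', '/') + "==";
        return new Guid(Convert.FromBase64String(base64));
    }
}
```

Calling `GuidPath.ToUrlSafeBase64` on the Guid above gives `3TwmW8IrXUiD1IG5aTDcWg`, and `FromUrlSafeBase64` maps that string back to the same Guid.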

(Another) Update

Seems there is some confusion about what it is I'm trying to achieve (so sorry about that). Here's the short version.

  • I have a Guid that represents an entity in my application.
  • I need to create a publicly accessible directory for the entity (via a Url).
  • I don't want to use the Guid as the directory name, for the reasons above.
  • I asked previously on SO about how I could generate a friendlier looking Url that guaranteed uniqueness and did not expose the original Guid. The suggestion was Base64 encoding.
  • This has worked fine until recently when we needed to use Azure blob storage, which does not allow underscores "_" in its directory (Container) names.

This is where I'm at.

Ben Foster
  • Why do you need to use base-64 encoding to encode a GUID? The only characters valid in a GUID are '{', '}', '0'-'9', 'A'-'F' and '-'. – BlueMonkMN Jul 28 '11 at 11:09
  • What's the thinking process behind "let's encode a GUID with base64 because GUID has invalid chars and base64 has even more"? – VVS Jul 28 '11 at 11:15
  • @VVS: And a GUID has no invalid characters! – R. Martinho Fernandes Jul 28 '11 at 11:15
  • @Martinho Fernandes: I don't know the Azure storage and so I thought the OP confused "_" with "-", which is part of a GUID. – VVS Jul 28 '11 at 11:17
  • Which file systems don't like + and = characters in filenames? – David Heffernan Jul 28 '11 at 11:17
  • The reason is because we wanted to generate paths that are as short and friendly looking as possible, without losing uniqueness. @David - So file system/url safe and as per my question, Azure doesn't allow these. – Ben Foster Jul 28 '11 at 11:19
  • @Ben: base64-encoded data is always [longer](http://stackoverflow.com/questions/4715415/base64-what-is-the-worst-possible-increase-in-space-usage/4715480#4715480) than the original data. – R. Martinho Fernandes Jul 28 '11 at 11:23
  • What's "short and friendly" about a GUID? And which characters in a GUID are problematic? – jalf Jul 28 '11 at 11:24
  • @Martinho - disagree: Guid: 5b263cdd-2bc2-485d-83d4-81b96930dc5a Encoded: 3TwmW8IrXUiD1IG5aTDcWg== – Ben Foster Jul 28 '11 at 11:33
  • Well, I was misled by the mention that you were encoding a string. The string "5b263cdd-2bc2-485d-83d4-81b96930dc5a" is represented by the base64 string "NWIyNjNjZGQtMmJjMi00ODVkLTgzZDQtODFiOTY5MzBkYzVh". The 16 bytes of a GUID are represented by a 24-byte base64 string. – R. Martinho Fernandes Jul 28 '11 at 11:41
  • If you want shorter paths, why are you accepting a base16 answer? – David Heffernan Jul 28 '11 at 11:53
  • Well, base16 is shorter, but as I just realized this doesn't really do a very good job of masking the original Guid (seems equivalent to just removing the dashes). By all means, provide me with another option. – Ben Foster Jul 28 '11 at 12:00
  • @Ben Tell us what are you really trying to do. – R. Martinho Fernandes Jul 28 '11 at 12:02
  • I've provided another update. I skipped over why I was using base64 encoding originally as it was suggested to me on another question on SO. – Ben Foster Jul 28 '11 at 12:11
  • You do know that [base64 is reversible](http://thedailywtf.com/Articles/Encrypted-XML.aspx), right? – R. Martinho Fernandes Jul 28 '11 at 12:13
  • @Martinho - sure, and it's not the end of the world if someone does work out the original Guid. – Ben Foster Jul 28 '11 at 12:21
  • It sounds like you want encryption rather than encoding. – David Heffernan Jul 28 '11 at 13:03
  • If you base64 encode it you still expose the ID. Anyone can base64 decode your string. – teknopaul Feb 06 '19 at 17:54

4 Answers

7

Just "encode" the GUID in base16. The only characters it uses are the hex digits 0-9 and a-f (Guid.ToString("N") emits them in lowercase), which should be safe for most purposes.

var encoded = guid.ToString("N");
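To show that nothing is lost in the process, here is a small sketch of the round trip. The `HexName` class and its method names are only illustrative; `Guid.ToString("N")` and `Guid.ParseExact` are the standard .NET calls.

```csharp
using System;

public static class HexName
{
    // 32 lowercase hex digits, no hyphens or braces.
    public static string Encode(Guid id) => id.ToString("N");

    // ParseExact with the "N" format accepts only the 32-digit form.
    public static Guid Decode(string name) => Guid.ParseExact(name, "N");
}
```

For the Guid in the question, `HexName.Encode` returns `5b263cdd2bc2485d83d481b96930dc5a`, and `HexName.Decode` turns that back into the original value.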
R. Martinho Fernandes
  • Using Base16 results in a 33% longer string than using Base64. Having said that, if the OP finds a 24-character random-ish string "short and friendly" then I'm sure they wouldn't have too much trouble with a 32-character string either. – LukeH Jul 28 '11 at 11:36
  • @LukeH but it's 400% friendlier because it uses less distinct characters! :) – R. Martinho Fernandes Jul 28 '11 at 11:39
  • @Martinho, agree. I've also updated my question as to why I was encoding. Do I have any risk of collisions with this? – Ben Foster Jul 28 '11 at 11:43
  • @Ben: It's a 1-to-1 map, so you only have collisions if you have colliding GUIDs. – R. Martinho Fernandes Jul 28 '11 at 11:45
  • Just realized, this doesn't really mask the original Guid. Seems equivalent to just removing the "-"? – Ben Foster Jul 28 '11 at 11:59
  • @Ben and what's the problem with that? – R. Martinho Fernandes Jul 28 '11 at 12:01
  • @Ben: I thought you wanted some way of encoding a GUID that didn't have any invalid characters. This one fits that purpose. I won't suggest an alternative if you don't tell what other requirements you have. – R. Martinho Fernandes Jul 28 '11 at 12:06
  • To be fair, the question was that I'm *already* base64 encoding the Guid but I can't use underscores. However, I realize I should have been a bit more descriptive (hence my update). – Ben Foster Jul 28 '11 at 12:14
  • In the end I just went for the base 16 representation. Trying to make a shorter/masked version whilst guaranteeing uniqueness was just not worth the hassle. – Ben Foster Jul 28 '11 at 22:51
4

The base 64 character set is

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=

So you can't use 0 since it is already in use.
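A concrete pair makes the collision visible. The sketch below assumes the scheme from the question with "_" swapped for "0"; the class name and the two byte arrays are made up for the demonstration.

```csharp
using System;

public static class UnderscoreCollision
{
    // URL-safe base64 as in the question, except that "/" becomes "0"
    // instead of "_".
    public static string Encode(byte[] bytes)
    {
        return Convert.ToBase64String(bytes)
            .TrimEnd('=')
            .Replace('+', '-')
            .Replace('/', '0');
    }

    public static void Show()
    {
        var a = new byte[16];
        a[0] = 0xFC; // top six bits 111111: first base64 char is '/'
        var b = new byte[16];
        b[0] = 0xD0; // top six bits 110100: first base64 char is '0'

        Console.WriteLine(Encode(a));                   // 0AAAAAAAAAAAAAAAAAAAAA
        Console.WriteLine(Encode(b));                   // 0AAAAAAAAAAAAAAAAAAAAA
        Console.WriteLine(new Guid(a) == new Guid(b));  // False
    }
}
```

Two different Guids end up with the same name, and the mapping can no longer be reversed.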

David Heffernan
0

Encoding your identifiers does not encrypt them. Any technically savvy observer can base64-decode an identifier. If you want to make your paths opaque, then either encrypt them or hash them with a salt. If you do want to keep your paths transparent, just use hex without any hyphens or braces. That way, your UUID is serialized to 32 characters, well within the 63-character limit on Azure container names.
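As a rough sketch of the "hash with a salt" option in C#, one possibility is a keyed hash (HMAC-SHA256) with a secret key standing in for the salt. The `OpaqueName` class, the `secretKey` parameter, and the choice to truncate to 32 hex characters are all assumptions for illustration, not a vetted design.

```csharp
using System;
using System.Security.Cryptography;

public static class OpaqueName
{
    // Hypothetical helper: derives a name from the GUID bytes and a secret
    // key that must stay private to the application.
    public static string FromGuid(Guid id, byte[] secretKey)
    {
        using (var hmac = new HMACSHA256(secretKey))
        {
            byte[] digest = hmac.ComputeHash(id.ToByteArray());

            // 32 digest bytes would give 64 hex characters, one more than a
            // container name allows, so keep the first 32 (128 bits).
            return BitConverter.ToString(digest)
                .Replace("-", "")
                .ToLowerInvariant()
                .Substring(0, 32);
        }
    }
}
```

Unlike an encoding, this is not reversible: the application has to recompute the name from the Guid or store the mapping. A truncated hash also makes collisions extremely unlikely rather than impossible.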


If you really want shorter and funnier container names, and if Azure supports internationalized domain names, Braille encoding fits the bill as the least typable option. Here's a Haskell one-liner for generating a UUIDv4, mapping each octet of the UUID to a braille letter and encoding the resulting string in UTF-16BE (for a total of 32 octets).

import Data.Binary (encode)
import Data.ByteString.Lazy (ByteString, intersperse, cons)
import Data.Functor ((<&>))
import Data.UUID.V4 (nextRandom)

-- Prefix every UUID octet with 0x28 (decimal 40), so that each UTF-16BE
-- code unit falls in the braille block U+2800..U+28FF.
braille :: IO ByteString
braille = nextRandom <&> encode <&> intersperse 40 <&> cons 40

(In F#, |> would be used instead of <&>.)

For your amusement, see the following gist for how to convert an octet-stream into UTF-16LE or UTF-8 encoded braille strings which makes each bit literally stand out.

https://gist.github.com/bjartur/ea5db281f0b88128455ed79621abbd1d

0

Instead of taking base64 and changing 4 characters, you could encode your data in base60.

Your base60 char list doesn't contain the 4 chars you don't like and so there's no need to replace anything.
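A minimal sketch of what that could look like in C#, treating the 16 Guid bytes as one big number. The alphabet here (digits, A-Z, a-x) is an arbitrary pick for illustration, as are the class name and the fixed length of 22. Azure container names also have to be lowercase, so a real alphabet for that purpose would need to be smaller still.

```csharp
using System;
using System.Linq;
using System.Numerics;

public static class Base60Sketch
{
    // Hypothetical 60-character alphabet: 0-9, A-Z, a-x.
    private const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwx";

    // 60^22 > 2^128, so 22 digits are enough for any GUID.
    private const int Length = 22;

    public static string Encode(Guid id)
    {
        // The trailing zero byte stops BigInteger reading the value as negative.
        byte[] bytes = id.ToByteArray().Concat(new byte[] { 0 }).ToArray();
        BigInteger value = new BigInteger(bytes);

        var chars = new char[Length];
        for (int i = Length - 1; i >= 0; i--)
        {
            value = BigInteger.DivRem(value, Alphabet.Length, out BigInteger remainder);
            chars[i] = Alphabet[(int)remainder];
        }
        return new string(chars); // fixed length keeps the mapping 1-to-1
    }

    public static Guid Decode(string text)
    {
        BigInteger value = BigInteger.Zero;
        foreach (char c in text)
        {
            value = value * Alphabet.Length + Alphabet.IndexOf(c);
        }

        // ToByteArray is little-endian and may be shorter than 16 bytes
        // or carry an extra sign byte, so copy into a fixed 16-byte buffer.
        byte[] raw = value.ToByteArray();
        var guidBytes = new byte[16];
        Array.Copy(raw, guidBytes, Math.Min(raw.Length, 16));
        return new Guid(guidBytes);
    }
}
```

The result is always 22 characters, the same length as base64 without padding, and `Decode(Encode(id))` gives back the original Guid.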

VVS