-1

I'm trying to work out a way to encode and decode binary data so that the newline character never appears in the encoded string.

It seems to be a recursive problem, but I can't seem to work out a solution.

For example, a naive implementation:

>>> original = 'binary\ndata'

>>> encoded = original.replace('\n', '=n')
'binary=ndata'
>>> decoded = encoded.replace('=n', '\n')
'binary\ndata'

What happens if there is already a =n in the original string?

>>> original = 'binary\ndata=n'

>>> encoded = original.replace('\n', '=n')
'binary=ndata=n'
>>> decoded = encoded.replace('=n', '\n')
'binary\ndata\n'  # wrong

Try to escape existing =n's, but then what happens if there is already an escaped =n?

>>> original = '++nbinary\ndata=n'

>>> encoded = original.replace('=n', '++n').replace('\n', '=n')
'++nbinary=ndata++n'

How can I get around this recursive problem?

Karl Knechtel
gak
  • What's wrong with [Base64](http://en.wikipedia.org/wiki/Base64)? – Dour High Arch Nov 22 '12 at 23:30
  • @DourHighArch The size of the encoded value is important. Base64 has a ~33% overhead. I need it to be reduced to be similar to the original size. – gak Nov 22 '12 at 23:35
  • Why do you "need" that? How about a [~25% overhead](http://en.wikipedia.org/wiki/Base85)? Does the output have to be ASCII? Why is a newline not acceptable? This sounds like an XY problem; you need to tell us where these strange requirements are coming from. – Dour High Arch Nov 23 '12 at 00:22
  • @DourHighArch What if there is no underlying requirement apart from it being an interesting problem? – gak Nov 23 '12 at 00:47
  • I'm having a similar problem. The requirement is that the binary data doesn't need to be encoded into text, it only needs to be encoded so that it can be delimited by a single character. It seems like even a 25% overhead is pretty high when I only need to make one character available. Additionally, because the binary data *probably* won't have cases of this escape-aliasing, it's acceptable to just repeatedly escape. – Lucretiel Oct 17 '13 at 16:11
  • "It seems to be a recursive problem" I have no idea what motivates this comment. There is definitely no recursion shown in any of the code examples, nor would it be useful in a solution. – Karl Knechtel Aug 05 '22 at 03:30
  • How can you have newline characters in binary data? Your question embodies a contradiction in terms. – user207421 Aug 05 '22 at 03:59

7 Answers

1

The way to encode strings that might contain the "escape" character is to escape the escape character as well. In Python, the escape character is a backslash, but you could use anything you want. Your cost is one extra character for every occurrence of newline or the escape.

To avoid confusing you, I'll use forward slash:

# original
>>> print("slashes / and /newline/\nhere")
slashes / and /newline/
here
# encoding
>>> print("slashes / and /newline/\nhere".replace("/", "//").replace("\n", "/n"))
slashes // and //newline///nhere

This encoding is unambiguous, since all real slashes are doubled; but it must be decoded in a single pass, so you can't just use two successive calls to replace():

# decoding
>>> import re
>>> def decode(c):
...     # Expand this into a real mapping if you have more substitutions
...     return '\n' if c == '/n' else c[0]
...
>>> encoded = "slashes // and //newline///nhere"
>>> print("".join(decode(c) for c in re.findall(r"(/.|.)", encoded)))
slashes / and /newline/
here

Note that there is an actual /n in the input (and another slash before the newline): it all works correctly anyway.
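To see why the decoder has to be single-pass, here is a small sketch (Python 3; the helper names are made up for illustration) in which both orders of two chained replace() calls mis-decode a string that contains a literal slash followed by the letter n:

```python
def encode(text):
    # Double real slashes first, then turn newlines into "/n".
    return text.replace("/", "//").replace("\n", "/n")

def naive_decode_newline_first(encoded):
    # Undo "/n" first, then undo the doubled slashes.
    return encoded.replace("/n", "\n").replace("//", "/")

def naive_decode_slash_first(encoded):
    # Undo the doubled slashes first, then undo "/n".
    return encoded.replace("//", "/").replace("/n", "\n")

original = "a/nb"            # a literal slash followed by n, no newline at all
encoded = encode(original)   # "a//nb"

print(repr(naive_decode_newline_first(encoded)))  # 'a/\nb'
print(repr(naive_decode_slash_first(encoded)))    # 'a\nb'
```

Neither result equals the original, because each replace() rescans the whole string and cannot tell which slash belongs to which pair. The regex tokenizer above avoids this by consuming the encoded string strictly left to right.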

alexis
1

Solution

original = 'binary\ndata \\n'
# encoded = original.encode('string_escape')                   # escape many chr
encoded = original.replace('\\', '\\\\').replace('\n', '\\n')  # escape \n and \\
decoded = encoded.decode('string_escape')

verified

>>> print encoded
binary\ndata \\n
>>> print decoded
binary
data \n

The solution is from How do I un-escape a backslash-escaped string in python?

Edit: I wrote it also with your ad-hoc economic encoding. The original "string_escape" codec escapes backslash, apostrophe and everything below chr(32) and above chr(126). Decoding is the same for both.
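The 'string_escape' codec only exists in Python 2. As a rough Python 3 sketch of the same economic encoding on bytes (the function names and the _ESCAPES table are my own, not part of any library), decoding can be done in one pass with re.sub:

```python
import re

# Maps each two-byte escape sequence back to the byte it stands for.
_ESCAPES = {b"\\\\": b"\\", b"\\n": b"\n"}

def encode(data: bytes) -> bytes:
    # Escape backslashes first so the backslashes added for newlines stay single.
    return data.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")

def decode(encoded: bytes) -> bytes:
    # Each match is a backslash plus the byte after it; matches never overlap,
    # so the string is consumed left to right in a single pass.
    # An unknown escape such as b"\\x" raises KeyError in this sketch.
    return re.sub(rb"\\(.)", lambda m: _ESCAPES[m.group(0)], encoded, flags=re.DOTALL)

original = b"binary\ndata \\n"
encoded = encode(original)
print(encoded)                       # b'binary\\ndata \\\\n'
print(decode(encoded) == original)   # True
```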

hynekcer
  • Good idea. I did toy with string_escape, but didn't think to only use it for decoding. – gak Nov 24 '12 at 19:38
0

If you encoded the entire string systematically, would you not end up escaping it? Say for every character you do chr(ord(char) + 1) or something trivial like that?

nair.ashvin
  • What if there is a chr(ord('\n') - 1) in the original string? Wouldn't the encoded string have a `\n` in it? – gak Nov 22 '12 at 23:40
  • Ah, okay. So yeah, I can't think of a clever way to not use a character that you simply do not map to anything. *Accepts defeat* – nair.ashvin Nov 22 '12 at 23:49
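The objection in the comments can be checked directly. A minimal sketch, assuming the shift wraps modulo 256 (the answer does not say what happens at chr(255)):

```python
def shift(data: bytes) -> bytes:
    # Add 1 to every byte value, wrapping 255 back to 0.
    return bytes((b + 1) % 256 for b in data)

print(shift(b"\t"))                         # b'\n'
print(len(set(shift(bytes(range(256))))))   # 256
```

The shift is a one-to-one mapping over all 256 byte values, so every value, newline included, is still produced by some input. Any fixed-length substitution has the same limitation.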
0

I don't have a great deal of experience with binary data, so this may be completely off/inefficient/both, but would this get around your issue?

In [40]: original = 'binary\ndata\nmorestuff'

In [41]: nlines = [index for index, i in enumerate(original) if i == '\n']

In [42]: encoded = original.replace('\n', '')

In [43]: encoded
Out[43]: 'binarydatamorestuff'

In [44]: decoded = list(encoded)

In [45]: map(lambda x: decoded.insert(x, '\n'), nlines)
Out[45]: [None, None]

In [46]: decoded = ''.join(decoded)

In [47]: decoded
Out[47]: 'binary\ndata\nmorestuff'

Again, I am sure there is a much better/more accurate way - this is just from a novice perspective.

RocketDonkey
  • Interesting idea. There's a step missing: you also need to encode the positions in the encoded string. – gak Nov 22 '12 at 23:58
  • @GeraldKaszuba So what is the desired behavior after encoding? Agree that it is an interesting problem :) – RocketDonkey Nov 23 '12 at 01:06
  • Basically the encoded string has to store all the information required to be able to decode it, e.g. saving it into a file. In your example you're storing a variable in Python with "extra" information to help it decode, but that information can't be used when another process tries to decode the file. I hope that explains a bit more :) – gak Nov 23 '12 at 01:46
  • @GeraldKaszuba Ah gotcha, that makes sense. And you can only store the string itself? Extending the above example, I assume it isn't feasible to encode the string and write it to a tuple with its corresponding index of newline positions? Or does it have to be a string and the information has to be encoded in it somehow? – RocketDonkey Nov 23 '12 at 02:00
  • It might be possible to save the list into the string, but then you would have to decode that information too, while differentiating it from the other part of the encoded string. For example, the first byte might be the number of `\n`'s, the next set of bytes would be a `struct.pack` of integers of that length, then the rest would be the encoded string as in your code example. It seems this might be more complicated than it should be :) – gak Nov 24 '12 at 19:42
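The header layout described in the last comment could look roughly like the sketch below (Python 3). The field sizes are my assumption: I use a 4-byte big-endian count instead of a single byte, followed by one 4-byte position per newline.

```python
import struct

def encode(data: bytes) -> bytes:
    # Record where the newlines were, then strip them from the body.
    positions = [i for i, b in enumerate(data) if b == 0x0A]
    header = struct.pack(">I", len(positions))
    header += struct.pack(f">{len(positions)}I", *positions)
    return header + data.replace(b"\n", b"")

def decode(blob: bytes) -> bytes:
    (count,) = struct.unpack(">I", blob[:4])
    body_start = 4 + 4 * count
    positions = struct.unpack(f">{count}I", blob[4:body_start])
    body = bytearray(blob[body_start:])
    # Positions are ascending, so each index is valid at the time of insertion.
    for pos in positions:
        body.insert(pos, 0x0A)
    return bytes(body)

original = b"binary\ndata\nmorestuff"
print(decode(encode(original)) == original)   # True
```

The round trip works, but the packed integers can themselves contain the byte 0x0A (a newline at index 10, or exactly 10 newlines, for instance), so the header would still need a newline-free encoding of its own. It also costs 4 bytes per newline here, which is more than the 1 extra byte that escaping costs.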
0

If you are encoding an alphabet of n symbols (e.g. ASCII) into a smaller set of m symbols (e.g. ASCII except newline) you must allow the encoded string to be longer than the original string.

The typical way of doing this is to define one character as an "escape" character; the character following the "escape" represents an encoded character. This technique has been used since the 1940s in teletypewriters; that's where the "Esc" key you see on your keyboard came from.

Python (and other languages) already provide this in strings with the backslash character. Newlines are encoded as '\n' (or '\r\n'). Backslashes escape themselves, so the literal string '\r\n' would be encoded '\\r\\n'.

Note that the encoded length of a string that includes only the escaped character will be double that of the original string. If that is not acceptable you will have to use an encoding that uses a larger alphabet to avoid the escape characters (which may be longer than the original string) or compress it (which may also be longer than the original string).
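To put rough numbers on the overhead, here is a small sketch (Python 3; escaped_length is a hypothetical helper that counts bytes for a backslash-style escape rather than performing the encoding):

```python
import math

def escaped_length(data: bytes, escape: bytes = b"\\") -> int:
    # Each newline and each escape byte costs one extra byte after escaping.
    return len(data) + data.count(b"\n") + data.count(escape)

worst = b"\n" * 1000
uniform = bytes(range(256)) * 100   # every byte value equally often

print(escaped_length(worst) / len(worst))      # 2.0
print(escaped_length(uniform) / len(uniform))  # 1.0078125

# Lower bound on average expansion for any code that carries 256 input
# values using an alphabet of only 255 output values.
print(math.log(256) / math.log(255))           # about 1.0007
```

So escaping doubles the worst case, costs under 1% when all byte values are equally likely, and no scheme can average below roughly 0.07% expansion over all inputs.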

Dour High Arch
0

How about:

In [8]: import urllib

In [9]: original = 'binary\ndata'

In [10]: encoded = urllib.quote(original)

In [11]: encoded
Out[11]: 'binary%0Adata'

In [12]: urllib.unquote(encoded)
Out[12]: 'binary\ndata'
NPE
  • `urllib.quote` escapes other characters. The question is specifically for just escaping `\n`. – gak Nov 24 '12 at 19:35
0

The escapeless encodings are specifically designed to trim off certain characters from binary data. In your case of removing just the \n character, the overhead will be less than 0.4%.
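I can't reproduce the escapeless algorithm itself here, so as a stand-in the sketch below uses a different, well-known technique with a similar bound: Consistent Overhead Byte Stuffing (COBS). COBS removes every 0x00 byte at a cost of at most one byte per 254 bytes of input (about 0.4%) plus one; XOR-ing its output with 0x0A then moves the forbidden value from 0x00 to the newline. All function names are my own.

```python
def cobs_encode(data: bytes) -> bytes:
    out = bytearray()
    block = bytearray()          # run of non-zero bytes waiting to be flushed
    for b in data:
        if b == 0:
            out.append(len(block) + 1)   # code byte also stands for the zero
            out += block
            block.clear()
        else:
            block.append(b)
            if len(block) == 254:
                out.append(0xFF)         # full block, no zero implied
                out += block
                block.clear()
    out.append(len(block) + 1)           # final block, no trailing zero
    out += block
    return bytes(out)

def cobs_decode(enc: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(enc):
        code = enc[i]
        if code == 0 or i + code > len(enc):
            raise ValueError("malformed COBS data")
        out += enc[i + 1:i + code]
        i += code
        # Codes below 0xFF imply a zero, except for the final block.
        if code < 0xFF and i < len(enc):
            out.append(0)
    return bytes(out)

def encode_no_newline(data: bytes) -> bytes:
    # COBS output never contains 0x00, so after XOR it never contains 0x0A.
    return bytes(b ^ 0x0A for b in cobs_encode(data))

def decode_no_newline(enc: bytes) -> bytes:
    return cobs_decode(bytes(b ^ 0x0A for b in enc))

original = b"binary\ndata\n" * 100
encoded = encode_no_newline(original)
print(b"\n" in encoded)                        # False
print(decode_no_newline(encoded) == original)  # True
```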

Ivan Kosarev