
I'm looking for an efficient way to translate escape sequences in a (Unicode) string into their target characters. The strings are parsed from a language file, and we want to transform them according to the rules below. (Note: these escaping rules differ from those of Python itself.)

\uxxxx (four hex digits) --> gives the Unicode character with the given code point
\LF \CR \CR+LF  --> '' : a backslash character followed by a line break removes
                         both of them; the line break is not platform-specific.
(For example: "aa\\\nbb", "aa\\\rbb", "aa\\\r\nbb" all give "aabb")

\f --> FF char
\n --> LF char
\r --> CR char
\t --> TAB char
\C where C is any other *Unicode* character  ---> gives C itself.
  This includes the escaped backslash '\\' sequence. Escapes are consumed
  from left to right, so an escaped backslash is resolved before whatever follows it:

  r'\\\\u0050' --> r'\\u0050'
  r'\\\\\u0050' --> r'\\P'

(These rules are broadly similar to the escaping rules of many languages, for example Perl and Ruby, if I'm not mistaken.)

(Please note: my use of raw or normal string literals in the examples is only for illustration, to show exactly how the strings are translated.)

With such rules, is it possible to improve on the most naive method of looping through the string with lookaheads and appending to a target string along the way?
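
For reference, the naive method I have in mind might look roughly like the sketch below. The name `unescape_naive` is hypothetical, and two behaviours are assumptions that the rules above do not cover: a `\u` that is not followed by four hex digits falls back to the `\C` rule (giving `u`), and a lone trailing backslash is kept unchanged.

```python
def unescape_naive(s):
    """Baseline: walk the string once, looking ahead after each backslash."""
    simple = {"f": "\f", "n": "\n", "r": "\r", "t": "\t"}
    hex_digits = set("0123456789abcdefABCDEF")
    out = []
    i, n = 0, len(s)
    while i < n:
        ch = s[i]
        # Ordinary character, or a lone trailing backslash (assumption: keep it).
        if ch != "\\" or i + 1 == n:
            out.append(ch)
            i += 1
            continue
        nxt = s[i + 1]
        code = s[i + 2:i + 6]
        if nxt == "u" and len(code) == 4 and all(c in hex_digits for c in code):
            # \uxxxx: the character with that code point.
            out.append(chr(int(code, 16)))
            i += 6
        elif nxt == "\r":
            # Backslash + CR or CR+LF: drop the whole sequence.
            i += 3 if s[i + 2:i + 3] == "\n" else 2
        elif nxt == "\n":
            # Backslash + LF: drop both.
            i += 2
        else:
            # \f \n \r \t map to control characters; any other C gives C itself.
            out.append(simple.get(nxt, nxt))
            i += 2
    return "".join(out)


print(unescape_naive(r"\\\\\u0050"))  # prints \\P
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation, but the loop itself still runs in Python for every character.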

A somewhat similar question here offers answers based on splitting and re-joining the string, but I don't think that can be applied here, because successive escapes have to be consumed in order.
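
One direction that might sidestep the successive-escapes problem is a single `re.sub` pass with a callback: the regex engine resumes scanning right after each match, so escapes are consumed strictly from left to right. This is only a sketch, not a measured improvement. The names `unescape`, `_ESCAPE`, `_SIMPLE` and `_replace` are hypothetical, and it assumes that a `\u` without four hex digits is treated as `\C` and that a lone trailing backslash is left unchanged.

```python
import re

# Single-letter escapes that map to control characters.
_SIMPLE = {"f": "\f", "n": "\n", "r": "\r", "t": "\t"}

# A backslash followed by exactly one of (alternatives are tried in this order):
#   1. u + four hex digits
#   2. a line break (CR+LF is tried before a bare CR)
#   3. any other single character (DOTALL lets "." match anything)
_ESCAPE = re.compile(r"\\(?:u([0-9a-fA-F]{4})|(\r\n|\r|\n)|(.))", re.DOTALL)


def _replace(match):
    hex_code, line_break, char = match.groups()
    if hex_code is not None:
        return chr(int(hex_code, 16))
    if line_break is not None:
        return ""  # backslash + line break removes both
    return _SIMPLE.get(char, char)  # \f \n \r \t, otherwise C itself


def unescape(s):
    """Translate all escape sequences in one left-to-right pass."""
    return _ESCAPE.sub(_replace, s)


print(unescape(r"\\\\u0050"))   # prints \\u0050
print(unescape(r"\\\\\u0050"))  # prints \\P
```

Because `\\` is matched as one unit by the third alternative, the backslash it produces can never be re-read as the start of another escape.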

Basel Shishani
  • I don't understand your second pattern matching example at all. What do you mean by "an escape followed by..." - there is no escape. Also, which Python version? – Tim Pietzcker Sep 26 '13 at 10:31
  • Py3 only is okay. These are not formal regex patterns, just illustrations of the rule. For example: "aa\\\nbb", "aa\\\rbb", "aa\\\r\nbb" all give "aabb". I'll try to rephrase. – Basel Shishani Sep 26 '13 at 10:40
  • I see. How are you getting these strings? Are you reading them from a file? If so, how come they are raw strings? If they weren't raw strings, it seems most of the job could be done by a simple `print()`, but that can't quite be it... – Tim Pietzcker Sep 26 '13 at 10:50
  • We're reading from a file. My usage of raw string or normal strings is just for illustration as suitable. – Basel Shishani Sep 26 '13 at 10:54
  • We are parsing a language file and want to apply these translations on parsed strings. – Basel Shishani Sep 26 '13 at 11:00
  • http://stackoverflow.com/questions/10944907/python-unescape-xxx is quite close to what you need, but not exactly – Maxim Razin Sep 27 '13 at 00:59
