
I have a string returned by Elasticsearch:

\n\nESM Management LLC (\u201cESM\u201d) provides investment

When I print the string, it appears exactly as shown above. I've tried decoding it, i.e. s.decode('utf8'), but I think there is something fundamental I don't understand about the encoding/decoding process.

How can I convert this string so that the newlines are rendered and the Unicode escape codes are converted to the characters they represent?

This is what I'm looking for:

>>> s = '\n\nESM Management LLC (\u201cESM\u201d) provides investment'
>>> s


ESM Management LLC ("ESM") provides investment
  • Is the output being displayed in a terminal window? If so, what character set is the terminal configured to display? Perhaps it doesn't have the characters that you want, so it's falling back to using the numeric codes. – John Gordon Apr 05 '16 at 19:16
  • Same behavior in iTerm 2.9 and Sublime Text 3. iTerm Character Encoding: Unicode (UTF-8). Report Terminal Type: xterm-256color – chishaku Apr 05 '16 at 19:22
  • Please confirm whether you are actually printing (i.e. using a print statement) or just echoing it in the REPL like in your example. This is relevant to the answer. – wim Apr 05 '16 at 19:29
  • Also, if you have stuff like `\u201cESM\u201d` in a bytestring, then you already screwed up earlier, and have to fix it earlier – wim Apr 05 '16 at 19:34
  • @wim I am printing in a terminal. The REPL representation was just for explanation. Noted your last comment, can you expand a little bit? I'm assuming I should have encoded the text before indexing? The raw input is from pdf extraction and the output went straight into a field in an elasticsearch (json) document. – chishaku Apr 05 '16 at 19:45
  • OK, because in the REPL it uses `__repr__` and there is no sane way to see the special characters without using a pretty-printer for that. Your problem is probably at the indexing stage so we would need to see that code to get to the root cause. – wim Apr 05 '16 at 19:48
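
The REPL-versus-print distinction raised in these comments can be checked directly. The following is a minimal Python 3 sketch (the thread itself is about Python 2, so the exact echo differs slightly there); the variable names are illustrative only.

```python
# Python 3 sketch: a bare expression at the >>> prompt shows repr(),
# while print() writes the str() form with real line breaks.
text = '\n\nESM Management LLC (\u201cESM\u201d) provides investment'

echoed = repr(text)       # what a bare `text` at the prompt displays
ascii_echo = ascii(text)  # ASCII-only repr, close to what Python 2 echoes for a unicode string
printed = str(text)       # what print(text) sends to the output stream

print(ascii_echo)         # '\n\nESM Management LLC (\u201cESM\u201d) provides investment'
print(printed.count('\n'), 'real newline characters in the printed form')
```

So seeing `\n` or `\u201c` in an echoed value does not by itself mean the data is wrong. It only indicates a problem when the escapes survive an actual print.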

1 Answer


Looks like you are using Python 2.

  1. Use unicode for such literals.
  2. Encode to stdout encoding to make sure it's printed correctly.


import sys

s = u'\n\nESM Management LLC (\u201cESM\u201d) provides investment'
print s.encode(sys.stdout.encoding)


ESM Management LLC (“ESM”) provides investment

If, as you say at the bottom, it is a byte string coming from somewhere else, you can't use a unicode literal. Decode it using 'unicode-escape' instead.

s = '\n\nESM Management LLC (\u201cESM\u201d) provides investment'
print s.decode(encoding='unicode-escape').encode(sys.stdout.encoding)


ESM Management LLC (“ESM”) provides investment
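
For readers on Python 3, where `str` has no `.decode()` method, a rough equivalent of the second snippet might look like the sketch below. The helper name `unescape` is made up for illustration, and the sketch assumes the input is pure ASCII containing literal backslash escapes, as in the question.

```python
# Python 3 sketch (the answer's own code is Python 2).
def unescape(raw: str) -> str:
    """Turn literal backslash escapes such as \\n and \\u201c into characters.

    Assumes `raw` is pure ASCII. The unicode_escape codec treats non-ASCII
    bytes as Latin-1, so encoding to ASCII first makes bad input fail loudly
    instead of being silently corrupted.
    """
    return raw.encode('ascii').decode('unicode_escape')

# Raw string literal: the backslashes are real characters here, as in the question.
raw = r'\n\nESM Management LLC (\u201cESM\u201d) provides investment'
decoded = unescape(raw)
print(decoded)
```

If the text may already contain non-ASCII characters, `encode('ascii')` raises `UnicodeEncodeError`, which is a signal that this approach does not fit that input.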

EDIT: As @wim explains in the comments, encoding to sys.stdout.encoding is probably not needed, since print will do it anyway. On the other hand, additional decoding might be necessary depending on the terminal and shell encodings, but I am not sure exactly what should be done. I will leave the answer as is, since it helped the OP. See this excellent answer for more info on this topic.
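
The point that print does the encoding itself can be illustrated with a small Python 3 sketch. It uses an in-memory byte buffer wrapped in a UTF-8 text layer as a stand-in for a terminal; that stand-in is an assumption made so the result can be inspected, not what a real terminal session looks like.

```python
import io

text = '\n\nESM Management LLC (\u201cESM\u201d) provides investment'

# Stand-in for a UTF-8 terminal: a text layer that encodes on top of a byte buffer.
raw_bytes = io.BytesIO()
stream = io.TextIOWrapper(raw_bytes, encoding='utf-8', newline='\n')

print(text, file=stream)  # no explicit .encode() call
stream.flush()

written = raw_bytes.getvalue()
print(written)            # the curly quotes arrive as the UTF-8 bytes e2 80 9c and e2 80 9d
```

In Python 3, `sys.stdout` is a text stream of this kind, so passing it already-encoded bytes would print their `b'...'` representation rather than the text.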

  • While I recognize my error in not dealing with encoding before indexing the documents, the second solution satisfied my immediate need. Now I don't need to re-index those 200k documents to get results today. Thanks @Goyo. – chishaku Apr 05 '16 at 21:17
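
On the indexing point: the thread never shows the indexing code, so the root cause is unknown. The Python 3 sketch below only illustrates one hypothetical way literal `\u201c` sequences could end up in a stored JSON field, namely escaping the text before it is serialized. All names are illustrative.

```python
import json

original = '\n\nESM Management LLC (\u201cESM\u201d) provides investment'

# Serialized once: the escapes belong to the JSON syntax and vanish on parsing.
once = json.dumps(original)
assert json.loads(once) == original

# Hypothetical mistake: an already-escaped representation is stored as the field value.
escaped_first = original.encode('unicode_escape').decode('ascii')
stored = json.dumps(escaped_first)
read_back = json.loads(stored)

print(read_back)  # the backslashes are now part of the data itself
```

In that scenario the lasting fix is to index the unescaped text, while decoding with unicode-escape on read, as in the answer, is a workaround for documents that are already stored.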