0

Some hapless coworker saved some data into a file like this:

s = b'The em dash: \xe2\x80\x94'
with open('foo.txt', 'w') as f:
    f.write(str(s))

when they should have used

s = b'The em dash: \xe2\x80\x94'
with open('foo.txt', 'w') as f:
    f.write(s.decode())

Now foo.txt looks like

b'The em dash: \xe2\x80\x94'

instead of

The em dash: —

I already read this file as a string:

with open('foo.txt') as f:
    bad_foo = f.read()

Now how can I convert bad_foo from the incorrectly-saved format to the correctly-saved string?

shadowtalker
  • `.decode` doesn't make sense without an encoding name. Why are you using byte strings in the first place, anyway? The idiomatic way to do this is to use a Unicode string and let Python encode it when writing to a file. – tripleee Dec 11 '18 at 18:32
  • @tripleee someone else did it, and I've been tasked with undoing it :) – shadowtalker Dec 11 '18 at 18:32
  • I suspect nothing much more useful than `eval` can be suggested for undoing this. – tripleee Dec 11 '18 at 18:32
  • @tripleee this was intended as a self-answer. See https://stackoverflow.com/a/53730411/2954547 – shadowtalker Dec 11 '18 at 18:39
  • Also in Py3 .decode() uses UTF-8 by default. – shadowtalker Dec 11 '18 at 18:40
  • Does `eval(bad_foo).decode()` solve it? Can't check it myself right now. – LuckyJosh Dec 11 '18 at 18:44
  • @shadowtalker there's an "answer your own question" checkbox just below the "Post Your Question" button on the Ask Question page that lets you get your answer in before the competition ;-) – snakecharmerb Dec 11 '18 at 18:44
  • @LuckyJosh `eval` is not safe against malicious input - see answers to [this question](https://stackoverflow.com/q/661084/5320906). `ast.literal_eval` is safer. – snakecharmerb Dec 11 '18 at 18:50
  • @snakecharmerb Yeah, I know that `eval` is unsafe. I guess, I should have included a warning in my comment. But `ast.literal_eval` I did not know about, learned something new today. ;) – LuckyJosh Dec 11 '18 at 19:21

3 Answers

3

You can try `ast.literal_eval`:

from ast import literal_eval
test = r"b'The em-dash: \xe2\x80\x94'"
print(test)
res = literal_eval(test)
print(res.decode())
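
To check this against the exact situation in the question, here is a minimal sketch that reproduces the bad write in a temporary file and then recovers the text. The helper name `recover_text` and the temporary directory are assumptions made for illustration; they are not part of the original code.

```python
import ast
import os
import tempfile


def recover_text(bad: str) -> str:
    """Turn the repr of a bytes object (e.g. "b'...'") back into text."""
    value = ast.literal_eval(bad.strip())
    if not isinstance(value, bytes):
        raise ValueError("input is not the repr of a bytes object")
    # Assumes the original bytes were UTF-8, as in the question.
    return value.decode("utf-8")


# Reproduce the mistake from the question in a temporary file.
s = b"The em dash: \xe2\x80\x94"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(s))  # writes the repr, including the b'...' wrapper
    with open(path, encoding="utf-8") as f:
        bad_foo = f.read()

fixed = recover_text(bad_foo)
print(fixed)
```

The `isinstance` check is there because `literal_eval` will happily return a list, number, or string if the file contains one; rejecting anything that is not `bytes` keeps the failure explicit.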
Paritosh Singh
1

If you trust that the input is not malicious, you can use ast.literal_eval on the broken string.

import ast

# Create a sad broken string
s = r"b'The em-dash: \xe2\x80\x94'"  # raw string, so the backslashes stay literal as they are in the file

# Parse the string as a Python literal, creating a `bytes` object
s_bytes = ast.literal_eval(s)

# Now decode the `bytes` as normal
s_fixed = s_bytes.decode()

Otherwise you will have to manually parse and remove or replace the offending repr'ed escapes.
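
As a rough sketch of what that manual route could look like, assuming the file holds exactly one `str()` of a `bytes` object and that the original bytes were UTF-8: strip the `b'...'` wrapper, then undo the escapes with the standard `unicode_escape` and `latin-1` codecs. The function name `unrepr_bytes` is hypothetical.

```python
def unrepr_bytes(bad: str) -> str:
    """Undo str(bytes_obj) without evaluating the input as Python source."""
    text = bad.strip()
    # Expect the b'...' or b"..." wrapper that str() puts around bytes.
    if len(text) < 3 or text[0] != "b" or text[1] not in "'\"" or text[-1] != text[1]:
        raise ValueError("not the repr of a bytes object")
    inner = text[2:-1]
    # The repr of bytes is pure ASCII. unicode_escape turns each \xNN escape
    # into the code point NN, and latin-1 maps code points 0-255 back to the
    # byte with the same value, which rebuilds the original bytes.
    raw = inner.encode("ascii").decode("unicode_escape").encode("latin-1")
    return raw.decode("utf-8")


fixed = unrepr_bytes(r"b'The em-dash: \xe2\x80\x94'")
```

This only validates the outer wrapper, so it is less strict than `ast.literal_eval` about what sits between the quotes.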

shadowtalker
-2

This code works correctly on my computer. If you still get an error, this may help:

with open('foo.txt', 'r', encoding="utf-8") as f:
    print(f.read())
ozcanyarimdunya