
I have string data that looks like the bytes repr of JSON in Python:

>>> data = """b'{"a": 1, "b": 2}\n'"""

Inside that string is valid JSON that looks like it has been byte-encoded. I want to decode the bytes and load the JSON inside, but since it's a string, I cannot.

>>> data.decode() # nope
AttributeError: 'str' object has no attribute 'decode'

Encoding the string doesn't seem to help either:

>>> data.encode() # wrong
b'b\'{"a": 1, "b": 2}\n\''

There are oodles of string-to-bytes questions on Stack Overflow, but for the life of me I cannot find anything about this particular issue. Does anyone know how this can be accomplished?

Things that I do not want to do and/or will not work:

  1. eval the data into a bytes object
  2. strip the b and \n (my JSON contains all sorts of other escaped data).

This is the only working solution I have found, and there is a lot not to like about it:

from ast import literal_eval

data = """b'{"a": 1, "b": 2}\n'"""
print(literal_eval(data[:-2] + data[-1:]).decode('utf-8'))
Nolan Conaway
  • 1
    "I have string data that look like bytes reprs of JSON in Python" - that sounds like a bug you should fix on the producing end. – user2357112 Dec 16 '20 at 19:14
  • 1
    I wish i could solve it on that end! These are actually airflow logs vis structlog that i must analyze – Nolan Conaway Dec 16 '20 at 19:18
  • 1
    "These are actually airflow logs vis structlog that i must analyze" - in the future, you can probably configure structlog to give you something more useful. Avoiding textual log parsing is supposed to be one of the primary goals of structlog. – user2357112 Dec 16 '20 at 19:23
  • 2
    Anyway, `ast.literal_eval`. There's probably a good dupe target somewhere around here. – user2357112 Dec 16 '20 at 19:27
  • Yes, the ast idea looks better. If you can find a partial eval for this... – mkiever Dec 16 '20 at 19:31
  • ast _will_ work but the data need to be sanitized first (note that newline breaks everything) – Nolan Conaway Dec 16 '20 at 19:37
  • 2
    The weird slicing you had to do in your `literal_eval` attempt is almost certainly due to a bug you introduced while attempting to write a string literal for `data` - you've got an actual newline in the middle of your bytes literal, which is invalid syntax for a bytes literal. You probably meant for that to be an actual backslash and n - that, or the newline was supposed to be outside the bytes literal. – user2357112 Dec 16 '20 at 19:39
  • 2
    `data = r"""b'{"a": 1, "b": 2}\n'"""` is likely more representative of the actual kinds of values you're working with. If it's *not*, then that's going to be an issue. – user2357112 Dec 16 '20 at 19:41
  • 1
    If the text in your log file looks like `b'{"a": 1, "b": 2}\n'`, then you can `ast.literal_eval` that with no issue. If the text in your log file looks like `b'{"a": 1, "b": 2}` and then `'` on the next line, then your string literal accurately represents that, and you're going to have issues. – user2357112 Dec 16 '20 at 19:52
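
To illustrate the point made in the comments, here is a minimal sketch. It assumes the log text really contains a backslash followed by `n` (two characters, as `repr()` of a bytes object would produce), not an actual line break. The helper name `parse_bytes_repr` is hypothetical.

```python
import ast
import json


def parse_bytes_repr(text):
    """Turn the text of a bytes repr (e.g. b'{"a": 1}\\n') into parsed JSON."""
    raw = ast.literal_eval(text)  # evaluates literals only, never arbitrary code
    if not isinstance(raw, bytes):
        raise ValueError("expected the repr of a bytes object")
    return json.loads(raw)  # json.loads accepts bytes directly (Python 3.6+)


# Raw string: the \n stays as backslash + n, matching what repr() writes out.
data = r"""b'{"a": 1, "b": 2}\n'"""
print(parse_bytes_repr(data))

# Without the r prefix, the string holds a real newline inside the quotes,
# which is not a valid bytes literal, so literal_eval rejects it.
broken = """b'{"a": 1, "b": 2}\n'"""
try:
    parse_bytes_repr(broken)
except (SyntaxError, ValueError) as exc:
    print("could not parse:", type(exc).__name__)
```

If the real log lines look like the raw-string version, no slicing is needed before calling `literal_eval`.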

1 Answer

I know you said you didn't want to strip the b from the string because of other escaped data, but can't we assume that whatever generated this only output ASCII (hence the b), so that we can simply re-encode it? I was thinking you could use a simple regexp (https://regex101.com/r/M0ratk/1) to extract the contents, which you then encode as bytes.

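import json
import re

data = """b'{"a": 1, "b": 2}\n'"""  # the string from the question

# Capture everything between the leading b' and the trailing '.
match = re.match(r"\Ab'(.*)'\Z", data, re.DOTALL)
if match is None:
    raise ValueError("data does not look like a bytes repr")
data = json.loads(bytes(match[1], 'ascii'))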

Will this work? I am not sure how it compares to the literal_eval solution.
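
One way to compare the two approaches is to try both on a payload that contains escaped data. The sketch below is only an illustration under assumptions: the payload (a JSON string containing a single quote) is made up, and the function names `via_literal_eval` and `via_regex` are hypothetical. The text is built with `repr()` so that it has the same shape a bytes repr would have in a log.

```python
import ast
import json
import re

# Hypothetical payload with a single quote, so repr() has to escape it as \'.
payload = json.dumps({"msg": "it's"}).encode("ascii") + b"\n"
text = repr(payload)  # text form of the bytes, with backslash escapes written out


def via_literal_eval(s):
    # literal_eval undoes the repr escapes before the JSON parser sees the data.
    return json.loads(ast.literal_eval(s))


def via_regex(s):
    # The regex only strips the b'...' wrapper; repr escapes stay in the text.
    match = re.match(r"\Ab'(.*)'\Z", s, re.DOTALL)
    if match is None:
        raise ValueError("not a bytes repr")
    return json.loads(match[1].encode("ascii"))


print(via_literal_eval(text))

try:
    via_regex(text)
except json.JSONDecodeError as exc:
    # The leftover \' is not a valid JSON escape sequence.
    print("regex approach failed:", exc)
```

So the regex version seems fine when the bytes contain nothing that `repr()` would escape, while `literal_eval` also covers the escaped cases.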

Ryan