I have a very strange data file, and I have no idea how to loop through its keys. Here is the file:

(The file is generated by an API server. I have no way to change the input.)

{'client': <object at 0xc0>, 'store': {'name': 'test', 'number': 7, 'modified': '2020-09-11T00:32:56Z', 'id': '0833-f780'}, 're': re.compile('^(http://mysite.tesdt.com)/(.+)$')}

I am trying to extract 'number' from the data, but there seems to be no way. I have tried json.loads, eval(data), and various other combinations to convert it to a native Python dict. As you can see below, none of these attempts worked:

Try #1:

file = "file.json"
data = file.read() 
parsed = json.loads(data)
print(data)

Error:

AttributeError: 'str' object has no attribute 'read'

Try #2:

with open("file.json", "r") as f:
    data = f.read()
    d = ast.literal_eval(data)
    print(d)

Error:

    {'client': <object at 0xc0>, 'store': {'name': 'test', 'number': 7, 'modified': '2020-09-11T00:32:56Z', 'id': '0833-f780'}, 're': re.compile('^(http://mysite.tesdt.com)/(.+)$')}
               ^
SyntaxError: invalid syntax

Try #3:

with open("file.json", "r") as f:
    data = f.read()
    data = data.replace("'", '"')
    print(data)
    js = json.loads(data)
    print(js)

Error:

json.decoder.JSONDecodeError: Expecting value: line 1 column 12 (char 11)

Try #4:

with open("file.json", "r") as f:
    data = f.read()
    data = str(data)
    print(json.dumps(data))
    js = json.loads(data)
    print(js)

Error:

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
wjandrea
user15109593
  • No point mentioning Try #1 at all since you simply forgot to call `open()` or use `.load()` instead of `.loads()`. – wjandrea Aug 30 '22 at 14:01
  • this string is certainly not valid json, so I wouldn't try to use json here – Nik Aug 30 '22 at 14:04
  • Where did you get the file? Why do you need to evaluate it? Seems like `` will be meaningless without any context. – wwii Aug 30 '22 at 14:05
  • Im just trying everything to make it work..so I just used different things – user15109593 Aug 30 '22 at 14:06
  • The problem here seems to be that `ast` is unable to parse the values of the `client` and `re` keys. Where did you get this input? Can it be changed to a standard serialization format like pickle or json? Blindly throwing things at the problem to see what sticks is hardly a useful solution – Pranav Hosangadi Aug 30 '22 at 14:07
  • The input is coming from an internal resource. It's an API server. I just changed the values for privacy. But it is not possible to change it. – user15109593 Aug 30 '22 at 14:08
  • FWIW, `eval()` will work if you replace the value of the `client` key with an actual python expression, but `eval` is really a last resort – Pranav Hosangadi Aug 30 '22 at 14:09
  • I don't really mind which method to use. All I need is the number. – user15109593 Aug 30 '22 at 14:11
  • This isn't even a data file, it's the representation of a Python dict. It's not valid Python source code (which means there's no way to get AST to parse it). So either you're using the API incorrectly or it's returning invalid data. Possibly this is an [XY problem](https://meta.stackexchange.com/q/66377/343832) and you should actually be focusing on why the data is broken. Like maybe you're retrieving the data and printing it to a file instead of serializing it, like with Pickle. – wjandrea Aug 30 '22 at 14:15
  • The problem is at the other end. I would probably just search for 'number': , – Kenny Ostrom Aug 30 '22 at 14:20
  • if you really just want the number you can read the file as text and then regex search for the string between "number: " and the next comma (as long as the value of number can't have a comma) – Nik Aug 30 '22 at 14:21
  • This data file has too many attributes. I just made it simple here. Otherwise, I need lot of data from the input. – user15109593 Aug 30 '22 at 14:25
  • BTW, why does it say just "object"? I'm not sure if I've seen something like that before. If you do `object()` or `zip()` for example, you get "object object" or "zip object". – wjandrea Aug 30 '22 at 14:26
  • That's what I thought you might say. Yeah, I guess you're going to have to fix the repr to reconstruct the actual dict, like Pranav is trying to do. – Kenny Ostrom Aug 30 '22 at 14:31

2 Answers

0

That is not JSON; it looks like the repr of a Python dictionary. eval would work if not for the <object at 0xc0> portion. You could turn that portion into a quoted string and then try eval.

Note that eval is quite unsafe, and only acceptable if you control where the input comes from, and are sure it won't contain anything malicious.

>>> import re

>>> data = """{'client': <object at 0xc0>, 'store': {'name': 'test', 'number': 7, 'modified': '2020-09-11T00:32:56Z', 'id': '0833-f780'}, 're': re.compile('^(http://mysite.tesdt.com)/(.+)$')}"""


>>> data_cleaned = re.sub(r"(<[^>]+>)", r"'\1'", data)
>>> data_cleaned
"{'client': '<object at 0xc0>', 'store': {'name': 'test', 'number': 7, 'modified': '2020-09-11T00:32:56Z', 'id': '0833-f780'}, 're': re.compile('^(http://mysite.tesdt.com)/(.+)$')}"

The regex (<[^>]+>) matches and captures anything between < and >, and the re.sub call encloses it in quotes to make it a string.
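The choice of backreference matters here. As a small sketch (the variable names and the shortened sample string are illustrative, not part of the original answer): in an `re.sub` replacement template, `\1` inserts the first capture group and `\g<0>` inserts the whole match, whereas `\0` is read as an octal escape and inserts a null character, which `eval` then refuses to parse.

```python
import re

# Shortened, hypothetical sample in the same shape as the file contents.
s = "{'client': <object at 0xc0>}"

# \1 refers to the first capture group, so the placeholder gets quoted.
with_group_1 = re.sub(r"(<[^>]+>)", r"'\1'", s)

# \g<0> refers to the whole match, so no capture group is needed.
with_whole_match = re.sub(r"<[^>]+>", r"'\g<0>'", s)

# \0 is NOT "the whole match": it is an octal escape for the null character.
with_backslash_zero = re.sub(r"(<[^>]+>)", r"'\0'", s)

print(with_group_1)                   # {'client': '<object at 0xc0>'}
print(with_group_1 == with_whole_match)  # True
print("\x00" in with_backslash_zero)  # True
```

This is a likely explanation for a "source code string cannot contain null bytes" error when the substitution is written with `\0`.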

>>> d = eval(data_cleaned)
>>> d
{'client': '<object at 0xc0>',
 'store': {'name': 'test',
  'number': 7,
  'modified': '2020-09-11T00:32:56Z',
  'id': '0833-f780'},
 're': re.compile(r'^(http://mysite.tesdt.com)/(.+)$', re.UNICODE)}

>>> d['store']['number']
7

Of course, if all you care about is the value of number, then just do:

>>> number = [float(x) for x in re.findall(r"'number': (\d+\.?\d*)", data)]
>>> number[0]
7.0
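If the security concerns around eval are a problem, one possible variation is to also reduce the `re.compile(...)` call to its pattern string, so that only literals remain and `ast.literal_eval` can parse the result. This is a minimal sketch, not a definitive implementation: it assumes every `re.compile` call in the data has exactly one single-quoted argument with no flags, and that `<` and `>` appear only in object placeholders. The real file may not satisfy those assumptions.

```python
import ast
import re

data = (
    "{'client': <object at 0xc0>, 'store': {'name': 'test', 'number': 7, "
    "'modified': '2020-09-11T00:32:56Z', 'id': '0833-f780'}, "
    "'re': re.compile('^(http://mysite.tesdt.com)/(.+)$')}"
)

# Step 1: turn <...> placeholders into quoted strings.
cleaned = re.sub(r"<[^>]+>", lambda m: repr(m.group(0)), data)

# Step 2 (assumption: one single-quoted argument, no flags):
# replace re.compile('pattern') with just 'pattern'.
cleaned = re.sub(r"re\.compile\(('(?:[^'\\]|\\.)*')\)", r"\1", cleaned)

# Only literals are left, so literal_eval can build the dict without eval.
d = ast.literal_eval(cleaned)

print(d["store"]["number"])  # 7
print(d["re"])               # ^(http://mysite.tesdt.com)/(.+)$
```

The trade-off is that the `client` and `re` values come back as plain strings rather than objects, which is fine if only the `store` data is needed.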
Pranav Hosangadi
  • Traceback (most recent call last): File "test.py", line 5, in data_cleaned = re.sub(r"<([^>]+)>", r"'\1'", data) File "/usr/lib64/python3.7/re.py", line 194, in sub return _compile(pattern, flags).sub(repl, string, count) TypeError: expected string or bytes-like object – user15109593 Aug 30 '22 at 14:15
  • `data` needs to contain the contents of your file. See https://www.tutorialkart.com/python/python-read-file-as-string/ @user15109593 – Pranav Hosangadi Aug 30 '22 at 14:17
  • File "test.py", line 8, in d = eval(data_cleaned) ValueError: source code string cannot contain null bytes – user15109593 Aug 30 '22 at 14:20
  • @user15109593 apologies, I was playing with the `re.sub` call and forgot to change it back. It needs to be `data_cleaned = re.sub(r"(<[^>]+>)", r"'\1'", data)` – Pranav Hosangadi Aug 30 '22 at 14:21
  • it prints the whole data, with lot of space at the end. like 20 lines. And then SyntaxError: invalid syntax – user15109593 Aug 30 '22 at 14:23
  • This is the whole code: https://www.online-python.com/Wh0kPlQVHv – user15109593 Aug 30 '22 at 14:23
  • @user15109593 like I said in my comment, the `re.sub` call needs to substitute `\1`. Also I can't run that code because it can't find the file – Pranav Hosangadi Aug 30 '22 at 14:24
  • Tried with both 0 and 1 – user15109593 Aug 30 '22 at 14:25
  • with 0 I get: ValueError: source code string cannot contain null bytes ----- with 1 I get lot of spaces at the end..with invalid syntax – user15109593 Aug 30 '22 at 14:26
  • I don't get that with the input you've provided. Please edit your question to provide an example input that is representative of your actual input. – Pranav Hosangadi Aug 30 '22 at 14:27
  • Can you please use your code on your machine? and use the data I provided as a json or text file? not with interactive python? – user15109593 Aug 30 '22 at 14:28
  • Also consider ast.literal_eval to avoid those security concerns. https://stackoverflow.com/questions/15197673/using-pythons-eval-vs-ast-literal-eval – Kenny Ostrom Aug 30 '22 at 14:33
  • @user15109593 The code remains _exactly_ the same when you do it in a python script. I only showed it as an interactive session to demonstrate what each line does. Before you do this, you'll have to read the contents of the file into a variable called `data`, which is covered extensively in other tutorials / questions, _and you've done it yourself in your tries #2, 3, and 4_, so I didn't repeat it here. – Pranav Hosangadi Aug 30 '22 at 14:53
0

This is quite dirty, but if you really just want the value of number, and the value sits between 'number': and the next comma, you could do this:

import re


with open("file.json", "r") as f:
    s = f.read()

result = re.search(r"'number': (.*?),", s)
r = result.group(1)

print(r)

You might need checks for all sorts of cases, e.g. 'number' is not in your text, or the value of 'number' has a comma in it.
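One of those checks can be sketched as follows. The helper name `extract_number` and the sample strings are hypothetical, and the sketch assumes the value is a plain integer; it returns None instead of raising when the key is missing.

```python
import re


def extract_number(text):
    """Return the first 'number' value as an int, or None if it is absent."""
    # Match digits only, so a trailing comma or brace is never captured.
    match = re.search(r"'number':\s*(-?\d+)", text)
    return int(match.group(1)) if match else None


sample = "{'store': {'name': 'test', 'number': 7, 'id': '0833-f780'}}"
print(extract_number(sample))           # 7
print(extract_number("{'store': {}}"))  # None
```

Matching digits rather than "anything up to the next comma" also covers the case where 'number' is the last key in its dict and is followed by a closing brace instead of a comma.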

Does someone know how to improve the regex such that it captures the text before the next comma?

Nik
  • The input you see here is just for 1 device. There is 100 more devices like this which are all in 1 file. So I need a robust way to loop through all data. I just said I need the number but there is more data I need to extract. Just for simplicity I said Number only. – user15109593 Aug 30 '22 at 14:28
  • *"Does someone [know] how to improve the regex such that it captures the text before the next comma?"* -- Use non-greedy matching: `r"'number': (.*?),"` – wjandrea Aug 30 '22 at 14:35