0

I have a text file that has data in the following shape:

{"id": 1, {"device_data": 123}, "created_at": "2020-04-03"}{"id": 2, {"device_data": 123}, "created_at": '2020-04-03'}{"id": 2, {"device_data": 123}, "created_at": "2020-04-03"}{"id": 2, {"device_data": 123}, "created_at": '2020-04-03'}

There aren't any \n delimiters or , delimiters that I can use here. I'd like to parse this into a list of dictionaries in order to load the data into a dataframe.

I've tried parsing this using .split() and a list comprehension, like this:

lst = [x + '}' for x in data.split('}') if x != '']

But this obviously breaks for records that have nested objects.

I also tried doing this with regex but I'm struggling to figure out the appropriate way. This is what I have so far:

re.split(r'(\{(.*)\})', data)

Per the suggestions below, I also attempted making use of the json library.

with open('path/to/file', 'r') as f:
    res = json.load(f)

However, this raised JSONDecodeError: Extra data. I believe this is because the file contains multiple JSON objects back to back rather than a single JSON document.

I wanted to use the json.load() command with a for loop, but then I ran into trouble figuring out how to properly split the file contents.

Does anyone have a suggestion for how to approach this kind of problem?

genhernandez

3 Answers

2

Regex does not handle nested formats like this effectively.

This looks a bit like JSON, and Python has the built-in json module, which could help. To use it on this data, you'll first need to convert the single quotes to double quotes: data_string.replace("'", '"'). But the format is probably still different enough from JSON to be a problem, since the nested objects have no keys.

If you know what generated the data, that may help you figure out what will parse the data. Otherwise, this answer explains how to parse nested expressions manually.
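As a rough illustration of the manual approach, here is a minimal sketch that walks the string once and tracks brace depth, so nested objects stay inside their record. The function name split_top_level is made up for this example, and it assumes braces are balanced and that string values contain no escaped quotes.

```python
def split_top_level(text):
    """Yield each top-level {...} chunk of text as a string.

    Hypothetical helper: assumes balanced braces and no escaped quotes.
    """
    depth = 0      # current brace nesting level
    start = None   # index where the current top-level chunk began
    quote = None   # the open quote character while inside a string, else None
    for i, ch in enumerate(text):
        if quote:
            # Inside a string literal: braces here are data, not structure.
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch
        elif ch == '{':
            if depth == 0:
                start = i
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]


sample = ('{"id": 1, {"device_data": 123}, "created_at": "2020-04-03"}'
          '{"id": 2, {"device_data": 123}, "created_at": \'2020-04-03\'}')
chunks = list(split_top_level(sample))
```

This only separates the records. Each chunk is still a string that needs its own parsing step.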

dolay
1

Your data sort of looks like JSON, but with single instead of double quotes.

If that is the case, I would first suggest changing your data (if possible) to use valid JSON, and then you can easily do:

myfile.json:

{ "foo": 42 }
import json

with open('myfile.json') as f:
  obj = json.load(f)

print(obj) # {'foo': 42}

Then obj is a regular Python dictionary you can use as normal.
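If the file holds several JSON documents back to back, json.load raises "Extra data" because it expects exactly one document. A minimal sketch of one way around that uses json.JSONDecoder.raw_decode, which parses one document and reports where it stopped. It assumes every record is valid JSON on its own, which the sample in the question is not, so the "device" key and the helper name iter_json_documents are invented for illustration.

```python
import json


def iter_json_documents(text):
    """Yield each JSON document found back to back in text (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Returns the parsed object and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Assumed input: valid JSON records with a keyed nested object.
sample = ('{"id": 1, "device": {"device_data": 123}}'
          '{"id": 2, "device": {"device_data": 456}}')
docs = list(iter_json_documents(sample))
```

The resulting list of dictionaries can then be passed on to whatever builds the dataframe.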

If you can't use double-quoted JSON, you could possibly refer to this question about parsing single-quoted JSON.
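One commonly suggested option for single-quoted data is ast.literal_eval, which evaluates a string as a Python literal. This is only a sketch under the assumption that each record happens to be a valid Python dict literal. It would not accept the keyless nested object from the question, nor JSON-only tokens such as true or null.

```python
import ast

# Assumed input: one single-quoted record that is a valid Python dict literal.
record = "{'id': 2, 'created_at': '2020-04-03'}"

# literal_eval only accepts literals, so it does not execute arbitrary code.
parsed = ast.literal_eval(record)
```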

MHebes
1

Your record separator is:

}{

so given

txt="{'id': 1, {'device_data': 123}, 'created_at': '2020-04-03'}{'id': 2, {'device_data': 123}, 'created_at': '2020-04-03'}{'id': 2, {'device_data': 123}, 'created_at': '2020-04-03'}{'id': 2, {'device_data': 123}, 'created_at': '2020-04-03'}"

split into records with:

records=txt.split('}{')

The results look like:

records[0]="{'id': 1, {'device_data': 123}, 'created_at': '2020-04-03'"
records[1]="'id': 2, {'device_data': 123}, 'created_at': '2020-04-03'"

and parse the records into dictionaries with:

mydictlist = []
for record in records:
    # strip the braces and quotes left over from the split
    record = record.replace('{', '').replace('}', '').replace("'", '')
    # split on the first ':' only, so values that contain ':' stay intact
    mydict = dict((k.strip(), v.strip()) for k, v in
                  (item.split(':', 1) for item in record.split(',')))
    mydictlist.append(mydict)

Example result looks like:

mydictlist[2] = {'id': '2', 'device_data': '123', 'created_at': '2020-04-03'}
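If you would rather keep the braces (and with them the nested objects) on each record, one possible variation is to split on the empty position between } and { with a regex lookbehind and lookahead. This is a sketch that assumes Python 3.7 or later, where re.split accepts empty matches, and that }{ never occurs inside a string value or between two adjacent nested objects.

```python
import re

txt = ("{'id': 1, {'device_data': 123}, 'created_at': '2020-04-03'}"
       "{'id': 2, {'device_data': 123}, 'created_at': '2020-04-03'}")

# Split at the zero-width boundary between '}' and '{', consuming nothing,
# so every record keeps its own opening and closing brace.
records = re.split(r'(?<=\})(?=\{)', txt)
```

Each element of records is then a complete {...} string that still needs its own parsing step.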
Paul Smith
Thanks a million. I went with a diff approach since I wanted to keep the nested objects but the key to getting there was realizing I could use `}{` as my delimiter here. I really appreciate your thoughtful response! – genhernandez May 29 '20 at 23:50