Parse JSON structures in a txt file containing JSON and text structures

Question

I have a txt file with JSON structures. The problem is that the file contains not only JSON structures but also raw text such as log lines:

2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 : 
{
"name": "1111",
"results": [{
    "filename": "xxxx",
    "numberID": "7412"
}, {
    "filename": "xgjhh",
    "numberID": "E52"
}]
}

2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "jfkjgjkf",
"results": [{
    "filename": "hhhhh",
    "numberID": "478962"
}, {
    "filename": "jkhgfc",
    "number": "12544"
}]
}

I read the .txt file, but when I try to parse the JSON structures I get an error. IN:

import json
with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
   json_data = json.load(f)

OUT : json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)

I would like to parse the JSON and save it as a CSV file.

Does every non-json line start with a date and time? You could use a regex to find all the lines that start with `"{number}-{number}-{number} "` and pass all the lines between those to [`json.loads()`](https://docs.python.org/3/library/json.html#json.loads) — Boris Verkhovskiy, Jan 30 '19 at 17:41
The first line is a datetime, and between JSON structures each line starts with the character | followed by a datetime, like this: |2019-01-18 21:00:11.7022|INFO|Technical|Got Entity Profile for 372245 in 0.43s| 2019-01-18 21:00:11.8897|INFO|Technical|Got the following profile for the entity 372514: { and then another JSON structure starts ... — Ib D, Jan 30 '19 at 17:46

blhsing · Answer 1 · 2019-01-30T18:25:23.793

A more general solution for parsing a file with JSON objects mixed with other content, without making any assumptions about the non-JSON content, is to split the file content into fragments at the curly brackets. Start with the first fragment that is an opening curly bracket, then join the following fragments one by one until the joined string parses as JSON:

import json
import re

with open("data.txt", encoding="utf-8", errors="ignore") as f:
    fragments = iter(re.split('([{}])', f.read()))
while True:
    try:
        while True:
            candidate = next(fragments)
            if candidate == '{':
                break
        while True:
            candidate += next(fragments)
            try:
                print(json.loads(candidate))
                break
            except json.decoder.JSONDecodeError:
                pass
    except StopIteration:
        break

This outputs:

{'name': '1111', 'results': [{'filename': 'xxxx', 'numberID': '7412'}, {'filename': 'xgjhh', 'numberID': 'E52'}]}
{'name': 'jfkjgjkf', 'results': [{'filename': 'hhhhh', 'numberID': '478962'}, {'filename': 'jkhgfc', 'number': '12544'}]}
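
A related approach, sketched here as an alternative rather than as part of the answer above, is to let the standard library find the end of each object. `json.JSONDecoder.raw_decode` parses one JSON value starting at a given index and returns the value together with the index where it stopped, so there is no need to re-parse a growing string. The function name `iter_json_objects` is made up for this example.

```python
import json

def iter_json_objects(text):
    """Yield every JSON object found in text, skipping anything else."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Jump to the next opening brace; stop when there are none left.
        pos = text.find('{', pos)
        if pos == -1:
            return
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # This brace did not start a valid object; keep scanning.
            pos += 1
        else:
            yield obj
            pos = end  # continue right after the parsed object
```

With the file from the question, `list(iter_json_objects(open("data.txt").read()))` should give the same two dictionaries as above. Braces in the log text are tolerated as long as they do not happen to start valid JSON.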

score 1 · Answer 2 · answered Jan 30 '19 at 17:52

You could do one of several things:

On the command line, remove all lines where, say, "|INFO|Technical|" appears (assuming this appears in every line of raw text). The | characters are left unescaped because in sed's basic regular expressions a plain | is a literal, while \| means alternation in GNU sed:
sed -i '' -e '/|INFO|Technical|/d' yourfilename (if on Mac),
sed -i '/|INFO|Technical|/d' yourfilename (if on Linux).
Move these raw lines into their own JSON fields.

yaken

I would imagine using `sed` as a preprocessing step before reading the json in python is going to be significantly more performant than doing everything in python. – PMende Jan 30 '19 at 18:19
By removing all known non-JSON content, the rest of the file would not be parsable as a valid JSON object because it would actually become several JSON objects. Also note that the concept would only work if the non-JSON content follows a known and consistent pattern. – blhsing Jan 30 '19 at 18:31
You still have to read multiple json objects from a single file, which is [kind of a pain](https://stackoverflow.com/questions/27907633/multiple-json-objects-in-one-file-extract-by-python/27907893). – Boris Verkhovskiy Jan 30 '19 at 18:32

Chris Larson · Accepted Answer · 2019-01-30T22:42:00.773

This solution strips out the non-JSON lines and wraps the remaining JSON objects in a single containing JSON structure. The code comes first, followed by an explanation:

import json

with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    cleaned = ''.join([item.strip() if item.strip() != '' else '-split_here-' for item in f.readlines() if '|INFO|' not in item]).split('-split_here-')

json_data = json.loads('{"entries":[' + ''.join([entry + ', ' for entry in cleaned if entry])[:-2] + ']}')

The string passed to json.loads:

{"entries":[{"name": "1111","results": [{"filename": "xxxx","numberID": "7412"}, {"filename": "xgjhh","numberID": "E52"}]}, {"name": "jfkjgjkf","results": [{"filename": "hhhhh","numberID": "478962"}, {"filename": "jkhgfc","number": "12544"}]}]}

What's going on here?

In the cleaned = ... line, we're using a list comprehension that creates a list of the stripped lines in the file (f.readlines()) that do not contain the string |INFO|, and puts the string -split_here- into the list wherever there's a blank line (where .strip() yields '').

Then, we're joining that list of lines (''.join()) into a single string.

Finally, we're splitting that string (.split('-split_here-')) into a list of strings, one per JSON structure, using the blank lines in data.txt as the separators.

In the json_data = ... line, we're appending a ', ' to each of the JSON structures using a list comprehension. The if entry filter skips empty strings, which appear when the file ends with a blank line or has several blank lines in a row.

Then, we join that list back into a single string and strip off the last ', ' (''.join(...)[:-2]; the [:-2] slices the last two characters off the string).

We then wrap the string with '{"entries":[' and ']}' to make the whole thing a valid JSON structure, and feed it to json.loads to load your data as a Python object. (Passing the string through json.dumps first would only re-encode it as a JSON string, and json.loads would then hand back the same string instead of a dictionary.)
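
The question also asks for a CSV file, which none of the code above produces. Here is a minimal sketch of that last step, assuming every parsed object has a "name" and a "results" list shaped like the sample data. The column list and the function name `entries_to_csv` are assumptions made for this example, not something from the original post.

```python
import csv
import io

# Assumed column set, taken from the keys that appear in the sample data.
FIELDNAMES = ["name", "filename", "numberID", "number"]

def entries_to_csv(entries, out):
    """Write one CSV row per item in each entry's "results" list."""
    writer = csv.DictWriter(
        out, fieldnames=FIELDNAMES, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for entry in entries:
        for result in entry.get("results", []):
            # Repeat the parent "name" on every row; missing keys become "".
            writer.writerow({"name": entry.get("name", ""), **result})

# Demo on an in-memory buffer. For a real file, pass
# open("data.csv", "w", newline="", encoding="utf-8") instead.
sample = [
    {"name": "1111", "results": [{"filename": "xxxx", "numberID": "7412"}]},
    {"name": "jfkjgjkf", "results": [{"filename": "jkhgfc", "number": "12544"}]},
]
buffer = io.StringIO()
entries_to_csv(sample, buffer)
print(buffer.getvalue())
```

Pass in whatever list of parsed dictionaries you ended up with. Keys outside the assumed column list are silently dropped because of extrasaction="ignore", so extend FIELDNAMES if your real data has more fields.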

Boris Verkhovskiy · Answer 4 · 2019-01-30T20:00:26.697

Use the "text structures" as a delimiter between JSON objects.

Iterate over the lines in the file, saving them to a buffer until you encounter a line that is a text line, at which point parse the lines you've saved as a JSON object.

import re
import json

def is_text(line):
    # returns True if line starts with a date and time in "YYYY-MM-DD HH:MM:SS" format
    line = line.lstrip('|') # you said some lines start with a leading |, remove it
    # raw string so that \d reaches the regex engine without an invalid-escape warning
    return re.match(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})", line)

json_objects = []

with open("data.txt") as f:
    json_lines = []

    for line in f:
        if not is_text(line):
            json_lines.append(line)
        else:
            # if there's multiple text lines in a row json_lines will be empty
            if json_lines:
                json_objects.append(json.loads("".join(json_lines)))
                json_lines = []

    # we still need to parse the remaining object in json_lines
    # if the file doesn't end in a text line
    if json_lines:
        json_objects.append(json.loads("".join(json_lines)))

print(json_objects)

Repeating the logic in the last two lines is a bit ugly, but you need to handle the case where the last line in your file is not a text line: when the for loop finishes, you still need to parse the last object sitting in json_lines, if there is one.

I'm assuming there's never more than one JSON object between text lines. Also, my regex for a date will break in 8,000 years.
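
One way to avoid the repeated logic, sketched here as a variation and not as the code above, is `itertools.groupby`: it splits the lines into alternating runs of text lines and non-text lines, so the trailing run needs no special case. The names `DATE_PREFIX` and `parse_json_blocks` are made up for this example, and it still assumes at most one JSON object between text lines.

```python
import json
import re
from itertools import groupby

# Optional leading pipes, then a "YYYY-MM-DD HH:MM:SS" timestamp.
DATE_PREFIX = re.compile(r"\|*\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def is_text(line):
    """Return True if the line looks like a log line."""
    return DATE_PREFIX.match(line) is not None

def parse_json_blocks(lines):
    """Parse each run of consecutive non-text lines as one JSON object."""
    objects = []
    for text_run, group in groupby(lines, key=is_text):
        if text_run:
            continue  # a run of log lines acts only as a delimiter
        chunk = "".join(group)
        if chunk.strip():  # ignore runs made up of blank lines only
            objects.append(json.loads(chunk))
    return objects
```

Since a file object iterates over its lines, `parse_json_blocks(f)` inside a `with open("data.txt") as f:` block should work directly.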

score -1 · Answer 5 · answered Jan 30 '19 at 17:46

You could count the curly brackets in the file to find where each JSON object begins and ends, and store the parsed objects in a list (here, found_jsons).

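import json

# read the whole file; "data.txt" is the file name used in the question
with open("data.txt", encoding="utf-8", errors="ignore") as f:
    content = f.read()

open_chars = 0
saved_content = []

found_jsons = []

for i in content.splitlines():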
    open_chars += i.count('{')

    if open_chars:
        saved_content.append(i)

    open_chars -= i.count('}')


    if open_chars == 0 and saved_content:
        found_jsons.append(json.loads('\n'.join(saved_content)))
        saved_content = []


for i in found_jsons:
    print(json.dumps(i, indent=4))

Output

{
    "results": [
        {
            "numberID": "7412",
            "filename": "xxxx"
        },
        {
            "numberID": "E52",
            "filename": "xgjhh"
        }
    ],
    "name": "1111"
}
{
    "results": [
        {
            "numberID": "478962",
            "filename": "hhhhh"
        },
        {
            "number": "12544",
            "filename": "jkhgfc"
        }
    ],
    "name": "jfkjgjkf"
}

Note that this will work only if the non-JSON lines do not contain curly brackets. — blhsing, Jan 30 '19 at 17:49
This will also break if the JSON contains a string with an unmatched curly bracket, which [is possible in the `filename` field](https://unix.stackexchange.com/questions/230291/what-characters-are-valid-to-use-in-filenames). — Boris Verkhovskiy, Jan 30 '19 at 19:53
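
The second limitation can be worked around by counting brackets character by character and ignoring any that sit inside a JSON string. The following is only a sketch of that idea under the same first assumption (the non-JSON text contains no curly brackets); the function name `extract_json_objects` is made up for this example.

```python
import json

def extract_json_objects(text):
    """Yield parsed top-level objects, ignoring braces inside JSON strings."""
    depth = 0          # current nesting level of curly brackets
    start = None       # index of the opening bracket of the current object
    in_string = False  # True while inside a double-quoted JSON string
    escaped = False    # True right after a backslash inside a string
    for pos, ch in enumerate(text):
        if depth == 0:
            # Outside any object: only an opening bracket matters.
            if ch == '{':
                depth, start = 1, pos
            continue
        if in_string:
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                yield json.loads(text[start:pos + 1])
```

A curly bracket in the log text would still start a bogus object and make json.loads raise, so this only removes the problem with brackets inside string values.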
