How to solve a problem decoding from a wrong JSON format

Question

Hi everyone. I need help opening and reading a file.

I have this txt file: https://yadi.sk/i/1TH7_SYfLss0JQ

It is a dictionary

{"id0":"url0", "id1":"url1", ..., "idn":"urln"}

But it was written into a txt file using json, in append mode.

#This is how I dump the data into a txt    
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))

So, the file structure is {"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}

And it is all one string.

I need to open it, check for repeated IDs, delete them, and save it again.

But json.loads raises ValueError: Extra data

Tried these: "How to read line-delimited JSON from large file (line by line)", "Python json.loads shows ValueError: Extra data", and "json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)".

But I still get that error, just in a different place.

Right now I got as far as:

with open('111111111.txt', 'r') as log:
    before_log = log.read()
before_log = before_log.replace('}{',', ').split(', ')

mu_dic = []
for i in before_log:
    mu_dic.append(i)

This eliminates the problem of several {}{}{} dictionaries/JSON objects in a row.

Maybe there is a better way to do this?

P.S. This is how the file is made:

json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))

If you can be sure there's no `}{` inside a string, just replace `}{` with `}\n{` and split the lines. — Klaus D., Jan 10 '19 at 04:15
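If you cannot be sure that `}{` never appears inside a string, one alternative is to let the standard library's json.JSONDecoder.raw_decode find the end of each object for you. This is only a minimal sketch: the helper name iter_concatenated_json and the sample string are made up for illustration, and it assumes the whole file fits in memory.

```python
import json

def iter_concatenated_json(text):
    """Yield each top-level JSON value from a string like '{...}{...}{...}'."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it manually
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # returns the decoded value and the index where it stopped
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Illustrative data: note the "}{" hidden inside a string value
sample = '{"id0": "url0", "id1": "a}{b"}{"id1": "url1", "id2": "url2"}\n'

merged = {}
for chunk in iter_concatenated_json(sample):
    for key, value in chunk.items():
        merged.setdefault(key, value)  # keep the first value seen for each ID

print(merged)
```

Unlike a plain string replace, this does not care what the values contain, because the decoder itself decides where each object ends.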

score 0 · Answer 1 · answered Jan 10 '19 at 04:19


The basic difference between the file structure and valid JSON is that the commas between the objects are missing and the objects are not enclosed within [ ]. So the file can be loaded with the code snippet below:

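import json

with open('json_file.txt') as f:
    # Read the complete file
    a = f.read()

# Join everything into a single-line string (drops newlines between objects)
b = ''.join(a.splitlines())

# Add a comma after each object
# (assumes no "}" occurs inside a string value or a nested object)
b = b.replace("}", "},")

# Wrap in square brackets, dropping the trailing comma added in the previous step
b = '[' + b[:-1] + ']'

# x is now a list of dicts, one per original {...} block
x = json.loads(b)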
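
Since x ends up as a list of dicts, the repeated IDs the question asks about could then be found and dropped roughly like this. It is only a sketch: the small list below is illustrative data standing in for the real x, and "keep the first value seen" is an assumed policy.

```python
from collections import Counter

# Stand-in for the list produced by json.loads(b) (illustrative data)
x = [
    {"id0": "url0", "id1": "url1"},
    {"id1": "url1", "id2": "url2"},
    {"id0": "url0", "id3": "url3"},
]

# Count how many times each ID occurs across all dicts
counts = Counter(key for d in x for key in d)
repeated = sorted(key for key, n in counts.items() if n > 1)

# Merge into one dict, keeping the first value seen for each ID
merged = {}
for d in x:
    for key, value in d.items():
        merged.setdefault(key, value)

print(repeated)
print(len(merged))
```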

rohit kamal, answered Jan 10 '19 at 04:19

Tried your code. Do I get it right that the problem was in several {...}{...} tables in a row? First, we added a " }, " between them, after added a [ ] in the start and in the end. The final structure would be like: [ {},{},{},{}. ] – Andrew Barashev Jan 10 '19 at 05:05
I hope it worked :) Yes, the problem was that the json rows were not `,` separated and since it was supposed to be list of jsons, `[]` also needed to be added – rohit kamal Jan 10 '19 at 06:54

Chiheb Nexus · Accepted Answer · 2019-01-10T05:21:45.717

Your file is about 9.5 MB, so it would take you a while to open it and debug it manually. Using the head and tail tools (normally found in any GNU/Linux distribution), you'll see that:

# You can use Python as well to read chunks from your file
# and see its structure and what is causing the decode problem,
# but I prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
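
For anyone without head and tail at hand, the same inspection could be done in Python, as the comment above mentions. This is a rough sketch: peek and boundaries are hypothetical helper names, and the demo runs on a tiny throwaway file rather than the real 111111111.txt.

```python
import os
import tempfile

def peek(path, n=200):
    """Return the first and last n bytes of a file, like head -c n / tail -c n."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        head = f.read(n)
        f.seek(max(size - n, 0))
        tail = f.read(n)
    return head, tail

def boundaries(path, marker=b'}{'):
    """Return (offset of the first marker, number of markers) in the file."""
    with open(path, 'rb') as f:
        data = f.read()  # fine for a file of a few MB
    return data.find(marker), data.count(marker)

# Demo on a small throwaway file instead of the real log
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'wb') as f:
        f.write(b'{"a": "1"}{"b": "2"}{"a": "1"}')
    head, tail = peek(path, 5)
    first, total = boundaries(path)

print(head, tail, first, total)
```

A non-zero count of `}{` markers is the sign that several JSON objects were written back to back.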

So the first guess is that your file is a malformed series of JSON objects, and the best approach is to separate }{ with a \n for further processing.

So, here is an example of how you can solve your problem using Python:

import json

input_file = '111111111.txt'
output_file = 'new_file.txt'

data = ''
with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with statement part can be replaced by 
    # using sed under your OS like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')


seen, total_keys, to_write = set(), 0, {}
# split the lines of the in memory data
for elm in data.split('\n'):
    # skip blank lines (e.g. a trailing newline at the end of the file)
    if not elm.strip():
        continue
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key is not seen then add it for further manipulations
        # else ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})

# write the dict's keys & values into a new file as JSON
# (mode='w' so that re-running the script does not append a second object)
with open(output_file, mode='w', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')

print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)

Output:

found duplicated key(s): 43836 from 45367

And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
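
To keep the problem from coming back, the writing side could add a newline after every dump, so that each run appends one JSON object per line (a layout often called JSON Lines). A minimal sketch under that assumption: append_record and read_records are hypothetical helper names, and a temporary file stands in for before_log.txt.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Append one dict as a single JSON line."""
    with open(path, 'a', encoding='utf8') as f:
        json.dump(record, f)
        f.write('\n')  # the newline is what keeps the objects separable

def read_records(path):
    """Read every JSON line back and merge them into one dict."""
    merged = {}
    with open(path, encoding='utf8') as f:
        for line in f:
            if line.strip():
                # on a repeated ID the value from the later line wins
                merged.update(json.loads(line))
    return merged

# Demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, 'before_log.txt')
    append_record(log_path, {"id0": "url0", "id1": "url1"})
    append_record(log_path, {"id1": "url1", "id2": "url2"})
    merged = read_records(log_path)

print(merged)
```

With this layout the file can be read line by line with json.loads and no string replacement is needed.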

Thank you very much. It is working, even that I'm not sure how)) — Andrew Barashev, Jan 10 '19 at 05:09
@AndrewBarashev If my answer fills your need, don't forget to accept it. And also, if you need more clarification don't hesitate to ask. — Chiheb Nexus, Jan 10 '19 at 05:10
Was going to do so, looking into the code atm. Thank you again. — Andrew Barashev, Jan 10 '19 at 05:19
@AndrewBarashev i've updated my code. Now it's more accurate. — Chiheb Nexus, Jan 10 '19 at 05:19
Edited the code a bit for it to make one dictionary first. Think this way it will be a bit more convenient for me in the future. Thanks again. — Andrew Barashev, Jan 10 '19 at 05:55
