How to read a list of JSON-formatted dictionaries containing Japanese text without inserting new characters?

Question

I'm scraping data from ~500 .js files, all of which are formatted like this:

dict[0]=[{"some_key": "<b>名詞</b>", "another_key": "modification"}, {"some_key": "<b>名詞</b>", "another_key": "idea"}]

My code looks like this:

with open(filename, 'r', encoding='utf-8', errors='ignore') as my_file:
    obj = my_file.read()

# Locate the second '[' (the first one belongs to "dict[0]") and the last ']'.
my_indexer_left = obj.find('[', obj.find('[') + 1)
my_indexer_right = obj.rfind(']')
new_obj = obj[my_indexer_left:my_indexer_right + 1]

After new_obj is created, it is still a string, and I can't convert it into anything else.

I tried list(new_obj):

new_list_obj = list(new_obj)
for item in new_list_obj:
    print(item)

While print(type(new_list_obj)) tells me it is a list, the loop prints one character at a time.

I've tried several other things along these lines to get this to work.

The closest I came was referencing the answer here to come up with the following:

j = json.dumps(new_obj,ensure_ascii=False).encode('utf8').decode()

But when I print(j), all of the quotation marks (") are turned into \", and print(type(j)) says str.

I want to be able to read these files, iterate over all the dictionary (JSON) objects, and access their keys and values.

is the file a json file? if so, you can read it into python using the `json` module. — James, Nov 14 '19 at 11:08
@James, it's in a JavaScript file and I have been trying to read it using the `json` module but for some reason it stays as a `string` and I can't get it to turn into a `dict` — Programming_Learner_DK, Nov 14 '19 at 11:10
JavaScript is not the same as JSON. It will have lines of programming code that will not be parsable as JSON. Can you post the file? — James, Nov 14 '19 at 11:12
Possible duplicate of [How to parse data in JSON?](https://stackoverflow.com/questions/7771011/how-to-parse-data-in-json) — mkrieger1, Nov 14 '19 at 11:18
@mkrieger1, author has JSON nested in a .js file. This is a bit more complicated than just parsing a JSON file. — James, Nov 14 '19 at 11:19
@mkrieger1 I saw the question you referenced as a duplicate when researching how to accomplish this and it didn't solve the issue. — Programming_Learner_DK, Nov 14 '19 at 11:21
using list "[" or "]" in string to manipulate as string make me cry — Wonka, Nov 14 '19 at 11:22

score 1 · Answer 1 · answered Nov 14 '19 at 11:45

Judging from the example file you uploaded, it can be done as follows in two simple steps:

1. Strip the dict[i]= prefix and the ; suffix from the file contents (using a regular expression to generalize i).
2. Parse the resulting data as JSON.

import json
import re

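def parse_file(filename):
    # Read as UTF-8 explicitly so the Japanese text decodes the same way on every platform.
    with open(filename, encoding='utf-8') as f:
        data = f.read()

    # Capture everything between the "dict[i]=" prefix and the trailing ";".
    # re.DOTALL lets "." match newlines in case the array spans several lines.
    json_text = re.match(r'dict\[\d+\]=(.*);', data, re.DOTALL).group(1)
    return json.loads(json_text)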
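
As an illustration of the two steps, here is a minimal sketch that applies the same idea to an in-memory string and then iterates over the keys and values. The sample text is copied from the question. The helper name extract_records is hypothetical, and the pattern assumes the trailing ; may be absent. The last lines also show why json.dumps in the question produced \" everywhere: dumps serializes a Python object (here a plain string) into JSON text, while loads parses JSON text into Python objects.

```python
import json
import re

# Sample content copied from the question (no trailing ";" in that example).
sample = ('dict[0]=[{"some_key": "<b>名詞</b>", "another_key": "modification"}, '
          '{"some_key": "<b>名詞</b>", "another_key": "idea"}]')

def extract_records(text):
    """Hypothetical helper: strip the dict[i]= wrapper and parse the rest as JSON."""
    # Assumption: the payload is everything after "dict[i]=" up to an optional ";".
    match = re.match(r'\s*dict\[\d+\]\s*=\s*(.*?)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        raise ValueError('text does not look like dict[i]=[...]')
    return json.loads(match.group(1))

records = extract_records(sample)   # a list of dicts
for record in records:
    for key, value in record.items():
        print(key, value)

# json.dumps goes the other way: it turns the string into a JSON string
# literal, which is why every inner quote ends up escaped as \".
escaped = json.dumps(sample, ensure_ascii=False)
print(type(escaped))
```

Because the result is an ordinary list of dictionaries, the Japanese text is preserved as-is and no extra characters are introduced.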

score 0 · Accepted Answer · answered Nov 14 '19 at 11:43

Looking at the file you posted, it is a JavaScript file that assigns a JSON array to index 0 of a variable called dict. My guess is that each of the other files assigns its array to the next index of dict.

You can parse this by reading the file in as a string, stripping off the extra JavaScript pieces (the trailing ; and everything up to the first = sign), and then passing the rest to the json.loads function.

import json
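with open('000.js', encoding='utf-8') as fp:
    raw_str = fp.read()

# Drop surrounding whitespace and the trailing semicolon.
raw_str = raw_str.strip().strip(';')
# Keep only what follows the first "=", i.e. the JSON array itself.
raw_str = raw_str.split('=', 1)[-1]
data = json.loads(raw_str)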
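
Since the question mentions roughly 500 files, the same approach can be wrapped in a loop. The following is only a sketch under assumptions not stated in the thread: that all files sit in one directory and end in .js, and that a file which fails to parse should be skipped rather than stop the run. The names load_js_file and load_all are hypothetical, and the demo writes two throwaway files to a temporary directory so that it runs on its own.

```python
import json
import tempfile
from pathlib import Path

def load_js_file(path):
    """Parse one file of the form dict[i]=[...]; into a list of dicts."""
    raw = Path(path).read_text(encoding='utf-8')
    raw = raw.strip().rstrip(';')     # drop whitespace and the trailing ";"
    raw = raw.split('=', 1)[-1]       # keep what follows the first "="
    return json.loads(raw)

def load_all(directory):
    """Hypothetical helper: map each *.js file name to its parsed records."""
    results = {}
    for path in sorted(Path(directory).glob('*.js')):
        try:
            results[path.name] = load_js_file(path)
        except json.JSONDecodeError as exc:
            # Assumption: report and skip files that do not match the layout.
            print(f'skipping {path.name}: {exc}')
    return results

# Demo with throwaway files; replace tmp with the real directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, '000.js').write_text(
        'dict[0]=[{"some_key": "<b>名詞</b>", "another_key": "idea"}];\n',
        encoding='utf-8')
    Path(tmp, '001.js').write_text('not json at all', encoding='utf-8')
    parsed = load_all(tmp)

for name, records in parsed.items():
    for record in records:
        print(name, record['some_key'], record['another_key'])
```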
