I'm scraping data from ~500 .js files, all of which are formatted like this:

dict[0]=[{"some_key": "<b>名詞</b>", "another_key": "modification"}, {"some_key": "<b>名詞</b>", "another_key": "idea"}]

My code looks like this:

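# Read the whole file; the with block closes it afterwards.
with open(filename, 'r', encoding='utf-8', errors='ignore') as my_file:
    obj = my_file.read()

# Slice from the second '[' (the start of the array) to the last ']'.
my_indexer_left = obj.find('[', obj.find('[') + 1)
my_indexer_right = obj.rfind(']')
new_obj = obj[my_indexer_left:my_indexer_right+1]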

Once new_obj is created, I can't convert it from a string into a list of dictionaries.

I tried list(new_obj):

new_list_obj = list(new_obj)
for item in new_list_obj:
    print(item)

And while print(type(new_list_obj)) tells me list, the print statement prints out one character at a time.

I've tried several other things along these lines to get this to work.

The closest I came was referencing the answer here to come up with the following:

j = json.dumps(new_obj,ensure_ascii=False).encode('utf8').decode()

But when I print(j) all of the quotation marks (") are turned into \" and when I print(type(j)) it says str.

I want to be able to read these files, iterate over all the dictionary (json) objects and access the keys and values.

Programming_Learner_DK
  • is the file a json file? if so, you can read it into python using the `json` module. – James Nov 14 '19 at 11:08
  • @James, it's in a JavaScript file and I have been trying to read it using the `json` module but for some reason it stays as a `string` and I can't get it to turn into a `dict` – Programming_Learner_DK Nov 14 '19 at 11:10
  • JavaScript is not the same as JSON. It will have lines of programming code that will not be parsable as JSON. Can you post the file? – James Nov 14 '19 at 11:12
  • Possible duplicate of [How to parse data in JSON?](https://stackoverflow.com/questions/7771011/how-to-parse-data-in-json) – mkrieger1 Nov 14 '19 at 11:18
  • @mkrieger1, the author has JSON nested in a .js file. This is a bit more complicated than just parsing a JSON file. – James Nov 14 '19 at 11:19
  • @mkrieger1 I saw the question you referenced as a duplicate when researching how to accomplish this and it didn't solve the issue. – Programming_Learner_DK Nov 14 '19 at 11:21
  • Using "[" and "]" as markers to slice the string by hand makes me cry – Wonka Nov 14 '19 at 11:22

2 Answers

Judging from the example file you uploaded, it can be done as follows in two simple steps:

  1. Strip dict[i]= prefix and ; suffix from file contents (using a regular expression to generalize i).
  2. Parse resulting data as JSON.
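import json
import re

def parse_file(filename):
    # The files contain non-ASCII text, so read them as UTF-8 explicitly.
    with open(filename, encoding='utf-8') as f:
        data = f.read()

    # Capture everything between the "dict[i]=" prefix and the final ";",
    # including line breaks (re.DOTALL).
    json_text = re.match(r'dict\[\d+\]=(.*);', data, re.DOTALL).group(1)
    return json.loads(json_text)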
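
To illustrate what the parsed result looks like, here is a minimal sketch that runs the same idea on an in-memory string instead of a file. The `sample` string is copied from the line in the question with a trailing `;` added, and the slightly more lenient pattern (optional whitespace, optional `;`) is my own assumption, not something the uploaded file is known to need.

```python
import json
import re

# Hypothetical file contents: the sample line from the question plus a ";".
sample = (
    'dict[0]=[{"some_key": "<b>名詞</b>", "another_key": "modification"}, '
    '{"some_key": "<b>名詞</b>", "another_key": "idea"}];'
)

# Strip the "dict[i]=" prefix and an optional trailing ";" (assumed lenient pattern).
match = re.match(r'\s*dict\[\d+\]\s*=\s*(.*?);?\s*$', sample, re.DOTALL)

# json.loads (not json.dumps) turns the JSON text into Python objects.
entries = json.loads(match.group(1))

# entries is a list of dicts, so keys and values are directly accessible.
for entry in entries:
    for key, value in entry.items():
        print(key, value)
```

Note that `json.dumps` goes the other way (Python object to JSON text), which is probably why the attempt in the question produced a string with escaped quotes.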
mkrieger1

Looking at the file you posted, it is a JavaScript file that assigns a JSON array to index 0 of a variable called dict. My guess is that each of the other files assigns its array to the next index of dict.

You can try to parse this by reading the file in as a string, stripping off the extra JavaScript pieces, splitting on the first = sign, and then passing the rest to the json.loads function.

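import json

with open('000.js', encoding='utf-8') as fp:
    raw_str = fp.read()

# Drop surrounding whitespace and the trailing semicolon.
raw_str = raw_str.strip().strip(';')
# Keep everything after the first '=' sign, i.e. the JSON array.
raw_str = raw_str.split('=', 1)[-1]
data = json.loads(raw_str)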
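
Since there are ~500 of these files, the same steps could be wrapped in a function and applied to a whole folder. This is only a sketch: the helper names `load_js_file` and `load_all` are made up for illustration, and it assumes every file in the folder follows the `dict[i]=[...];` layout of the one you posted.

```python
import json
from pathlib import Path

def load_js_file(path):
    """Parse one file of the form dict[i]=[...]; (hypothetical helper)."""
    raw_str = Path(path).read_text(encoding='utf-8')
    # Drop surrounding whitespace and the trailing semicolon.
    raw_str = raw_str.strip().rstrip(';')
    # Keep everything after the first '=' sign, i.e. the JSON array.
    raw_str = raw_str.split('=', 1)[-1]
    return json.loads(raw_str)

def load_all(folder):
    """Map each .js file name in folder to its parsed list of dicts."""
    return {
        path.name: load_js_file(path)
        for path in sorted(Path(folder).glob('*.js'))
    }
```

Each value in the returned dictionary is a list of dicts, so you can loop over `load_all(folder).items()` and read `entry['some_key']` or `entry['another_key']` for every entry.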
James