
I'm trying to parse specific child nodes from a JSON file using Python.

I know similar questions have been asked and answered before, but I simply haven't been able to translate those solutions to my own problem (disclaimer: I'm not a developer).

This is the beginning of my JSON file (each new "entry" starts at "_index"):

{
"took": 83,
"timed_out": false,
"_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
},
"hits": {
    "total": 713628,
    "max_score": 1.3753585,
    "hits": [{
        "_index": "offentliggoerelser-prod-20161006",
        "_type": "offentliggoerelse",
        "_id": "urn:ofk:oid:5135592",
        "_score": 1.3753585,
        "_source": {
            "cvrNummer": 89986915,
            "indlaesningsId": "AUzWhUXw3pscZq1LGK_z",
            "sidstOpdateret": "2015-04-20T10:53:09.154Z",
            "omgoerelse": false,
            "regNummer": null,
            "offentliggoerelsestype": "regnskab",
            "regnskab": {
                "regnskabsperiode": {
                    "startDato": "2014-01-01",
                    "slutDato": "2014-12-31"
                }
            },
            "indlaesningsTidspunkt": "2015-04-20T11:10:53.529Z",
            "sagsNummer": "X15-AA-66-TA",
            "dokumenter": [{
                "dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzdlL2I5L2U2LzlkLzIxN2EtNDA1OC04Yjg0LTAwZGJlNzUwMjU3Yw.pdf",
                "dokumentMimeType": "application/pdf",
                "dokumentType": "AARSRAPPORT"
            }, {
                "dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzk0LzNlL2RjL2Q4L2I1NjUtNGJjZC05NzJmLTYyMmE4ZTczYWVhNg.xhtml",
                "dokumentMimeType": "application/xhtml+xml",
                "dokumentType": "AARSRAPPORT"
            }, {
                "dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzc5LzM3LzUwLzMxL2NjZWQtNDdiNi1hY2E1LTgxY2EyYjRmOGYzMw.xml",
                "dokumentMimeType": "application/xml",
                "dokumentType": "AARSRAPPORT"
            }],
            "offentliggoerelsesTidspunkt": "2015-04-20T10:53:09.075Z"
        }
    },

More specifically, I'm trying to extract all "dokumentUrl" where "dokumentMimeType" is equal to "application/xhtml+xml".

When I use something simple like this:

import json
from pprint import pprint

with open('output.json') as data_file:    
    data = json.load(data_file)

pprint(data['hits']['hits'][0]['_source']['dokumenter'][1]['dokumentUrl'])

I get the first URL that matches my criteria. But how do I build a list of all matching URLs in the file (all 713,628 of them) and export it to a CSV file?

I should probably mention that my end goal is to create a program that can loop through and scrape my list of URLs (I'll save that for another post!).

chrlo
  • `data['hits']['hits'][0]['_source']['dokumenter']` returns a list that you can iterate through, so something like `for item in data['hits']['hits'][0]['_source']['dokumenter']: print item['dokumentUrl']`. Does that do what you want to do (except with the check `item['dokumentMimeType'] == "application/xhtml+xml"` before printing)? – roganjosh Sep 14 '17 at 14:54

1 Answer

Hopefully I'm understanding this right; @roganjosh has the same idea. You can loop through the specific parts that contain lists of useful things. So, we can do something like:

myURL = []
hits = data['hits']['hits']
for hit in hits:
    # Assuming here that you want every matching URL in a given hit
    documents = hit['_source']['dokumenter']
    for document in documents:
        if document['dokumentMimeType'] == "application/xhtml+xml":
            myURL.append(document['dokumentUrl'])

Again, I'm hoping I understand your JSON schema well enough that this does what you want. At least it should get you close.

I also just saw the part of your question about CSV output.
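Here's a minimal sketch using Python's built-in csv module (assuming myURL is the list built above; urls.csv is just a placeholder filename):

import csv

# Write the collected URLs to a CSV file, one URL per row,
# so they end up in a single column rather than a single row.
with open('urls.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['dokumentUrl'])  # optional header row
    for url in myURL:
        # Wrap each URL in a one-element list: writerow(myURL) would
        # put all URLs in one row, and writerow(url) would split the
        # string into one column per character.
        writer.writerow([url])

Note that writer.writerow() expects a sequence and writes one row per call, which is why wrapping each URL in its own one-element list gives you a single column.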

BrandonM
  • Thank you very much! This fixes my problem. And thank you for the link to CSV outputting. I got it to write my output to a CSV file, but I'm having some trouble getting it to write the URLs in a single column, instead of a single row. Does this have something to do with my output or does the writer have a built in function to transpose the data? – chrlo Sep 15 '17 at 07:19
  • Nevermind. Found a solution. Thanks again for your help! – chrlo Sep 15 '17 at 07:50
  • Sorry, didn't see this quick enough. Hope you got your solution functioning! – BrandonM Sep 15 '17 at 14:34