I'm trying to parse specific child nodes from a JSON file using Python.
I know similar questions have been asked and answered before, but I haven't been able to adapt those solutions to my own problem (disclaimer: I'm not a developer).
This is the beginning of my JSON file (each new "entry" starts at "_index"):
{
  "took": 83,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "hits": {
    "total": 713628,
    "max_score": 1.3753585,
    "hits": [{
      "_index": "offentliggoerelser-prod-20161006",
      "_type": "offentliggoerelse",
      "_id": "urn:ofk:oid:5135592",
      "_score": 1.3753585,
      "_source": {
        "cvrNummer": 89986915,
        "indlaesningsId": "AUzWhUXw3pscZq1LGK_z",
        "sidstOpdateret": "2015-04-20T10:53:09.154Z",
        "omgoerelse": false,
        "regNummer": null,
        "offentliggoerelsestype": "regnskab",
        "regnskab": {
          "regnskabsperiode": {
            "startDato": "2014-01-01",
            "slutDato": "2014-12-31"
          }
        },
        "indlaesningsTidspunkt": "2015-04-20T11:10:53.529Z",
        "sagsNummer": "X15-AA-66-TA",
        "dokumenter": [{
          "dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzdlL2I5L2U2LzlkLzIxN2EtNDA1OC04Yjg0LTAwZGJlNzUwMjU3Yw.pdf",
          "dokumentMimeType": "application/pdf",
          "dokumentType": "AARSRAPPORT"
        }, {
          "dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzk0LzNlL2RjL2Q4L2I1NjUtNGJjZC05NzJmLTYyMmE4ZTczYWVhNg.xhtml",
          "dokumentMimeType": "application/xhtml+xml",
          "dokumentType": "AARSRAPPORT"
        }, {
          "dokumentUrl": "http://regnskaber.virk.dk/51968998/ZG9rdW1lbnRsYWdlcjovLzAzLzc5LzM3LzUwLzMxL2NjZWQtNDdiNi1hY2E1LTgxY2EyYjRmOGYzMw.xml",
          "dokumentMimeType": "application/xml",
          "dokumentType": "AARSRAPPORT"
        }],
        "offentliggoerelsesTidspunkt": "2015-04-20T10:53:09.075Z"
      }
    },
More specifically, I'm trying to extract every "dokumentUrl" value whose "dokumentMimeType" is equal to "application/xhtml+xml".
When I use something simple like this:
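import json
from pprint import pprint

# Load the whole JSON file into memory.
with open('output.json') as data_file:
    data = json.load(data_file)

# First hit, second document: the entry whose MIME type is application/xhtml+xml.
pprint(data['hits']['hits'][0]['_source']['dokumenter'][1]['dokumentUrl'])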
I get the first URL that matches my criterion. But how do I build a list of all matching URLs in the file (all 713,628 of them) and export it to a CSV file?
I should probably mention that my end goal is to write a program that loops over my list of URLs and scrapes each one (I'll save that for another post!).
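
For illustration, here is a minimal sketch of one way the extraction described above could be done. It assumes every entry under data['hits']['hits'] follows the structure shown in the excerpt, and it tolerates entries where "_source" or "dokumenter" is missing or null. The function names (extract_urls, write_urls_to_csv, main) and the output filename urls.csv are hypothetical, chosen only for this example.

```python
import csv
import json


def extract_urls(data, mime_type="application/xhtml+xml"):
    """Return every dokumentUrl whose dokumentMimeType equals mime_type."""
    urls = []
    for hit in data.get("hits", {}).get("hits", []):
        # Assumption: "_source" or "dokumenter" may be absent or null,
        # so fall back to empty containers instead of raising.
        documents = (hit.get("_source") or {}).get("dokumenter") or []
        for doc in documents:
            url = doc.get("dokumentUrl")
            if doc.get("dokumentMimeType") == mime_type and url:
                urls.append(url)
    return urls


def write_urls_to_csv(urls, csv_path):
    """Write one URL per row, preceded by a single header row."""
    # newline="" is what the csv module expects for files it writes to.
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["dokumentUrl"])
        writer.writerows([url] for url in urls)


def main(json_path, csv_path):
    """Load the JSON file, extract matching URLs, and save them as CSV."""
    with open(json_path, encoding="utf-8") as data_file:
        data = json.load(data_file)
    urls = extract_urls(data)
    write_urls_to_csv(urls, csv_path)
    return len(urls)
```

Calling main('output.json', 'urls.csv') would read the file from the question and write the matching URLs to urls.csv, returning how many were found. Looping over every hit and every document, rather than indexing [0] and [1], is what makes this independent of where the XHTML document sits within each "dokumenter" list.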