0

I have got a file with the following lines

{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f508f-e7c8-32b8-e044-0003ba298018","municipalityCode":"0766","municipalityName":"Hedensted","streetCode":"0072","streetName":"Værnegården","streetBuildingIdentifier":"13","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"8000","districtName":"Århus","presentationString":"Værnegården 13, 8000 Århus","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(553564 6179299)","x":553564,"y":6179299}]}

I want to transform every line into a csv readable file with headers. Like the following

status,message,data,addressAccessId,municipalityCode,municipalityName,streetCode,streetName,streetBuildingIdentifier,mailDeliverySublocationIdentifier,districtSubDivisionIdentifier,postCodeIdentifier,districtName,presentationString,addressSpecificCount,validCoordinates,geometryWkt,x,y
OK,OK,data:type,addressAccessType,0a3f508f-e7c8-32b8-e044-0003ba298018,0766,Hedensted,0072,Værnegården,13,,,8000,Århus,Værnegården 13, 8000 Århus,1,true,POINT553564 6179299,553564,6179299

How do I accomplish that? Code and explanation are very welcome. So far this is what I have come up with the following from this example:(How can I convert JSON to CSV?)

x = json.loads(x)

f = csv.writer(open('test.csv', 'wb+'))

# Write CSV Header, If you dont need that, remove this line
f.writerow(['status', 'message', 'type', 'addressAccessId', 'municipalityCode','municipalityName','streetCode','streetName','streetBuildingIdentifier','mailDeliverySublocationIdentifier','districtSubDivisionIdentifier','postCodeIdentifier','districtName','presentationString','addressSpecificCount','validCoordinates','geometryWkt','x','y'])


for x in x:
    f.writerow([x['status'], 
                x['message'], 
                x['data']['type'], 
                x['data']['addressAccessId'],
                x['data']['municipalityCode'],
                x['data']['municipalityName'],
                x['data']['streetCode'],
                x['data']['streetName'],
                x['data']['streetBuildingIdentifier'],
                x['data']['mailDeliverySublocationIdentifier'],
                x['data']['districtSubDivisionIdentifier'],
                x['data']['postCodeIdentifier'],
                x['data']['districtName'],
                x['data']['presentationString'],
                x['data']['addressSpecificCount'],
                x['data']['validCoordinates'],
                x['data']['geometryWkt'],
                x['data']['x'],
                x['data']['y']])

I have looked through and tried a lot of other solutions, including DictWriter, replace() and translate() to remove characthers but have not yet been able to transform the line to my need. The purpose being able to select the fields that are output into a new file, and transforming x and y to a new coordinate system. But for now Im just trying to parse the above line to a csv file. Can anyone offer code and explanation of their code? Thank you very much for your time.

Below are the first few lines of my addresses.txt

    {"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5081-e039-32b8-e044-0003ba298018","municipalityCode":"0265","municipalityName":"Roskilde","streetCode":"0831","streetName":"Brønsager","streetBuildingIdentifier":"69","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Svogerslev","postCodeIdentifier":"4000","districtName":"Roskilde","presentationString":"Brønsager 69, 4000 Roskilde","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(690026 6169309)","x":690026,"y":6169309}]}
    {"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5089-ecab-32b8-e044-0003ba298018","municipalityCode":"0461","municipalityName":"Odense","streetCode":"9505","streetName":"Vægtens Kvarter","streetBuildingIdentifier":"271","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Holluf Pile","postCodeIdentifier":"5220","districtName":"Odense SØ","presentationString":"Vægtens Kvarter 271, 5220 Odense SØ","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(592191 6135829)","x":592191,"y":6135829}]}
    {"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f507c-adc3-32b8-e044-0003ba298018","municipalityCode":"0165","municipalityName":"Albertslund","streetCode":"0445","streetName":"Skyttehusene","streetBuildingIdentifier":"33","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"2620","districtName":"Albertslund","presentationString":"Skyttehusene 33, 2620 Albertslund","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(711079 6174741)","x":711079,"y":6174741}]}
    {"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f509c-7f57-32b8-e044-0003ba298018","municipalityCode":"0851","municipalityName":"Aalborg","streetCode":"5205","streetName":"Løvstikkevej","streetBuildingIdentifier":"36","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"9000","districtName":"Aalborg","presentationString":"Løvstikkevej 36, 9000 Aalborg","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(552407 6322490)","x":552407,"y":6322490}]}
    {"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5098-32a6-32b8-e044-0003ba298018","municipalityCode":"0779","municipalityName":"Skive","streetCode":"0462","streetName":"Landevejen","streetBuildingIdentifier":"52","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Håsum","postCodeIdentifier":"7860","districtName":"Spøttrup","presentationString":"Landevejen 52, 7860 Spøttrup","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(491515 6269739)","x":491515,"y":6269739}]}
Community
  • 1
  • 1
Philip
  • 944
  • 11
  • 26

2 Answers2

3

Note that the data key holds a list of dictionaries. x['data']['type'] wouldn't work, but x['data'][0]['type'] would. There might be more than one such dictionary in that list, however. I'll assume you want a CSV row per x['data'] dictionary.

Next, it appears you have a UTF-8 BOM on every line; whatever wrote this was not using UTF-8 encoding correctly. We need to strip this marker, the first 3 characters.

Last, JSON strings are always Unicode data, and you have non-ASCII characters in your data, so you'll have to encode to bytestrings again before passing the data to the CSV writer object.

I'd use csv.DictWriter here, with a pre-defined list of field names:

import codecs
import csv
import json

fields = [
    'status', 'message', 'type', 'addressAccessId', 'municipalityCode', 
    'municipalityName', 'streetCode', 'streetName', 'streetBuildingIdentifier',
    'mailDeliverySublocationIdentifier', 'districtSubDivisionIdentifier',
    'postCodeIdentifier', 'districtName', 'presentationString', 'addressSpecificCount',
    'validCoordinates', 'geometryWkt', 'x', 'y']


with open('test.csv', 'wb') as csvfile, open('jsonfile', 'r') as jsonfile:
    writer = csv.DictWriter(csvfile, fields)
    writer.writeheader()

    for line in jsonfile:
        if line.startswith(codecs.BOM_UTF8):
            line = line[3:]
        entry = json.loads(line)
        for item in entry['data']:
            row = dict(item, status=entry['status'], message=entry['message'])
            row = {k.encode('utf8'): unicode(v).encode('utf8') for k, v in row.iteritems()}
            writer.writerow(row)

The row dictionary is basically a copy of each of the dictionaries in the entry['data'] list, with the status and message keys copied over separately. This makes row a flat dictionary instead.

I also read your input file line by line, as you say that each line contains a separate JSON entry.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • do you want to put that `writer.writerow(row)` in the `for` loop? – colcarroll Feb 26 '14 at 19:58
  • Thank you very much for your detailed answer, it definitely helps a lot. You are right to assume that I have a file with multiple lines where the data that I want is in `x[data]`. However, when I try you code I get the following error `ValueError: No JSON object could be decoded` is that because of the file that contains my json-lines or could it be beacuse the lines are not valid json? – Philip Feb 26 '14 at 19:58
  • @JLLagrange: indeed I do. – Martijn Pieters Feb 26 '14 at 20:00
  • @PhilipHoyos: Does your file have empty lines? Are all JSON objects on separate lines, one per line, or is there one *big* JSON object which is a list of objects, or is it a different format altogether. Can you give us a sample of the first few lines? – Martijn Pieters Feb 26 '14 at 20:01
  • @MartijnPieters every line in the file is identical, and has the same structure. It is not one _big_ JSON file. I have added the first 4 lines of the file to the question. Thank you all for taking your time! – Philip Feb 26 '14 at 20:10
  • @PhilipHoyos: that does look correct. Then I suspect that you have either corrupted lines or empty lines in the file. You could use `try:`, `json.loads(line)`, `except ValueError:`, `print 'skipping line:', line.rstrip()`, `continue` instead of the plain `json.loads()` line. – Martijn Pieters Feb 26 '14 at 20:14
  • @MartijnPieters: Thanks, that got me a bit more information. It showed me that it skips every line in the file. – Philip Feb 26 '14 at 20:26
  • @PhilipHoyos: Every line? Your data, as posted here, works just fine. I've updated the answer to use `repr(line)` instead. What does a skipped line look like when printed like that? – Martijn Pieters Feb 26 '14 at 20:28
  • @MartijnPieters: THat is very strange. I have tried both the full file and the sample and neither works. I will have to investigate in the morning. Thank you very much for your time. – Philip Feb 26 '14 at 20:53
  • @MartijnPieters: Skipped line (edit because of char count): `Unable to parse line: '\xef\xbb\xbf{"status":"OK","message":"OK","data":[{"type":"addressAccessType","municipalityName":"Hedensted","streetCode":"0072","streetName":"V\xc3\xa6rneg\xc3\xa5rden","streetBuildingIdentifier":"13","districtSubDivisionIdentifier":"","postCodeIdentifier":"8000","districtName":"\xc3\x85rhus","presentationString":"V\xc3\xa6rneg\xc3\xa5rden 13, 8000 \xc3\x85rhus","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(553564 6179299)","x":553564,"y":6179299}]}'` – Philip Feb 27 '14 at 05:53
  • I suspect that this is a encode/decode problem. When I try changing the encoding of the file to `ANSI` I get encoded characters like this `(Å=\xc3\x85)`. `UTF-8` gives me this `UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 1: ordinal not in range(128)`. How do I force python to the right encoding? – Philip Feb 27 '14 at 07:35
  • @PhilipHoyos: I missed that your JSON data has non-ASCII data in it; the lines start with a UTF-8 BOM (a Microsoft-ism that should only be at the start of the file, not each line). Updated the answer to handle the BOM and to encode the row before writing. – Martijn Pieters Feb 27 '14 at 10:40
  • @MartijnPieters: Thank you for all your time. It does not quite work yet though. I get the follwoing error: `Traceback (most recent call last): File "C:/DevPython/geo/stack.py", line 23, in row = {k.encode('utf8'): v.encode('utf8') for k, v in row.iteritems()} File "C:/DevPython/geo/stack.py", line 23, in row = {k.encode('utf8'): v.encode('utf8') for k, v in row.iteritems()} AttributeError: 'bool' object has no attribute 'encode'` – Philip Feb 27 '14 at 11:56
  • @PhilipHoyos: Ah, yes, **one** of your entries is a boolean, not a string. We can just make that a string (as the CSV writer will do anyway). Update coming. – Martijn Pieters Feb 27 '14 at 12:08
  • @MartijnPieters: Ah I get the problem now, so when it the next error is `AttributeError: 'int' object has no attribute 'encode'`. Then its because that **one** or **more** entities are `int` and not `string`? So I guess it is possible to either handle every item specifically, or evaluate all items. Is that correct? Thank you very much for your time and effort! Its great learning! – Philip Feb 28 '14 at 05:54
  • @PhilipHoyos: hehe, that's because `x` and `y` are integers, I missed those. – Martijn Pieters Feb 28 '14 at 08:41
  • @PhilipHoyos: actually, I was over-engineering. Corrected by making all values `unicode()` values first before encoding. – Martijn Pieters Feb 28 '14 at 08:44
  • @MartijnPieters: Thank you very much! You have been most helpful! – Philip Feb 28 '14 at 13:54
0

Open the output file with cvs.DictWriter() and define the output header fields as you specified. Use extrasaction='ignore' and restval='' as options.

Look at Opening A large JSON file in Python with no newlines for csv conversion Python 2.6.6 for help with processing large files as I had a similar question Also look at the question that I link to.

I build a similar type of system from a JSON using appropriate loops.

for example,

def parse_row(currdata):
  outx = {}
  # currdata is defined earlier to point to the x['data'] dictionary
  for eachx in currdata:
    outx[eachx] = currdata[eachx]
  return outx

where this is in a function with currdata as an argument and called with x['data'][row] as the input argument.

rows = len(x['data'])
for row in range(rows):
  outx = parse_row(x['data'][row])
  # process the row and create output

This should allow you to set up the parsing properly. I cannot copy the actual code into this answer but this should point you to a solution.

Community
  • 1
  • 1
sabbahillel
  • 4,357
  • 1
  • 19
  • 36