
I'm currently working on a script that will query data from a REST API and write the resulting values to a CSV. The data set could potentially contain hundreds of thousands of records, but the API returns the data in sets of 100 entries. My goal is to include every key from each entry in the CSV.

What I have so far (this is a simplified structure for the purposes of this question):

import csv

def process_data(my_data):
    #This section should write my_data to a CSV file
    #I know I can use csv.DictWriter as part of the solution
    #Notice that in this example, "fieldnames" is undefined
    #Defining it is part of the question
    with open('output.csv', 'a', newline='') as output_file:
        writer = csv.DictWriter(output_file, fieldnames=fieldnames)
        for element in my_data:
            writer.writerow(element)

resp = client.get_list()

while resp.token:
    my_data = resp.data
    process_data(my_data)
    resp = client.get_list(resp.token)

The problem: Each entry doesn't necessarily have the same keys. A later entry missing a key isn't that big of a deal. My problem is when, for example, entry 364 introduces an entirely new key.

Options that I've considered:

  • Whenever I encounter a new key, read in the output CSV, append the new key to the header, and append a comma to each previous line. This leads to a TON of file I/O, which I'm hoping to avoid.
  • Rather than writing to a CSV, write the raw JSON to a file. Meanwhile, build up a list of all known keys as I iterate over the data. Once I've finished querying the API, iterate over the JSON files that I wrote, and write the CSV using the list that I built. This leads to 2 total iterations over the data, and feels unnecessarily complex (a rough sketch of this two-pass approach is below).
  • Hard code the list of potential keys beforehand. This approach is impossible, for a number of reasons.
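
For concreteness, here's roughly what option 2 might look like. It assumes the same client/resp interface as the snippet above, that each entry is a flat dict of scalar values, and a made-up intermediate file name (raw_data.jsonl, one JSON object per line so the second pass can read it back without loading everything into memory):

import csv
import json

# First pass: pull every page from the API, dump each entry as one JSON
# object per line, and keep a running list of every key seen so far.
fieldnames = []   # preserves first-seen order
seen = set()

with open('raw_data.jsonl', 'w') as raw_file:
    resp = client.get_list()
    while resp.token:
        for entry in resp.data:
            raw_file.write(json.dumps(entry) + '\n')
            for key in entry:
                if key not in seen:
                    seen.add(key)
                    fieldnames.append(key)
        resp = client.get_list(resp.token)

# Second pass: the full header is now known, so the CSV can be written.
with open('raw_data.jsonl') as raw_file, \
        open('output.csv', 'w', newline='') as output_file:
    writer = csv.DictWriter(output_file, fieldnames=fieldnames, restval='')
    writer.writeheader()
    for line in raw_file:
        writer.writerow(json.loads(line))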

None of these solutions feel particularly elegant to me, which leads me to my question. Is there a better way for me to approach this problem? Am I overlooking something obvious?

Belgabad
  • To do what you want _requires_ at least two passes over the data. The first to determine all the possible keys, and another to write them and the associated data to the csv file. You can optimize this by choosing an efficient data-structure to store the data in between the two passes (assuming you can't store it all in memory, which may be fast enough by itself that you won't need to optimize it). See [my answer](https://stackoverflow.com/a/18550652/355230) about using `tempfile.NamedTemporaryFile()` that might prove to be helpful in writing code that does this. – martineau Sep 29 '17 at 22:41
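
For reference, the write-then-rewind pattern with tempfile.NamedTemporaryFile() looks roughly like this (the sample line is just a stand-in for the first pass in the sketch above; the file is deleted automatically when the with block exits):

import tempfile

with tempfile.NamedTemporaryFile(mode='w+') as raw_file:
    raw_file.write('{"id": 1}\n')   # first pass would write each entry here
    raw_file.seek(0)                # rewind before the second (CSV-writing) pass
    for line in raw_file:
        print(line, end='')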

1 Answer


Options 1 and 2 both seem reasonable.

Does the CSV need to be valid and readable while you're creating it? If not, you could do the append of missing columns in one pass after you've finished reading from the API (which would be like a combination of the two approaches). If you do this, you'll probably have to use the regular csv.writer in the first pass rather than csv.DictWriter, since your list of columns will grow while you're writing.
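
A rough sketch of that combined approach, again assuming the client interface and flat entries from the question (the file names are placeholders, and the fix-up pass writes a fresh file rather than editing the partial one in place):

import csv

columns = []   # grows as new keys are encountered
seen = set()

# First pass: write rows as they arrive. Early rows may be shorter than the
# final column list, and no header is written yet, so the file isn't a
# valid, uniform CSV until the fix-up pass below.
with open('output_partial.csv', 'w', newline='') as partial_file:
    writer = csv.writer(partial_file)
    resp = client.get_list()
    while resp.token:
        for entry in resp.data:
            for key in entry:
                if key not in seen:
                    seen.add(key)
                    columns.append(key)
            writer.writerow([entry.get(col, '') for col in columns])
        resp = client.get_list(resp.token)

# Fix-up pass: prepend the header and pad every row out to the full width.
with open('output_partial.csv', newline='') as partial_file, \
        open('output.csv', 'w', newline='') as final_file:
    reader = csv.reader(partial_file)
    writer = csv.writer(final_file)
    writer.writerow(columns)
    for row in reader:
        writer.writerow(row + [''] * (len(columns) - len(row)))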

One thing to bear in mind: if the overall file is expected to be large (e.g. it won't fit into memory), then your solution probably needs to use a streaming approach, which is easy with CSV but fiddly with JSON. You might also want to look into alternative formats to JSON for the intermediate data (e.g. XML, BSON, etc.).

John Carter