1

I need to convert hierarchical data (AVRO data, which boils down to JSON) into tabular data (csv). Since AVRO have strict schema, I know essentially what form the JSON will take, but I have to do this for many different schema, so I'm looking for a consistent, declarative way to express the transformations I need to make. For example, if my incoming data looks like this…

{
    "customers": [
        {
            "addresses": [
                {
                    "city": "Los Angeles", 
                    "country": "USA", 
                    "county": null, 
                    "postalCode": "90064", 
                    "stateOrProvince": "California", 
                    "street1": "11832 W. Pico Blvd.", 
                    "street2": "", 
                    "street3": "", 
                    "street4": "", 
                    "tags": [
                        "BILLING"
                    ]
                }
            ], 
            "company": "", 
            "dateCreated": "2009-04-24T11:42:31+00:00", 
            "dateOfBirth": null, 
            "doNotCall": null, 
            "email": {
                "emailAddress": "general@magentocommerce.com"
            }, 
            "emailOptOut": null, 
            "fullName": {
                "firstName": "Test", 
                "lastName": "General", 
                "middleName": "", 
                "prefix": "", 
                "suffix": ""
            }, 
            "gender": null, 
            "id": {
                "Id": "2", 
                "namespace": "1000020016"
            }, 
            "lastModified": "2009-05-08T23:33:06+00:00", 
            "primaryPhone": {
                "number": "866.4.VARIEN", 
                "type": "UNKNOWN"
            }, 
            "sourceIds": null
        }
    ], 
    "totalItemsFound": 3
}

…I might need to output one row for each customer, like this:

MERCHANT ID|NUM CUSTOMERS|ID|FIRST NAME|LAST NAME|EMAIL|PHONE|STREET|CITY|STATE|ZIP|COUNTRY|EMAIL PREFERENCE
some.merch|3|1000020016-2|Test|General|general@magentocommerce.com|866.4.VARIEN|11832 W. Pico Blvd.|Los Angeles|California|90064|USA|N

I need to be able to express the following things:

  1. Get all the values from a given key as an array: All the dates-of-birth
  2. Repeat one value over every row: totalItemsFound, repeated in every row
  3. Repeat a static value in every row that comes from static data I already know the merchant channel never changes
  4. And the tricky one: Arbitrarily manipulate the incoming data to produce the desired output:
    • Convert the customer's id into namespace-id
    • Invert and change a null/boolean value into y/n, as in emailOptOut to EMAIL PREFERENCE
    • (re-)format a date or currency
    • etc

I started out with jsonpath, but that only solves #1 above. I've been slowly adding a language around jsonpath to serve 2 and 3, but I really don't have a good answer for 4 (besides eval., and I'd really hate to do that). I looked at JSON/T, but couldn't find a python library for it. I even seriously considered writing a middleware to convert the JSON into XML so that I could use XSLT, but I'm hoping someone here at S/O has a better solution before I get that desperate.

Community
  • 1
  • 1
kojiro
  • 74,557
  • 19
  • 143
  • 201

1 Answers1

1

Why not attempt a functional decomposition not unlike what follows:

w = csv.writer(...)
for r in records: 
    l = {}
    for field in fields:
        f_ = rename(field)
        v_ = transform(field, r.get(field, default(field)))
        l[f_] = v_
    w.write(l)

where rename maps the old field names to the new ones, and transform converts the field's value depending on the transform set for the field, and default returns the value to be assigned to this field.

So you would only need to define the list of fields, and the functions: rename, transform, and default.

For the example you've given:

def rename(field):
    t = {'emailOptOut':'EMAIL PREFERENCE'}
    return t.get(field, field)

def transform(field, data):
    t = {'emailOptOut': bool}
    return t.get(field, lambda a: a)(data)

def default(field)
    t = {'MERCHANT ID':11039215}
    return t.get(field, None)
Dan D.
  • 73,243
  • 15
  • 104
  • 123
  • Thanks, I have to think about this answer. Unfortunately for my usual S/O workflow, this question isn't one I can just look at an answer and decide it's right. :) – kojiro Jun 14 '12 at 22:03