
There's a data processing task I'm currently working on.

I have two Python scripts, each of which performs a separate task, but they operate on the same data. I think they could be combined into a single workflow, but I can't work out the most logical way to achieve this.

The data file is here; it's JSON, but it has two distinct components.

The first part looks like this:

{
    "links": {
        "self": "http://localhost:2510/api/v2/jobs?skills=data%20science"
    },
    "data": [
        {
            "id": 121,
            "type": "job",
            "attributes": {
                "title": "Data Scientist",
                "date": "2014-01-22T15:25:00.000Z",
                "description": "Data scientists are in increasingly high demand amongst tech companies in London. Generally a combination of business acumen and technical skills are sought. Big data experience ..."
            },
            "relationships": {
                "location": {
                    "links": {
                        "self": "http://localhost:2510/api/v2/jobs/121/location"
                    },
                    "data": {
                        "type": "location",
                        "id": 3
                    }
                },
                "country": {
                    "links": {
                        "self": "http://localhost:2510/api/v2/jobs/121/country"
                    },
                    "data": {
                        "type": "country",
                        "id": 1
                    }
                },

It's processed by the first Python script, here:

import json
from collections import defaultdict
from pprint import pprint

with open('data-science.txt') as data_file:
    data = json.load(data_file)

locations = defaultdict(int)

for item in data['data']:
    location = item['relationships']['location']['data']['id']
    locations[location] += 1

pprint(locations)

That renders data of this form:

         1: 6,
         2: 20,
         3: 2673,
         4: 126,
         5: 459,
         6: 346,
         8: 11,
         9: 68,
         10: 82,

These are location "id"s and the number of records that are assigned to that location.

The other part of the JSON object looks like this:

"included": [
    {
        "id": 3,
        "type": "location",
        "attributes": {
            "name": "Victoria",
            "coord": [
                51.503378,
                -0.139134
            ]
        }
    },

and is processed by this Python script:

import json
from collections import defaultdict
from pprint import pprint

with open('data-science.txt') as data_file:
    data = json.load(data_file)

locations = defaultdict(int)

for record in data['included']:
    id = record.get('id', None)
    name = record.get('attributes', {}).get('name', None)
    coord = record.get('attributes', {}).get('coord', None)
    print(id, name, coord)

It outputs data in this format:

3 Victoria [51.503378, -0.139134]
1 United Kingdom None
71 data science None
32 None None
3 Victoria [51.503378, -0.139134]
1 United Kingdom None
1 data mining None
22 data analysis None
33 sdlc None
38 artificial intelligence None
39 machine learning None
40 software development None
71 data science None
93 devops None
63 None None
52 Cubitt Town [51.505199, -0.018848]

What I would actually like is that the final output would look something like this:

3, Victoria, [51.503378, -0.139134], 2673

Where 2673 references the job count from the first script.

If a record doesn't have any coordinates (i.e. nothing like [51.503378, -0.139134]), I can throw it away.

I'm sure it must be possible to combine these scripts and get that output, but I can't figure out how to do it.

All the real project files live here.

CMorales

1 Answer


Using functions is one way to combine the two scripts; after all, they process the same data. Make a function for each of the processing blocks, then combine the results at the end:

import json
from collections import defaultdict
from pprint import pprint

def process_locations_data(data):
    # processes the 'data' block
    locations = defaultdict(int)
    for item in data['data']:
        location = item['relationships']['location']['data']['id']
        locations[location] += 1
    return locations

def process_locations_included(data):
    # processes the 'included' block
    return_list = []
    for record in data['included']:
        id = record.get('id', None)
        name = record.get('attributes', {}).get('name', None)
        coord = record.get('attributes', {}).get('coord', None)
        return_list.append((id, name, coord))
    return return_list    # return list of tuples

# load the data from file once
with open('data-science.txt') as data_file:
    data = json.load(data_file)

# use the two functions on same data
locations = process_locations_data(data)
records = process_locations_included(data)

# combine the data for printing
for record in records:
    id, name, coord = record
    references = locations[id]   # lookup the references in the dict
    print(id, name, coord, references)

The functions could have better names, but this should achieve the unification you are looking for.
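
If you also want to throw away records that have no coordinates, as the question mentions, the final loop only needs a small guard. A minimal sketch, reusing the `locations` and `records` variables from above:

# same combining loop, but skipping records without coordinates
for id, name, coord in records:
    if coord is None:
        continue                       # discard records without coordinates
    references = locations.get(id, 0)  # 0 if no jobs reference this location
    print(id, name, coord, references)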

sal
    the script works but when you try to pipe it to an output file you get the error `UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 8: ordinal not in range(128)` – CMorales Nov 30 '16 at 19:43
  • That has to do with the input data. It can be dealt with; see for example http://stackoverflow.com/questions/5760936/handle-wrongly-encoded-character-in-python-unicode-string But you could also just save to a file instead of using the `print` in the last loop; a short sketch of that follows below. – sal Nov 30 '16 at 19:47
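
A minimal sketch of that second suggestion, writing the combined rows to a UTF-8 encoded file instead of printing (the file name `locations.txt` is a placeholder, and `records`/`locations` are the variables from the answer above; `io.open` behaves the same on Python 2 and 3):

import io

# write each combined row to a UTF-8 encoded file rather than printing to stdout
with io.open('locations.txt', 'w', encoding='utf-8') as out:
    for id, name, coord in records:
        out.write(u'{} {} {} {}\n'.format(id, name, coord, locations[id]))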