There's a data processing task I'm currently working on.
I have two python scripts each of which achieves a separate function but they operate on the same data, I think they can be combined into a single work flow, but I can't think of the most logical way to achieve this.
The data file is here, it's JSON, but it has two distinct components.
The first part looks like this:
{
"links": {
"self": "http://localhost:2510/api/v2/jobs?skills=data%20science"
},
"data": [
{
"id": 121,
"type": "job",
"attributes": {
"title": "Data Scientist",
"date": "2014-01-22T15:25:00.000Z",
"description": "Data scientists are in increasingly high demand amongst tech companies in London. Generally a combination of business acumen and technical skills are sought. Big data experience ..."
},
"relationships": {
"location": {
"links": {
"self": "http://localhost:2510/api/v2/jobs/121/location"
},
"data": {
"type": "location",
"id": 3
}
},
"country": {
"links": {
"self": "http://localhost:2510/api/v2/jobs/121/country"
},
"data": {
"type": "country",
"id": 1
}
},
It's worked on by the first python script, here:
import json
from collections import defaultdict
from pprint import pprint
with open('data-science.txt') as data_file:
data = json.load(data_file)
locations = defaultdict(int)
for item in data['data']:
location = item['relationships']['location']['data']['id']
locations[location] += 1
pprint(locations)
That renders data of this form:
1: 6,
2: 20,
3: 2673,
4: 126,
5: 459,
6: 346,
8: 11,
9: 68,
10: 82,
These are location "id"
s and the number of records that are assigned to that location.
The other part of the JSON object looks like this:
"included": [
{
"id": 3,
"type": "location",
"attributes": {
"name": "Victoria",
"coord": [
51.503378,
-0.139134
]
}
},
and is processed by this python file:
import json
from collections import defaultdict
from pprint import pprint
with open('data-science.txt') as data_file:
data = json.load(data_file)
locations = defaultdict(int)
for record in data['included']:
id = record.get('id', None)
name = record.get('attributes', {}).get('name', None)
coord = record.get('attributes', {}).get('coord', None)
print(id, name, coord)
it outputs data in this format:
3 Victoria [51.503378, -0.139134]
1 United Kingdom None
71 data science None
32 None None
3 Victoria [51.503378, -0.139134]
1 United Kingdom None
1 data mining None
22 data analysis None
33 sdlc None
38 artificial intelligence None
39 machine learning None
40 software development None
71 data science None
93 devops None
63 None None
52 Cubitt Town [51.505199, -0.018848]
What I would actually like is that the final output would look something like this:
3, Victoria, [51.503378, -0.139134], 2673
Where 2673
references the job count from the first script.
If it doesn't have any coordinates, e.g. [51.503378, -0.139134]
I can throw it away.
I'm sure it would be possible to combine these scripts together and get that output but I'm not such a comprehensive thinker and I can't figure out how to do it.
All the real project files live here.