Apologies if this is very simple or has already been asked. I am new to Python and to working with json files, so I'm quite confused.
I have a 9 GB json file scraped from a website. This data consists of information about some 3 million individuals. Each individual has attributes, but not all individuals have the same attributes. An attribute corresponds to a key in the json file, like so:
{
    "_id": "in-00000001",
    "name": {
        "family_name": "Trump",
        "given_name": "Donald"
    },
    "locality": "United States",
    "skills": [
        "Twitter",
        "Real Estate",
        "Golf"
    ],
    "industry": "Government",
    "experience": [
        {
            "org": "Republican",
            "end": "Present",
            "start": "January 2017",
            "title": "President of the United States"
        },
        {
            "org": "The Apprentice",
            "end": "2015",
            "start": "2003",
            "title": "The guy that fires people"
        }
    ]
}
So here, _id, name, locality, skills, industry and experience are attributes (keys). Another profile may have additional attributes, like education, awards, interests, or lack some attribute found in another profile, like the skills attribute, and so on.
What I'd like to do is scan through each profile in the json file, and if a profile contains the attributes skills, industry and experience, I'd like to extract that information and insert it into a data frame (I suppose I need Pandas for this?). From experience, I would want to specifically extract the name of their current employer, i.e. the most recent listing under org. The data frame would look like this:
Industry   | Current employer | Skills
___________________________________________________________________
Government | Republican       | Twitter, Real Estate, Golf
Marketing  | Marketers R Us   | Branding, Social Media, Advertising
... and so on for all profiles with these three attributes.
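To make the target row concrete, here is a rough sketch of the extraction for a single profile that has already been parsed into a Python dict. It assumes the first entry under experience is the most recent one, as in the example profile above; that ordering is an assumption, not something the data guarantees. The names REQUIRED_KEYS and profile_to_row are made up for illustration.

```python
# Hypothetical names, chosen for this sketch only.
REQUIRED_KEYS = ("skills", "industry", "experience")


def profile_to_row(profile):
    """Return one table row as a dict, or None if a required key is missing."""
    if not all(key in profile for key in REQUIRED_KEYS):
        return None
    experience = profile["experience"]
    # Assumption: entries are listed newest first, so index 0 is the current employer.
    current_employer = experience[0].get("org") if experience else None
    return {
        "Industry": profile["industry"],
        "Current employer": current_employer,
        "Skills": ", ".join(profile["skills"]),
    }


sample = {
    "_id": "in-00000001",
    "skills": ["Twitter", "Real Estate", "Golf"],
    "industry": "Government",
    "experience": [
        {"org": "Republican", "end": "Present"},
        {"org": "The Apprentice", "end": "2015"},
    ],
}
row = profile_to_row(sample)
```

A list of such row dicts can be passed to pandas.DataFrame(rows) to build the data frame, with the dict keys becoming the column names.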
I'm struggling to find a good resource that explains how to do this kind of thing, hence my question.
I suppose rough pseudocode would be:
for each profile in open(path to .json file):
    if profile has keys "experience", "industry" AND "skills":
        on the same row of the data frame:
            insert current employer into "current employer" column of data frame
            insert industry into "industry" column of data frame
            insert list of skills into "skills" column of data frame
I just need to know how to write this in Python.
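For reference, the outer loop of the pseudocode might be sketched as follows. This assumes the file holds one JSON object per line; that layout is a guess, since nothing above says how the 9 GB file is structured. The function name iter_matching_profiles is hypothetical.

```python
import json


def iter_matching_profiles(path, required=("skills", "industry", "experience")):
    """Yield each profile (as a dict) that contains every key in `required`.

    Assumes one JSON object per line, so only one profile is held in
    memory at a time instead of the whole file.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            profile = json.loads(line)
            if all(key in profile for key in required):
                yield profile
```

Looping over iter_matching_profiles("profiles.json") (the file name is a placeholder) then visits only the profiles that have all three attributes. If the file is instead a single top-level array, reading it line by line will not work, and an incremental parser such as the third-party ijson package may be a better fit.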