Apologies if this is very simple or has already been asked; I am new to Python and to working with JSON files, so I'm quite confused.

I have a 9 GB JSON file scraped from a website. The data consists of information about some 3 million individuals. Each individual has attributes, but not all individuals have the same attributes. An attribute corresponds to a key in the JSON file, like so:

{
  "_id": "in-00000001",
  "name": {
    "family_name": "Trump",
    "given_name": "Donald"
  },
  "locality": "United States",
  "skills": [
    "Twitter",
    "Real Estate",
    "Golf"
  ],
  "industry": "Government",
  "experience": [
    {
      "org": "Republican",
      "end": "Present",
      "start": "January 2017",
      "title": "President of the United States"
    },
    {
      "org": "The Apprentice",
      "end": "2015",
      "start": "2003",
      "title": "The guy that fires people"
    }
  ]
}

So here, _id, name, locality, skills, industry and experience are attributes (keys). Another profile may have additional attributes, like education, awards, interests, or lack some attribute found in another profile, like the skills attribute, and so on.
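(In Python terms, once such an object is parsed with `json.loads`, those attributes are just dictionary keys, so whether a profile "has" an attribute can be tested directly; a small illustration using a trimmed version of the profile above:)

```python
import json

# The example profile above, trimmed to a few attributes
profile = json.loads('{"_id": "in-00000001", "industry": "Government", '
                     '"skills": ["Twitter", "Real Estate", "Golf"]}')

print("skills" in profile)     # True:  this profile has the attribute
print("education" in profile)  # False: this one lacks it
```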

What I'd like to do is scan through each profile in the JSON file, and if a profile contains the attributes skills, industry and experience, extract that information and insert it into a data frame (I suppose I need Pandas for this?). From experience, I would specifically want to extract the name of their current employer, i.e. the org of the most recent listing. The data frame would look like this:

    Industry   | Current employer | Skills
    ___________________________________________________________________
    Government | Republican       | Twitter, Real Estate, Golf
    Marketing  | Marketers R Us   | Branding, Social Media, Advertising

... and so on for all profiles with these three attributes.

I'm struggling to find a good resource that explains how to do this kind of thing, hence my question.

I suppose rough pseudocode would be:

for each profile in open(path to .json file):
    if profile has keys "experience", "industry" AND "skills":
        on the same row of the data frame:
            insert current employer into the "current employer" column
            insert industry into the "industry" column
            insert list of skills into the "skills" column

I just need to know how to write this in Python.

  • I edited my answer many times. Please let me know if it is helpful. – Attersson Oct 12 '19 at 22:08
  • @Attersson Of course, thanks for taking the time to answer my question. It is late where I live, so I will carefully read through it tomorrow and let you know how it goes. – Data Oct 12 '19 at 22:12

1 Answer


I assume the file contains all the profiles, such as

{
    "profile 1" : {
        # Full object as in the example above
    },
    "profile 2" : {
        #Full object as in the example above
    }
}

Before continuing, let me show a correct way to use Pandas DataFrames.

Example of better usage of Pandas DataFrames:

Storing lists as values in a Pandas DataFrame is possible, but awkward to work with, so we will duplicate rows instead, as shown in the example below. Check this question and JD Long's answer for more detail: how to use lists as values in pandas dataframe?

ID      |    Industry   | Current employer | Skill
___________________________________________________________________
in-01   |    Government | Republican       | Twitter
in-01   |    Government | Republican       | Real Estate
in-01   |    Government | Republican       | Golf
in-02   |    Marketing  | Marketers R Us   | Branding
in-02   |    Marketing  | Marketers R Us   | Social Media
in-02   |    Marketing  | Marketers R Us   | Advertising

Find explanations in the comments in the code below:

import json
import pandas as pd

# Create an empty DataFrame with the columns as in the example
df = pd.DataFrame(columns=['ID', 'Industry', 'Employer', 'Skill'])

# Load the file as JSON.
with open('path/to/file.json') as file:
    # readlines() returns the file as a list of strings; join them and
    # parse with loads(). (json.load(file) does the same in one step.)
    obj = json.loads(''.join(file.readlines()))
    # Then iterate its items() as key/value pairs.
    # The line below depends on my first assumption:
    # depending on the file format, it might have to differ.
    for prof_key, profile in obj.items():
        # Verify that a profile contains all the required keys
        if all(key in profile for key in ("_id", "experience", "industry", "skills")):
            # The current employer is the experience entry whose end is "Present"
            current = [x for x in profile["experience"] if x["end"] == "Present"]
            if not current:
                continue  # no current position listed
            for skill in profile["skills"]:
                # Append one row per skill at the end of the DataFrame
                df.loc[len(df)] = [profile["_id"],
                                   profile["industry"],
                                   current[0]["org"],
                                   skill]

The line `df.loc[len(df)] = ...` appends a row at the end of the DataFrame: `len(df)` is the next unused integer index label, so each iteration adds a new row rather than overwriting an existing one.

When you later wish to use this information one profile at a time, you will have to use `df.groupby('ID')`.
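For instance, to collect the skills back into one list per profile (a small sketch using the long format above):

```python
import pandas as pd

# A small frame in the long format shown above
df = pd.DataFrame(
    [["in-01", "Government", "Republican", "Twitter"],
     ["in-01", "Government", "Republican", "Real Estate"],
     ["in-01", "Government", "Republican", "Golf"]],
    columns=["ID", "Industry", "Employer", "Skill"])

# groupby preserves the row order within each group,
# so the list comes back in the original order
skills = df.groupby("ID")["Skill"].apply(list)
print(skills["in-01"])  # ['Twitter', 'Real Estate', 'Golf']
```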

Let me know if you have different format(s) in your file(s), and whether this explanation is sufficient to get you started or you need more.

Attersson
  • Hi, so it seems the problem with this solution is that my json file is 9GB, so I get a memory error when executing this line: `json.loads(file.readlines())`. Is there some way to modify the code so that it reads a part of the file, works with it, and then repeats on another part of the file? – Data Oct 13 '19 at 12:55
  • To answer your question about valid values of `start` and `end`: essentially, anything considered to be a "current position" will have an `end` value of `"present"`. If the `end` value is not `"present"`, then it can be ignored. Is that enough information? – Data Oct 13 '19 at 13:00
  • Looking at the documentation for `readlines()`, I can see this function takes a "sizehint" as an input, i.e. the number of bytes to be read from the file. So is it possible to read, say, the first 500MB, process it, and then read the next 500MB,... and so on? – Data Oct 13 '19 at 13:06
  • Having tried to read the first 50000000 bytes, I get this error: `obj = json.loads(file.readlines(50000000)) # 50 MB File "C:\Users\Jake\AppData\Local\Programs\Python\Python37-32\lib\json\__init__.py", line 341, in loads raise TypeError(f'the JSON object must be str, bytes or bytearray, ' TypeError: the JSON object must be str, bytes or bytearray, not list` – Data Oct 13 '19 at 13:09
  • Ok you can read a single line with `readline`. It is not so trivial however to parse the file. You should know what to read, when to stop and what to parse into a json every time... Also, I am not sure what your file format is... if you got bytearrays and must `decode` / `encode` or just `str()`. Another approach is `read(nbytes)` – Attersson Oct 13 '19 at 16:43
  • Or iterate the lines as written here https://stackoverflow.com/questions/6475328/how-can-i-read-large-text-files-in-python-line-by-line-without-loading-it-into – Attersson Oct 13 '19 at 16:44
  • As for the list error, it is very likely for readlines() to have returned a list of strings, so join it. (editing my answer). `''.join(your_list_of_strings)` – Attersson Oct 13 '19 at 16:46
  • Hey it's okay, I've worked around the problems mentioned in my comments, and I used your answer as the basis for solving all of my requirements. So I'll accept your answer. Also, I found a way to insert a list into a pandas dataframe cell. – Data Oct 13 '19 at 16:50
  • Nice to know. You can edit my answer if you like so we can help future readers... I still recommend not inserting the list in the cell and using the groupby, for a series of reasons... but if this worked for you, that's great! – Attersson Oct 13 '19 at 17:30
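A note for future readers: if each profile is stored (or re-exported) as one JSON object per line (the JSON Lines format), the file can be processed one line at a time in constant memory, with no need to load 9 GB at once. A minimal sketch of that approach, where the in-memory `io.StringIO` and its sample profiles stand in for a real file opened with `open`:

```python
import io
import json
import pandas as pd

# Simulated JSON Lines input: one complete profile object per line.
# A real file would be opened with open('profiles.jsonl') instead.
jsonl = io.StringIO(
    '{"_id": "in-01", "industry": "Government", "skills": ["Golf"], '
    '"experience": [{"org": "Republican", "end": "Present"}]}\n'
    '{"_id": "in-02", "industry": "Marketing"}\n'  # missing keys: skipped
)

rows = []
for line in jsonl:  # reads one line at a time: constant memory
    profile = json.loads(line)
    if all(key in profile for key in ("_id", "experience", "industry", "skills")):
        current = [x for x in profile["experience"] if x["end"] == "Present"]
        if current:
            for skill in profile["skills"]:
                rows.append([profile["_id"], profile["industry"],
                             current[0]["org"], skill])

df = pd.DataFrame(rows, columns=['ID', 'Industry', 'Employer', 'Skill'])
print(df)
```

Building a list of rows and constructing the DataFrame once at the end is also much faster than calling `df.loc[...] = ...` millions of times.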