
To simplify, I have a list as follows:

lst = [
        {
            "person_id": "HZT998", "name": "john", "skills": ["python", "sql", "r"],
            "extras": {"likes_swimming": False, "likes_cooking": True}},
        {
            "person_id": "HTY954", "name": "peter", "skills": ["python", "r", "c#"],
            "extras": {"likes_swimming": True, "likes_cooking": False}}
      ]

And I want to insert these records into SQL tables as follows:

People table:

person_inner_id (PK) | person_id | name  | likes_swimming | likes_cooking
1                    | HZT998    | john  | False          | True
2                    | HTY954    | peter | True           | False

Skills table:

skill_id (PK) | skill
1             | python
2             | sql
3             | r
4             | c#

Skills_People table:

person_inner_id (FK) | skill_id (FK)
1                    | 1
1                    | 2
1                    | 3
2                    | 1
2                    | 3
2                    | 4

So I want to flatten the inner dictionaries (insert their keys as columns), move the lists out into their own table, and create a relationship table. I also don't want to use the 'person_id' column as my primary key, because I feel it is bad for data integrity to use an externally generated ID as the primary key. However, this makes the Python implementation much harder, and I am not sure how to do it. I will also need to keep making these API calls and inserting their output into the relevant tables.
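
For reference, the schema I am aiming for would look roughly like this (column types are my best guess):

CREATE TABLE People (
    person_inner_id INT AUTO_INCREMENT PRIMARY KEY,  -- internal surrogate key
    person_id       VARCHAR(16) NOT NULL UNIQUE,     -- external API id, unique but not primary
    name            VARCHAR(64),
    likes_swimming  BOOLEAN,
    likes_cooking   BOOLEAN
);

CREATE TABLE Skills (
    skill_id INT AUTO_INCREMENT PRIMARY KEY,
    skill    VARCHAR(32) NOT NULL UNIQUE
);

CREATE TABLE Skills_People (
    person_inner_id INT NOT NULL,
    skill_id        INT NOT NULL,
    PRIMARY KEY (person_inner_id, skill_id),
    FOREIGN KEY (person_inner_id) REFERENCES People (person_inner_id),
    FOREIGN KEY (skill_id)        REFERENCES Skills (skill_id)
);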

I first tried dumping the entire original list to a JSON file:

import json

with open("data.json", "w") as fp:
    json.dump(lst, fp)

and then I tried importing that JSON straight into SQL through the MySQL Workbench table import wizard. This was successful in importing a general schema but not in inserting data; I'm guessing this is because of the nested dictionaries and lists that MySQL doesn't know how to handle.

Thank you!

  • Check this thread for the same issue as in your question: https://stackoverflow.com/questions/4251124/inserting-json-into-mysql-using-python – Oghli Nov 19 '22 at 09:51

1 Answer


Assuming that the keys in the list "lst" (e.g. "person_id" etc.) are always present, you just need to reshape the complex list into normalized lists for these three tables:

lst = [ 
        {
            "person_id": "HZT998", "name": "john", "skills": ["python", "sql", "r"], "extras": {"likes_swimming": False, "likes_cooking": True}
        }, 
        {
            "person_id": "HTY954", "name": "peter", "skills": ["python", "r", "c#"], "extras": {"likes_swimming": True, "likes_cooking": False}
        } 
    ]

data_for_first_table = []    # one dictionary per user, without 'skills'
data_dict_first_table = {}   # working dictionary for the current user
data_for_second_table = []   # every 'skill' value, collected separately
data_for_third_table = []    # one skill-to-user mapping per user
data_dict_third_table = {}   # working dictionary for the current skill-to-user mapping


######################################
# modify complex list for SQL queries
######################################
for entry in lst:
    # reset the working dictionaries for every person, otherwise values
    # from the previous person leak into the next one
    data_dict_first_table = {}
    data_dict_third_table = {}
    for key in entry:
        if key == "skills":
            for skill in entry[key]:
                data_for_second_table.append(skill)
                # third table - "match table": skill -> person_id
                data_dict_third_table[skill] = entry["person_id"]
        elif key == "extras":
            # flatten the inner dictionary into ordinary columns
            for extra in entry[key]:
                data_dict_first_table[extra] = entry[key][extra]
        else:
            data_dict_first_table[key] = entry[key]

    # store the user dictionary (without 'skills')
    data_for_first_table.append(data_dict_first_table.copy())
    # store the skill-to-user mapping
    data_for_third_table.append(data_dict_third_table.copy())
    

### remove duplicates from the 2nd list (dict.fromkeys keeps insertion order) ###
data_for_second_table = list(dict.fromkeys(data_for_second_table))
    
print("####### TEST1 #########")
print(data_for_first_table)
print("####### TEST2 #########")
print(data_for_second_table)
print("####### TEST3 #########")
print(data_for_third_table)
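
With the per-person reset in place, the three test prints should come out as:

####### TEST1 #########
[{'person_id': 'HZT998', 'name': 'john', 'likes_swimming': False, 'likes_cooking': True}, {'person_id': 'HTY954', 'name': 'peter', 'likes_swimming': True, 'likes_cooking': False}]
####### TEST2 #########
['python', 'sql', 'r', 'c#']
####### TEST3 #########
[{'python': 'HZT998', 'sql': 'HZT998', 'r': 'HZT998'}, {'python': 'HTY954', 'r': 'HTY954', 'c#': 'HTY954'}]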

To avoid duplicate data entries (e.g. if you receive the same file twice), set People.person_id and Skills.skill as primary keys. In this variant the natural values themselves are the keys, so Skills_People can store them directly; a sketch of the constraints is shown below (assuming the tables already exist without conflicting keys).
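
ALTER TABLE People        ADD PRIMARY KEY (person_id);
ALTER TABLE Skills        ADD PRIMARY KEY (skill);
ALTER TABLE Skills_People ADD PRIMARY KEY (person_id, skill_id);

With those keys in place, you just need to build your SQL queries: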

######################
# create SQL queries
######################
print("/* #### SQL-TEST #### */")
# INSERT(s)/UPDATE(s) for the "People" table
for entry in data_for_first_table:
    columns = ", ".join(entry)
    values = ", ".join("'" + str(v) + "'" for v in entry.values())
    updates = ", ".join(k + " = '" + str(v) + "'"
                        for k, v in entry.items() if k != "person_id")
    raw_sql1 = ("INSERT INTO People ( " + columns + " ) VALUES ( " + values +
                " ) ON DUPLICATE KEY UPDATE " + updates + ";")
    print(raw_sql1)

# INSERT(s) only for the "Skills" table
for entry in data_for_second_table:
    raw_sql2 = "INSERT IGNORE INTO Skills ( skill ) VALUES ( '" + entry + "' );"
    print(raw_sql2)

# INSERT(s) only for the "Skills_People" table
for entry in data_for_third_table:
    for skill in entry:
        raw_sql3 = ("INSERT IGNORE INTO Skills_People ( person_id, skill_id ) "
                    "VALUES ( '" + entry[skill] + "', '" + skill + "' );")
        print(raw_sql3)

Note: this only works if the table column names exactly match the list key names. In addition, the keys and values (type = 'CHAR') are joined together with simple Python string operations so that a valid SQL query is generated.

/* #### SQL-TEST #### */
INSERT INTO People ( person_id, name, likes_swimming, likes_cooking ) VALUES ( 'HZT998', 'john', 'False', 'True' ) ON DUPLICATE KEY UPDATE name = 'john', likes_swimming = 'False', likes_cooking = 'True';
INSERT INTO People ( person_id, name, likes_swimming, likes_cooking ) VALUES ( 'HTY954', 'peter', 'True', 'False' ) ON DUPLICATE KEY UPDATE name = 'peter', likes_swimming = 'True', likes_cooking = 'False';
INSERT IGNORE INTO Skills ( skill ) VALUES ( 'python' );
INSERT IGNORE INTO Skills ( skill ) VALUES ( 'sql' );
INSERT IGNORE INTO Skills ( skill ) VALUES ( 'r' );
INSERT IGNORE INTO Skills ( skill ) VALUES ( 'c#' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HZT998', 'python' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HZT998', 'sql' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HZT998', 'r' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HTY954', 'python' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HTY954', 'r' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HTY954', 'c#' );
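
If you prefer the question's original schema with an auto-incremented person_inner_id, note that the database can generate that id for you, so there is no need to compute it in Python. Here is a minimal sketch with parameterized queries (assuming the mysql-connector-python package and the CREATE TABLE statements from the question; host, user, password and database are placeholders), which also avoids the SQL injection risk of plain string concatenation:

import mysql.connector  # assumption: mysql-connector-python is installed

# placeholder credentials - replace with your own
conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cur = conn.cursor()

for person in lst:
    # let MySQL assign person_inner_id (AUTO_INCREMENT); on a re-run the
    # UNIQUE person_id makes this an update instead of a duplicate row
    cur.execute(
        "INSERT INTO People (person_id, name, likes_swimming, likes_cooking) "
        "VALUES (%s, %s, %s, %s) "
        "ON DUPLICATE KEY UPDATE "
        "name = VALUES(name), "
        "likes_swimming = VALUES(likes_swimming), "
        "likes_cooking = VALUES(likes_cooking), "
        "person_inner_id = LAST_INSERT_ID(person_inner_id)",
        (person["person_id"], person["name"],
         person["extras"]["likes_swimming"], person["extras"]["likes_cooking"]))
    person_inner_id = cur.lastrowid  # valid after both insert and update

    for skill in person["skills"]:
        # INSERT IGNORE keeps the Skills table free of duplicates
        cur.execute("INSERT IGNORE INTO Skills (skill) VALUES (%s)", (skill,))
        # resolve the skill name to its skill_id and link it to the person
        cur.execute(
            "INSERT IGNORE INTO Skills_People (person_inner_id, skill_id) "
            "SELECT %s, skill_id FROM Skills WHERE skill = %s",
            (person_inner_id, skill))

conn.commit()
conn.close()

The LAST_INSERT_ID(person_inner_id) trick makes cursor.lastrowid return the existing row's id when the insert turns into an update, so the join-table inserts work for both new and already known people.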
  • Hi, thanks a lot for your answer! But just a general question: doesn't the fact that the person_id column came from an outside API call mean it is bad for data integrity to use person_id as the primary key in my SQL database? That is the reason that in table 1 the primary key is an auto-incremented column called 'person_inner_id' and not the original person_id. However, creating an inner id in Python poses a new challenge, because it needs to be made in Python while taking into account the ids that are already in the database. What is your opinion on that? Thanks again! – Jonathan Oren Nov 20 '22 at 02:22
  • If the record for `person_id = 'HZT998'` has to be updated and you don't use `person_id` as the primary key, you would still have to look up the API value `person_id` to find the correct rows in your database. Update example: `person_id = 'HZT998'` decides today that he likes to swim. If `person_id` is not a key, you will get two entries in your first table: `person_inner_id (PK) = 1 | person_id = HZT998 | ... | likes_swimming = False` and `person_inner_id (PK) = 3 | person_id = HZT998 | ... | likes_swimming = True` – Andreas Hauser Nov 21 '22 at 11:28
  • Otherwise you would have to get your API partner to work with your primary keys (e.g. UUIDs etc.). That means you would have to rely even more on the API partner. – Andreas Hauser Nov 21 '22 at 11:39