
To simplify, I have a list as follows:

lst = [
        {
            "person_id": "HZT998", "name": "john", "skills": ["python", "sql", "r"],
            "extras": {"likes_swimming": False, "likes_cooking": True}},
        {
            "person_id": "HTY954", "name": "peter", "skills": ["python", "r", "c#"],
            "extras": {"likes_swimming": True, "likes_cooking": False}}
      ]

And I want to insert these records into SQL tables as follows:

People table:

person_inner_id (PK) | person_id | name  | likes_swimming | likes_cooking
1                    | HZT998    | john  | False          | True
2                    | HTY954    | peter | True           | False

Skills table:

skill_id (PK) | skill
1             | python
2             | sql
3             | r
4             | c#

Skills_People table:

person_inner_id (FK) | skill_id (FK)
1                    | 1
1                    | 2
1                    | 3
2                    | 1
2                    | 3
2                    | 4

So I want to flatten the inner dictionaries (insert their keys as columns), move the lists out into their own table, and create a relationship table. I also don't want to use the 'person_id' column as my primary key, because I feel it is bad for data integrity to use an externally generated ID as the primary key. However, this makes the Python implementation much harder, and I am not sure how to do it. I will also need to keep making these API calls and inserting their output into the relevant tables.
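
For reference, the schema I am aiming for would look roughly like this (column types are my best guess):

CREATE TABLE People (
    person_inner_id INT AUTO_INCREMENT PRIMARY KEY,  -- internal surrogate key
    person_id       VARCHAR(16) NOT NULL UNIQUE,     -- external API id, unique but not primary
    name            VARCHAR(64),
    likes_swimming  BOOLEAN,
    likes_cooking   BOOLEAN
);

CREATE TABLE Skills (
    skill_id INT AUTO_INCREMENT PRIMARY KEY,
    skill    VARCHAR(32) NOT NULL UNIQUE
);

CREATE TABLE Skills_People (
    person_inner_id INT NOT NULL,
    skill_id        INT NOT NULL,
    PRIMARY KEY (person_inner_id, skill_id),
    FOREIGN KEY (person_inner_id) REFERENCES People (person_inner_id),
    FOREIGN KEY (skill_id)        REFERENCES Skills (skill_id)
);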

I first tried dumping the entire original list to a JSON file:

import json

with open("data.json", "w") as fp:
    json.dump(lst, fp)

and then I tried importing that JSON straight into SQL through the MySQL Workbench table import wizard. This was successful in importing a general schema but not in inserting data; I'm guessing this is because of the nested dictionaries and lists that MySQL doesn't know how to handle.

Thank you!

  • Check this thread for the same issue as in your question: https://stackoverflow.com/questions/4251124/inserting-json-into-mysql-using-python – Oghli Nov 19 '22 at 09:51

1 Answer


Assuming that the keys in the list "lst" (e.g. "person_id" etc.) are always present, you just need to reshape the complex list into normalized lists for these three tables:

lst = [ 
        {
            "person_id": "HZT998", "name": "john", "skills": ["python", "sql", "r"], "extras": {"likes_swimming": False, "likes_cooking": True}
        }, 
        {
            "person_id": "HTY954", "name": "peter", "skills": ["python", "r", "c#"], "extras": {"likes_swimming": True, "likes_cooking": False}
        } 
    ]

data_for_first_table = []    # one dictionary per user, without 'skills'
data_dict_first_table = {}   # working dictionary for the current user
data_for_second_table = []   # every 'skill' value, collected separately
data_for_third_table = []    # one skill-to-user mapping per user
data_dict_third_table = {}   # working dictionary for the current skill-to-user mapping


######################################
# modify complex list for SQL queries
######################################
for entry in lst:
    # reset the working dictionaries for every person, otherwise values
    # from the previous person leak into the next one
    data_dict_first_table = {}
    data_dict_third_table = {}
    for key in entry:
        if key == "skills":
            for skill in entry[key]:
                data_for_second_table.append(skill)
                # third table - "match table": skill -> person_id
                data_dict_third_table[skill] = entry["person_id"]
        elif key == "extras":
            # flatten the inner dictionary into ordinary columns
            for extra in entry[key]:
                data_dict_first_table[extra] = entry[key][extra]
        else:
            data_dict_first_table[key] = entry[key]

    # store the user dictionary (without 'skills')
    data_for_first_table.append(data_dict_first_table.copy())
    # store the skill-to-user mapping
    data_for_third_table.append(data_dict_third_table.copy())
    

### remove duplicates from the 2nd list (dict.fromkeys keeps insertion order) ###
data_for_second_table = list(dict.fromkeys(data_for_second_table))
    
print("####### TEST1 #########")
print(data_for_first_table)
print("####### TEST2 #########")
print(data_for_second_table)
print("####### TEST3 #########")
print(data_for_third_table)
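
With the per-person reset in place, the three test prints should come out as:

####### TEST1 #########
[{'person_id': 'HZT998', 'name': 'john', 'likes_swimming': False, 'likes_cooking': True}, {'person_id': 'HTY954', 'name': 'peter', 'likes_swimming': True, 'likes_cooking': False}]
####### TEST2 #########
['python', 'sql', 'r', 'c#']
####### TEST3 #########
[{'python': 'HZT998', 'sql': 'HZT998', 'r': 'HZT998'}, {'python': 'HTY954', 'r': 'HTY954', 'c#': 'HTY954'}]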

To avoid duplicate data entries (e.g. if you receive the same file twice), set People.person_id and Skills.skill as primary keys. In this variant the natural values themselves are the keys, so Skills_People can store them directly; a sketch of the constraints is shown below (assuming the tables already exist without conflicting keys).
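
ALTER TABLE People        ADD PRIMARY KEY (person_id);
ALTER TABLE Skills        ADD PRIMARY KEY (skill);
ALTER TABLE Skills_People ADD PRIMARY KEY (person_id, skill_id);

With those keys in place, you just need to build your SQL queries: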

######################
# create SQL queries
######################
print("/* #### SQL-TEST #### */")
# INSERT(s)/UPDATE(s) for the "People" table
for entry in data_for_first_table:
    columns = ", ".join(entry)
    values = ", ".join("'" + str(v) + "'" for v in entry.values())
    updates = ", ".join(k + " = '" + str(v) + "'"
                        for k, v in entry.items() if k != "person_id")
    raw_sql1 = ("INSERT INTO People ( " + columns + " ) VALUES ( " + values +
                " ) ON DUPLICATE KEY UPDATE " + updates + ";")
    print(raw_sql1)

# INSERT(s) only for the "Skills" table
for entry in data_for_second_table:
    raw_sql2 = "INSERT IGNORE INTO Skills ( skill ) VALUES ( '" + entry + "' );"
    print(raw_sql2)

# INSERT(s) only for the "Skills_People" table
for entry in data_for_third_table:
    for skill in entry:
        raw_sql3 = ("INSERT IGNORE INTO Skills_People ( person_id, skill_id ) "
                    "VALUES ( '" + entry[skill] + "', '" + skill + "' );")
        print(raw_sql3)

Note: this only works if the table column names exactly match the list key names. In addition, the keys and values (type = 'CHAR') are joined together with simple Python string operations so that a valid SQL query is generated.

/* #### SQL-TEST #### */
INSERT INTO People ( person_id, name, likes_swimming, likes_cooking ) VALUES ( 'HZT998', 'john', 'False', 'True' ) ON DUPLICATE KEY UPDATE name = 'john', likes_swimming = 'False', likes_cooking = 'True';
INSERT INTO People ( person_id, name, likes_swimming, likes_cooking ) VALUES ( 'HTY954', 'peter', 'True', 'False' ) ON DUPLICATE KEY UPDATE name = 'peter', likes_swimming = 'True', likes_cooking = 'False';
INSERT IGNORE INTO Skills ( skill ) VALUES ( 'python' );
INSERT IGNORE INTO Skills ( skill ) VALUES ( 'sql' );
INSERT IGNORE INTO Skills ( skill ) VALUES ( 'r' );
INSERT IGNORE INTO Skills ( skill ) VALUES ( 'c#' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HZT998', 'python' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HZT998', 'sql' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HZT998', 'r' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HTY954', 'python' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HTY954', 'r' );
INSERT IGNORE INTO Skills_People ( person_id, skill_id ) VALUES ( 'HTY954', 'c#' );
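
If you prefer the question's original schema with an auto-incremented person_inner_id, note that the database can generate that id for you, so there is no need to compute it in Python. Here is a minimal sketch with parameterized queries (assuming the mysql-connector-python package and the CREATE TABLE statements from the question; host, user, password and database are placeholders), which also avoids the SQL injection risk of plain string concatenation:

import mysql.connector  # assumption: mysql-connector-python is installed

# placeholder credentials - replace with your own
conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cur = conn.cursor()

for person in lst:
    # let MySQL assign person_inner_id (AUTO_INCREMENT); on a re-run the
    # UNIQUE person_id makes this an update instead of a duplicate row
    cur.execute(
        "INSERT INTO People (person_id, name, likes_swimming, likes_cooking) "
        "VALUES (%s, %s, %s, %s) "
        "ON DUPLICATE KEY UPDATE "
        "name = VALUES(name), "
        "likes_swimming = VALUES(likes_swimming), "
        "likes_cooking = VALUES(likes_cooking), "
        "person_inner_id = LAST_INSERT_ID(person_inner_id)",
        (person["person_id"], person["name"],
         person["extras"]["likes_swimming"], person["extras"]["likes_cooking"]))
    person_inner_id = cur.lastrowid  # valid after both insert and update

    for skill in person["skills"]:
        # INSERT IGNORE keeps the Skills table free of duplicates
        cur.execute("INSERT IGNORE INTO Skills (skill) VALUES (%s)", (skill,))
        # resolve the skill name to its skill_id and link it to the person
        cur.execute(
            "INSERT IGNORE INTO Skills_People (person_inner_id, skill_id) "
            "SELECT %s, skill_id FROM Skills WHERE skill = %s",
            (person_inner_id, skill))

conn.commit()
conn.close()

The LAST_INSERT_ID(person_inner_id) trick makes cursor.lastrowid return the existing row's id when the insert turns into an update, so the join-table inserts work for both new and already known people.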
  • Hi, thanks a lot for your answer! But just a general question: doesn't the fact that the person_id column came from an outside API call mean it is bad for data integrity to use person_id as the primary key in my SQL database? That is the reason that in table 1 the primary key is an auto-incremented column called 'person_inner_id' and not the original person_id. However, creating an inner id in Python poses a new challenge, because it needs to be made in Python while taking into account the ids that are already in the database. What is your opinion on that? Thanks again! – Jonathan Oren Nov 20 '22 at 02:22
  • If the record for `person_id = 'HZT998'` has to be updated and you don't use `person_id` as the primary key, you would still have to look up the API value `person_id` to find the correct rows in your database. Update example: `person_id = 'HZT998'` decides today that he likes to swim. If `person_id` is not a key, you will get two entries in your first table: `person_inner_id (PK) = 1 | person_id = HZT998 | ... | likes_swimming = False` and `person_inner_id (PK) = 3 | person_id = HZT998 | ... | likes_swimming = True` – Andreas Hauser Nov 21 '22 at 11:28
  • Otherwise you would have to get your API partner to work with your primary keys (e.g. UUIDs etc.). That means you would have to rely even more on the API partner. – Andreas Hauser Nov 21 '22 at 11:39