0

I am currently working on a project to build a database on professor's research paper database. This is my first time building a database(never had experience with MYSQL) and I am learning as I am doing it.

I was able to use an api to get the data, for example:

{"authorId": "1773022", "url": "https://www.semanticscholar.org/author/1773022", "papers": [{"paperId": "1253d2704580a74e776ae211602cfde71532c057", "title": "Nonlinear Schrodinger Kernel for hardware acceleration of machine learning"}, {"paperId": "71f49f1e3ccb2e92d606db9b3db66c669a163bb6", "title": "Task-Driven Learning of Spatial Combinations of Visual Features"}, {"paperId": "bb35ae8a50de54c9ca29fbdf1ea2fbbb4e8c4662", "title": "Statistical Learning of Visual Feature Hierarchies"}]}

How would I use python to turn this into a table so I can use it to build my database?

I am trying to make a table where columns are: Paper ID|Title|

Sang Park
  • 3
  • 1

2 Answers2

2

From https://www.w3schools.com/python/python_mysql_getstarted.asp


Installation

python -m pip install mysql-connector-python

Overview

Create a connection:

import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="yourusername",
  password="yourpassword"
)

Create a cursor to interact with the connection you made, then create a Database:

mycursor = mydb.cursor()

mycursor.execute("CREATE DATABASE mydatabase")

mydb.close()

After the database has been created, you can start connecting with it like so:

mydb = mysql.connector.connect(
  host="localhost",
  user="yourusername",
  password="yourpassword",
  database="mydatabase"
)

Note that you don't have to actually close your connection and reopen it to connect to that database, yet I don't see the docs mentioning anything about interacting with that specific database after you've created it, so I'm going to close it after I create the database...

Create the table with the proper datatypes and constraints:

mycursor.execute("CREATE TABLE customers (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255), address VARCHAR(255))")

Then insert into it and commit the entries:

sql = "INSERT INTO customers (name, address) VALUES (%s, %s)"
val = ("John", "Highway 21")
mycursor.execute(sql, val)

mydb.commit()

Close the connection:

mydb.close()

Your Specific Case

This is the data you have provided:

data = {
    "authorId": "1773022",
    "url": "https://www.semanticscholar.org/author/1773022",
    "papers": [
        {
            "paperId": "1253d2704580a74e776ae211602cfde71532c057",
            "title": "Nonlinear Schrodinger Kernel for hardware acceleration of machine learning"
        }, 
        {
            "paperId": "71f49f1e3ccb2e92d606db9b3db66c669a163bb6",
            "title": "Task-Driven Learning of Spatial Combinations of Visual Features"
        },
        {
            "paperId": "bb35ae8a50de54c9ca29fbdf1ea2fbbb4e8c4662",
            "title": "Statistical Learning of Visual Feature Hierarchies"
        }
    ]
}

Granted that I don't know all the details, but based on the data given, I'd assume that you would want a table for:

  1. Authors - id (auto increment pk), authorId (varchar), url (varchar),
  2. Papers - id (auto increment pk), authorId (varchar fk), paperId, title (varchar)

Modify this as you please:

import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="yourusername",
  password="yourpassword",
  database="mydatabase"
)

mycursor = mydb.cursor()

# create the tables if they don't exist
authors = False
papers = False
mycursor.execute("SHOW TABLES")

for e in mycursor.fetchall():
    if "authors" == e[0]:
        authors = True
    if "papers" == e[0]:
        papers = True

if not authors:
    mycursor.execute("CREATE TABLE authors (id INT AUTO_INCREMENT PRIMARY KEY, authorId VARCHAR(255), url VARCHAR(255))")

if not papers:
    mycursor.execute("CREATE TABLE papers (id INT AUTO_INCREMENT PRIMARY KEY, authorId VARCHAR(255), paperId VARCHAR(255), title VARCHAR(255))")

# insert into them - recall that "data" is what gets returned by your api. I am assuming 1 entry.
a = data["authorId"]
u = data["url"]
ps = data["papers"]

# I am going to check if the authorId already exists.
mycursor.execute("SELECT * FROM authors WHERE authorId = '" + a + "'")
as = mycursor.fetchall()
if len(as) == 0:
    sql = "INSERT INTO authors (authorId, url) VALUES (%s, %s)"
    val = (a, u)
    mycursor.execute(sql, val)
    mydb.commit()

# for each paper in papers
for p in ps:
    # Consider this: Suppose you had multiple authors for 1 paper. Provided that they are entered into the authors table (which they should be), you can add an entry to the papers table for each author.
    # check to see if the paper is already in the database with the specific author, paperId, and title
    mycursor.execute("SELECT * FROM papers WHERE authorId = '" + a + "' AND paperId = '" + p["paperId"] + "' AND title = '" + p["title"] + "'")
    pc = mycursor.fetchall()
    # if they are no entries, insert it
    if len(pc) == 0:
        sql = "INSERT INTO papers (authorId, paperId, title) VALUES (%s, %s, %s)"
        val = (a, p["paperId"], p["title"])
        mycursor.execute(sql, val)
        mydb.commit()

mydb.close()

Side note:

I have not tested this. I will be amazed myself if this works first try. If you come across any errors, I can do my best to trouble shoot.


Pandas Dataframe .to_sql Method

Just in case this thread blows up with a bunch of answers, this answer on this thread uses a nice method from the pandas dataframe library and instead of plagiarizing it, I'd rather give credit where credit is due. I'd also like to include the docs and this detailed answer. It's probably a lot easier and is going to be a lot closer to an ORM (as I've mentioned below here) than my crude attempt to make one.


My Crude Attempt at an ORM

Once again, I have not tested this. It may not work, but the idea is there. Please check out this article: What are some good Python ORM solutions?

Save this as Models.py

import mysql.connector
import os


# --------------------------------------------------
# General Model
# --------------------------------------------------

# table columns is a dictionary that looks like:
# table_columns = {"column name": "constaints"}
# ex: table_columns = {"authorId": "VARCHAR(255)", "next col": "VARCHAR(255)"}, etc.
# data is the same as table_columns, but contains the data to be added for 1 row
class Model:
    def __init__(self, table_columns={}, table_name=None, data=None, host=os.environ["mysqlhost"], user=os.environ["mysqluser"], password=os.environ["mysqlpassword"], database=os.environ["database"]):
        self.db = None
        self.cursor = None
        self.table_name = table_name
        self.table_columns = table_columns
        self.data = data
        if host is not None and user is not None and password is not None and database is not None:
            try:
                self.db = mysql.connector.connect(
                    host=host,
                    user=user,
                    password=password,
                    database=database
                )
            except Exception as e:
                print(e)
                #print("One or more credentials were incorrect! Could not connect to you database!")
        if self.db is not None:
            self.cursor = self.db.cursor()
            if table_name is not None:
                self.cursor.execute("SHOW TABLES;")
                for e in self.cursor.fetchall():
                    if e[0] != self.table_name:
                        pk = [[e, self.table_columns[e]] for e in self.table_columns if "primary key" in e.lower()]
                        if len(pk) == 1:
                            pk = " ".join(pk) + ", "
                            del self.table_columns[pk[0]]
                        else:
                            pk = ""
                        try:
                            table_string = 'CREATE TABLE "' + self.table_name + '"(' + pk + ", ".join([" ".join(['"' + "_".join(c.split()) + '"', self.table_columns[c].upper()]) for c in self.table_columns]) + ');'
                            self.cursor.execute(table_string)
                            print("Created table with name: " + self.table_name)
                        except Exception as e:
                            self.db.rollback()
                            print(e)

    def insert(self):
        if self.data is not None:
            pkname = ""
            try:
                self.cursor.execute('SHOW KEYS FROM "(%s)" WHERE Key_name = (%s);', (self.table_name, 'PRIMARY'))
                pkname = self.cursor.fetchall()[0]
                if pkname in self.table_columns and pkname not in self.data:
                    del self.table_columns[pkname]
                elif pkname not in self.table_columns and pkname in self.data:
                    del  self.table_columns[pkname]
            except Exception as e:
                print("Could not get primary key name!")
                print(e)
            try:
                self.cursor.execute('SHOW COLUMNS FROM "' + self.table_name + '";')
                self.table_columns = {e: "" for e in self.cursor.fetchall()}
            except Exception as e:
                self.db.rollback()
                print("Could not find table with name " + self.table_name)
                print(e)
            flag = True
            for e in self.data:
                if e not in self.table_columns:
                    flag = False
            if flag:
                if len(self.data) == len(self.table_columns):
                    col = ["'" + e + "'" if e[0] != "'" and e[-1] !="'" else e for e in self.data]
                    data = [self.data[e] for e in self.data]
                    sql = "INSERT INTO %s (%s) VALUES (%s)"
                    val = ('"' + self.table_name + '"', ", ".join(col), ", ".join(data))
                    try:
                        self.cursor.execute(sql, val)
                        self.save()
                    except Exception as e:
                        print("Could not insert into " + self.table_name)
                        print(e)
            else:
                print("Found unexpected data. Try an insert or update query.")

    def save(self):
        committed = False
        try:
            self.db.commit()
            committed = True
        except Exception as e:
            self.db.rollback()
            print(e)
        if committed:
            self.db.close()
            self.db = None
            self.cursor = None

Usage:

import os
import Model as m

os.environ["mysqlhost"] = "host"
os.environ["mysqluser"] = "mysqluser"
os.environ["mysqlpassword"] = "password"
os.environ["database"] = "database"

# Presumably, this is exactly what your API returns.
data = {
    "authorId": "118985833",
    "url": "semanticscholar.org/author/118985833",
    "papers": [
        {
            "paperId": "0b698cd2269d9b52c75dbdf1994dbc9a98fb16c8",
            "title": "The development of self-care workshop to prevent vicarious traumatization in training therapists working with sexually exploited children"
        }
    ]
}

data_for_author = {"authorId": data["authorId"], "url": data["url"]}

# table_name=None
# table_columns={}
# data=None
model = m.Model(table_name="authors", data=data_for_author)
model.insert()

This may not even work - but I did it to give you some hope. There are easier ways to insert data into your database. You need to look into ORMs - the django (a webserver library for python) library has one native to it, that makes it super easy to manage your database.

Shmack
  • 1,933
  • 2
  • 18
  • 23
  • It worked!! Thank you so much! I had to add data = {...} and everything worked. I'm trying to add more professors info to the table; I tried it by just changing the data and it wouldn't add other info. How do I add more professors into the table? For example: I want to add {"authorId": "118985833", "url": "https://www.semanticscholar.org/author/118985833", "papers": [{"paperId": "0b698cd2269d9b52c75dbdf1994dbc9a98fb16c8", "title": "The development of self-care workshop to prevent vicarious traumatization in training therapists working with sexually exploited children"}]} to the table. – Sang Park Feb 18 '22 at 22:17
  • It worked first try!?!? Wow, I'm as impressed as you are! So if you are 100% certain it did - like it shows the entries in the db, then you should be able to change your data variable to have the same exact structure as the other, run the code, and save new data to the database. Django is a python web server library that has some very nice characteristics about it, one being that it has a easy built in way of dealing with databases through its "models" native to it. I will try to sit down and pump out some code to give you an easier way to interact with your db real quick. – Shmack Feb 19 '22 at 07:05
  • It may not be adding data into the table because it might be trying to add the tables again and then failing... To remedy that, for now you can add a mydb.rollback() before the `mycursor.execute("SHOW TABLES")` and comment everything out from there to the line just after the `if not papers:`, then give that a go. – Shmack Feb 19 '22 at 08:17
  • @SangPark please check the updated answer. – Shmack Feb 20 '22 at 21:09
  • Hey thank you so much! I was so caught up with other stuff that I wasn't able to check it past three days. Unfortunately the updated did not work but your comment way worked! Thank you so much! I was able to save so much time!! – Sang Park Feb 22 '22 at 06:30
  • @SangPark You're very welcome. I'd appreciate it if you could mark this answer as the correct answer, upvote it, or both. In the end, an ORM is going to be the best solution. When I get more time, I will provide a solution with one - I'm working on my own projects as well :) . – Shmack Feb 22 '22 at 06:35
0

Firstly download MySQL and run the below SQL on the MySQL database to create your MYSQL table

CREATE TABLE papers
(
paperId varchar(1024),
title varchar(4000)
);

Then below py code, connects to your MySQL database and inserts the records. Just modify the host,user,password,database to your database. Sorry don't have MySQL to test it but it should work. I used pandas to convert the API list to a dataframe so its easier to work with when inserting into the database table, hope it helps

import pandas as pd
import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="yourusername",
  password="yourpassword",
  database="mydatabase"
)

mycursor = mydb.cursor()

data = {"authorId":"1773022","url":"https://www.semanticscholar.org/author/1773022","papers":[{"paperId": "1253d2704580a74e776ae211602cfde71532c057", "title": "Nonlinear Schrodinger Kernel for hardware acceleration of machine learning"},{"paperId": "71f49f1e3ccb2e92d606db9b3db66c669a163bb6", "title": "Task-Driven Learning of Spatial Combinations of Visual Features"},{"paperId": "bb35ae8a50de54c9ca29fbdf1ea2fbbb4e8c4662", "title": "Statistical Learning of Visual Feature Hierarchies"}]}
df = pd.DataFrame(data)
papersdf = pd.DataFrame(df['papers'].tolist())

for index, row in papersdf.iterrows():
    sql = print('INSERT INTO papers(paperId,title) VALUES("' + row['paperId'] + '","' + row['title'] + '")')
    mycursor.execute(sql)
    mydb.commit()
    print(mycursor.rowcount, "record inserted.")
Tikkaty
  • 772
  • 1
  • 8
  • 24