
Man, was I having trouble with how to word the title.

  1. Summary: For a database project at university, we have to import 1 million rows of data into a database, where each row represents an article scraped from the internet. One of the columns in this data is the author of the article. Since many articles were written by the same author, I wanted to create a table, separate from the articles, that links each unique author to a primary key, and then replace the author string in the articles table with that author's key from the new table. What is the most efficient way to do this, and can it be done so that the output is deterministic, i.e. a specific author string ALWAYS maps to the same key, no matter the order the article rows "come in" when the table is created?

    1. What I've done: The way I did it was (in Python, using pandas) to go through all 1 million article rows and build a list of the unique authors I found. I then created a dictionary from this list (sorted), used the dictionary to replace the author string in the articles table with the key corresponding to each author, and finally used the dictionary to create my authors table. However, as I see it, if a row with a previously unseen author were later inserted into my data, it could upset the alphabetical order in which my method adds authors to the dictionary, making it not so deterministic. So, what do people normally do in these situations? Can SQL, working directly on the 1 million articles, create a new authors table with unique authors and keys and replace the author string in the articles table? Or could it be an idea to use hashing with a specific hash function, so that a certain string always maps to a certain key? (Rough sketches of both ideas follow the code below.)

    2. Show some code:

import pandas as pd
import psycopg2

def get_authors_dict():

    authors_lists = []

    df = pd.read_csv("1mio-raw.csv", usecols=['authors'], low_memory=True)

    unique_authors_list = df['authors'].unique()

    num_of_authors = len(unique_authors_list)

    # Assign an increasing integer id to each author
    authors_dict = {}
    i = 0
    prog = 0
    for author in unique_authors_list:
        try: 
            authors_dict[author]
            i += 1
        except KeyError:
            authors_dict[author] = i
            i += 1

        print(prog / num_of_authors * 100, "%")
        prog += 1

    return authors_dict

authors_dict = get_authors_dict()

col1_author_id = list(authors_dict.values())
col2_author_name = list(authors_dict.keys())

data_dict = {'col1': col1_author_id,
             'col2': col2_author_name}

df = pd.DataFrame(data=data_dict, columns=['col1', 'col2'])

df.to_csv('author.csv', index=False, header=False, sep="~")

f = open('author.csv', encoding="utf8")

conn = psycopg2.connect(--------)
cur = conn.cursor()
cur.copy_from(f, 'author', sep='~')  # bulk-load author.csv into the author table
conn.commit()
cur.close()

# Processing all 1 million rows again in a separate file
# and making changes to the dataframe using the dict:
sample_data['authors'] = sample_data['authors'].map(authors_dict)
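
Rough, untested sketch of the hashing idea: derive the key from the author string itself, so it cannot depend on the order the rows come in. I realise collisions are theoretically possible, but with 64-bit keys they should be very unlikely for ~1 million authors.

import hashlib
import pandas as pd

def author_key(author):
    # sha256 is stable across runs and machines, unlike Python's built-in
    # hash(), which is randomised per process.
    digest = hashlib.sha256(author.encode("utf8")).digest()
    # Keep the first 8 bytes so the key fits in a PostgreSQL BIGINT.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

df = pd.read_csv("1mio-raw.csv", usecols=['authors'], low_memory=True)
authors_dict = {a: author_key(a) for a in df['authors'].dropna().unique()}

And this is roughly what I imagine the pure SQL route could look like, assuming the raw rows were first loaded into an articles table that has a text column authors (the table and column names here are just guesses for illustration, and the connection string is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
cur = conn.cursor()

# One row per distinct author; the id comes from a sequence.
cur.execute("""
    CREATE TABLE author (
        id   SERIAL PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    );
""")

# Fill it from the articles already in the database. The ORDER BY should
# make the id assignment reproducible for a given set of authors, but a
# newly appearing author would still just get the next free id.
cur.execute("""
    INSERT INTO author (name)
    SELECT DISTINCT authors
    FROM articles
    WHERE authors IS NOT NULL
    ORDER BY authors;
""")

# Swap the author string for a foreign key to the new table.
cur.execute("ALTER TABLE articles ADD COLUMN author_id INTEGER REFERENCES author(id);")
cur.execute("""
    UPDATE articles
    SET author_id = author.id
    FROM author
    WHERE articles.authors = author.name;
""")
cur.execute("ALTER TABLE articles DROP COLUMN authors;")

conn.commit()
cur.close()
conn.close()
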
pydrink

1 Answer


So if I understand you correctly, you want to create a SQL table which connects authors to articles. Your problem is that you do not know what primary key you should use in such a table, since an author might have written more than one article.

In this case, instead of trying to do something clever, I would just use a composite primary key for your table. This means you define the author column, in combination with the title/publishing date/identifier of the article, as the primary key for the table. Thus, each row of your table has a unique identifier (as long as no author has written two identical articles). This is independent of your Python code, as it needs to be defined in the database. This question might help you to define a composite primary key; a rough example of such a table definition follows below.
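
Just to illustrate the idea (the column names here are made up, and the connection string is a placeholder), the definition could look something like this:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
cur = conn.cursor()

# The combination of author and article identifies a row, so it serves
# as the (composite) primary key.
cur.execute("""
    CREATE TABLE article_author (
        author_name   TEXT NOT NULL,
        article_title TEXT NOT NULL,
        published_on  DATE NOT NULL,
        PRIMARY KEY (author_name, article_title, published_on)
    );
""")

conn.commit()
cur.close()
conn.close()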

SilverMonkey