Man, was I having trouble with how to word the title.
Summary: For a database project at uni, we have to import 1 million rows of data into a database, where each row represents an article scraped from the internet. One of the columns in this data is the author of the article. Since many articles were written by the same author, I wanted to create a separate table that links each unique author to a primary key, and then replace the author string in the articles table with that author's key. What is the most efficient way to do this, and can it be done so that the output is deterministic, i.e. a specific author string ALWAYS maps to the same pkey, no matter the order the article rows "come in" when this method creates the table?
What I've done: In Python (using Pandas), I went through all 1 million article rows and built a list of the unique authors I found. I then created a dictionary from this list (sorted), used the dictionary to replace each author string in the articles table with the key for that author, and finally used the same dict to create my authors table. The problem, as I see it: if a row with a previously unseen author is later added to the data, it could shift the alphabetical order my method assigns keys in, so the mapping isn't really deterministic. So, what do people normally do in these situations? Can SQL build a new authors table with unique authors and keys directly from the 1 million article rows, and replace the author string in the articles table? Or could it be an idea to use hashing with a fixed hash function, so a given string always maps to the same key?
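To make the hashing idea concrete, something like this is what I'm picturing (just a sketch with a made-up helper name, not something I've actually wired in):

import hashlib

def author_key(author):
    # Derive the key purely from the author string itself, so the same
    # string always maps to the same number no matter the row order.
    digest = hashlib.sha256(author.encode("utf-8")).digest()
    # Keep 63 bits so the value fits in a signed 64-bit / Postgres bigint.
    return int.from_bytes(digest[:8], "big") >> 1

author_key("Jane Doe")  # same value on every run, regardless of row order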
Show some code:
import pandas as pd
import psycopg2

def get_authors_dict():
    # Read only the author column to keep memory usage down.
    df = pd.read_csv("1mio-raw.csv", usecols=['authors'], low_memory=True)
    unique_authors_list = df['authors'].unique()
    num_of_authors = len(unique_authors_list)
    authors_dict = {}
    i = 0
    prog = 0
    for author in unique_authors_list:
        # Assign the next running id to each author not seen before.
        if author not in authors_dict:
            authors_dict[author] = i
        i += 1
        print(prog / num_of_authors * 100, "%")
        prog += 1
    return authors_dict
authors_dict = get_authors_dict()

# Turn the dict into a two-column (id, name) DataFrame for the author table.
col1_author_id = list(authors_dict.values())
col2_author_name = list(authors_dict.keys())
data_dict = {'col1': col1_author_id,
             'col2': col2_author_name}
df = pd.DataFrame(data=data_dict, columns=['col1', 'col2'])
df.to_csv('author.csv', index=False, header=False, sep="~")

# Bulk-load the CSV into the author table with COPY.
f = open('author.csv', encoding="utf8")
conn = psycopg2.connect(--------)  # connection details omitted
cur = conn.cursor()
cur.copy_from(f, 'author', sep='~')
conn.commit()
cur.close()
# Processing all 1 million rows again in a separate file
# and making changes to the dataframe using the dict:
sample_data['authors'] = sample_data['authors'].map(authors_dict)
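And for the "do it directly in SQL" part of the question, this is roughly what I imagine it would look like (again just a sketch: the articles table, column names and connection string are made up, and it assumes the raw rows are already loaded):

import psycopg2

conn = psycopg2.connect("dbname=articles_db")  # placeholder connection string
cur = conn.cursor()

# Build the author table by numbering the distinct author strings in
# alphabetical order, so the same set of authors always gets the same ids.
cur.execute("""
    CREATE TABLE author AS
    SELECT row_number() OVER (ORDER BY authors) AS id,
           authors AS name
    FROM (SELECT DISTINCT authors FROM articles) AS distinct_authors;
""")

# Add the key column to the articles table and fill it by joining on the string.
cur.execute("ALTER TABLE articles ADD COLUMN author_id bigint;")
cur.execute("""
    UPDATE articles AS a
    SET author_id = au.id
    FROM author AS au
    WHERE a.authors = au.name;
""")

# The original string column can then be dropped.
cur.execute("ALTER TABLE articles DROP COLUMN authors;")

conn.commit()
cur.close()
conn.close()

As far as I can tell this would be deterministic for a fixed set of authors, but a new author showing up later would still shift the alphabetical numbering, which is the same issue my Pandas version has.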